Difference between revisions of "Peregrine Usage"
m (added subprojects information) | m (added sample program)
Revision as of 02:32, 20 March 2010
Peregrine usage examples
Using Peregrine as a library via the Java API
- Use this Maven repository to make Maven aware of the Peregrine-related artifacts:
  <repositories>
    <repository>
      <id>sara-artifactory-server-id</id>
      <name>SARA Artifactory - Peregrine Maven releases</name>
      <url>http://ws1.grid.sara.nl:21501/artifactory/libs-releases/</url>
    </repository>
  </repositories>
- The packages that you must include in your project are:
  - peregrine-api (includes ontology-api and common-utils)
  - peregrine-normalizer
  - peregrine-tokenizer
- Decide which ontology provider to use. There are several options[1]:
  - File source ontology (ontology-impl-file)
  - DB source ontology (ontology-impl-db)
- Decide whether you need to disambiguate the text indexing results (usually you do). If so, include the peregrine-disambiguator project. The disambiguator layer can be used out of the box, without special configuration.
- Decide which implementation of the Peregrine interface to use. At the time of writing, only one implementation is available: peregrine-impl-hash.
Sample pom.xml configuration
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>mygroup</groupId>
  <artifactId>peregrine-client</artifactId>
  <packaging>jar</packaging>
  <version>0.2-SNAPSHOT</version>
  <name>Peregrine Sample Client</name>
  <inceptionYear>2009</inceptionYear>

  <repositories>
    <repository>
      <id>sara-artifactory-server-id</id>
      <url>http://ws1.grid.sara.nl:21501/artifactory/libs-releases</url>
    </repository>
  </repositories>

  <dependencies>
    <dependency>
      <groupId>org.erasmusmc.data-mining.peregrine</groupId>
      <artifactId>peregrine-api</artifactId>
      <version>0.2-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>org.erasmusmc.data-mining.peregrine</groupId>
      <artifactId>peregrine-normalizer</artifactId>
      <version>0.2-SNAPSHOT</version>
      <scope>runtime</scope>
    </dependency>
    <dependency>
      <groupId>org.erasmusmc.data-mining.peregrine</groupId>
      <artifactId>peregrine-tokenizer</artifactId>
      <version>0.2-SNAPSHOT</version>
      <scope>runtime</scope>
    </dependency>
    <dependency>
      <groupId>org.erasmusmc.data-mining.peregrine</groupId>
      <artifactId>peregrine-disambiguator</artifactId>
      <version>0.2-SNAPSHOT</version>
      <scope>runtime</scope>
    </dependency>
    <dependency>
      <groupId>org.erasmusmc.data-mining.peregrine</groupId>
      <artifactId>peregrine-impl-hash</artifactId>
      <version>0.2-SNAPSHOT</version>
      <scope>runtime</scope>
    </dependency>
    <dependency>
      <groupId>org.erasmusmc.data-mining.ontology</groupId>
      <artifactId>ontology-impl-db</artifactId>
      <version>0.2-SNAPSHOT</version>
      <scope>runtime</scope>
    </dependency>
  </dependencies>
</project>
Sample ontology-impl-file configuration
<?xml version="1.0"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.5.xsd">

  <bean name="ontology" class="org.erasmusmc.data_mining.ontology.impl.file.SingleFileOntologyImpl">
    <constructor-arg value="file:/home/user/ontology_data.txt" />
  </bean>
</beans>
Sample ontology-impl-db configuration
<?xml version="1.0"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:tx="http://www.springframework.org/schema/tx"
       xsi:schemaLocation="
         http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.5.xsd
         http://www.springframework.org/schema/tx http://www.springframework.org/schema/tx/spring-tx-2.5.xsd">

  <bean id="ontology" class="org.erasmusmc.data_mining.ontology.impl.db.DBOntologyImpl" lazy-init="true">
    <constructor-arg>
      <bean class="org.springframework.jdbc.core.simple.SimpleJdbcTemplate">
        <constructor-arg ref="ontologyDataSource" />
      </bean>
    </constructor-arg>
  </bean>

  <bean id="ontologyDataSource" class="org.apache.commons.dbcp.BasicDataSource"
        destroy-method="close" scope="singleton" lazy-init="true">
    <property name="driverClassName" value="com.mysql.jdbc.Driver" />
    <property name="url" value="jdbc:mysql://myserver:3306/mydatabase?autoReconnect=true" />
    <property name="username" value="dbuser" />
    <property name="password" value="dbpass" />
    <property name="validationQuery" value="select 1" />
  </bean>

  <bean id="txManager" class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
    <property name="dataSource" ref="ontologyDataSource"/>
  </bean>

  <tx:annotation-driven transaction-manager="txManager" />
</beans>
Sample peregrine-impl-hash configuration
<?xml version="1.0"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.5.xsd">

  <bean name="thresholdDisambiguationDecisionMaker"
        class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.ThresholdDisambiguationDecisionMakerImpl">
    <property name="disambiguationMinimalWeight" value="50" />
    <property name="disambiguationAlwaysAcceptedWeight" value="80" />
  </bean>

  <bean name="ruleDisambiguator" class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.DisambiguatorImpl">
    <constructor-arg>
      <list>
        <ref local="looseDisambiguator" />
        <ref local="strictDisambiguator" />
      </list>
    </constructor-arg>
  </bean>

  <bean id="looseDisambiguator" class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.LooseDisambiguator">
    <constructor-arg>
      <bean class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule.IsHomonymRule" />
    </constructor-arg>
    <constructor-arg>
      <bean class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule.IsPreferredTermRule" />
    </constructor-arg>
    <constructor-arg>
      <bean class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule.HasSynonymRule">
        <property name="maxSynonymDistance" value="40" />
        <property name="minSynonymWeight" value="75" />
        <property name="maxSynonymWeight" value="80" />
      </bean>
    </constructor-arg>
  </bean>

  <bean id="strictDisambiguator" class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.StrictDisambiguator">
    <constructor-arg>
      <bean class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule.IsHomonymRule" />
    </constructor-arg>
    <constructor-arg>
      <bean class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule.IsPreferredTermRule" />
    </constructor-arg>
    <constructor-arg>
      <bean class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule.IsComplexRule">
        <property name="maxTermLength" value="6" />
        <property name="minTermLength" value="3" />
        <property name="minTermNumbers" value="1" />
        <property name="minTermLetters" value="1" />
      </bean>
    </constructor-arg>
    <constructor-arg>
      <bean class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule.HasSynonymRule">
        <property name="maxSynonymDistance" value="40" />
        <property name="minSynonymWeight" value="75" />
        <property name="maxSynonymWeight" value="80" />
      </bean>
    </constructor-arg>
    <constructor-arg>
      <bean class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule.HasKeywordRule">
        <property name="maxKeywordDistance" value="300" />
        <property name="minKeywordWeight" value="75" />
        <property name="maxKeywordWeight" value="80" />
      </bean>
    </constructor-arg>
  </bean>

  <bean name="peregrine" class="org.erasmusmc.data_mining.peregrine.impl.hash.PeregrineImpl">
    <constructor-arg ref="ontology" />
    <constructor-arg>
      <bean class="org.erasmusmc.data_mining.peregrine.tokenizer.impl.SubSentenceTokenizer" />
    </constructor-arg>
    <constructor-arg>
      <bean class="org.erasmusmc.data_mining.peregrine.normalizer.impl.LVGNormalizer" />
    </constructor-arg>
    <constructor-arg ref="ruleDisambiguator" />
    <constructor-arg ref="thresholdDisambiguationDecisionMaker" />
  </bean>
</beans>
Using one of the existing Peregrine projects
peregrine-ws is an example of exposing the Peregrine interface as a web service. This is done using JAX-WS technology. JAX-WS provides a built-in servlet that serves the registered HTTP endpoints.
peregrine-rmi is an example of how the Peregrine interface can be exposed over RMI using the built-in facilities of the Spring framework. If all RMI-related code is removed, what remains is an example of a command-line Peregrine utility that can, for example, take a file as an argument, index it, and write the results to another file.
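The general shape of such a command-line utility can be sketched with the standard library alone. This is not the code of peregrine-rmi: the class FileIndexingSketch, the method toyIndexer, and the tab-separated output format are made up for illustration, and the Function parameter merely stands in for the place where a real Peregrine instance would be called.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class FileIndexingSketch {

    /** Reads inputFile, hands its text to the indexer, and writes one result per line to outputFile. */
    static void indexFile(Path inputFile, Path outputFile,
                          Function<String, List<String>> indexer) throws IOException {
        String text = new String(Files.readAllBytes(inputFile), StandardCharsets.UTF_8);
        List<String> resultLines = indexer.apply(text);
        Files.write(outputFile, resultLines, StandardCharsets.UTF_8);
    }

    /** Hypothetical stand-in for Peregrine: reports every occurrence of the word "malaria". */
    static List<String> toyIndexer(String text) {
        String term = "malaria";
        List<String> found = new ArrayList<>();
        for (int i = text.indexOf(term); i >= 0; i = text.indexOf(term, i + 1)) {
            // Output columns: matched term, start offset, exclusive end offset.
            found.add(term + "\t" + i + "\t" + (i + term.length()));
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: FileIndexingSketch <input file> <output file>");
            return;
        }
        indexFile(Paths.get(args[0]), Paths.get(args[1]), FileIndexingSketch::toyIndexer);
    }
}
```

In a real utility the toyIndexer argument would be replaced by a function that calls the index method of a configured Peregrine instance and formats each IndexingResult as a line.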
peregrine-client is an example of using Peregrine as a library in a web application. It provides a simple JSP presentation of the Peregrine indexing results. Because a JSP is also a servlet, it can produce XML instead of HTML (and so act as a REST service).
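To illustrate the XML variant, the following sketch serializes a list of matches with the StAX writer from the Java standard library. It is not taken from peregrine-client: the Hit class, the element and attribute names, and the concept id 42 are invented for this example, and the end offset is assumed to be exclusive.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class XmlResultsSketch {

    /** Hypothetical value holder; it only mimics the fields used on this page. */
    static final class Hit {
        final int conceptId;
        final int startPos;
        final int endPos; // assumed to be exclusive

        Hit(int conceptId, int startPos, int endPos) {
            this.conceptId = conceptId;
            this.startPos = startPos;
            this.endPos = endPos;
        }
    }

    /** Serializes the hits as XML; the element and attribute names are made up. */
    static String toXml(String text, List<Hit> hits) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("results");
        for (Hit hit : hits) {
            xml.writeStartElement("result");
            xml.writeAttribute("conceptId", String.valueOf(hit.conceptId));
            xml.writeAttribute("start", String.valueOf(hit.startPos));
            xml.writeAttribute("end", String.valueOf(hit.endPos));
            // writeCharacters escapes markup characters such as < and &.
            xml.writeCharacters(text.substring(hit.startPos, hit.endPos));
            xml.writeEndElement();
        }
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String text = "I have super text with words like malaria and water";
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(42, 34, 41)); // 42 is an invented concept id
        System.out.println(toXml(text, hits));
    }
}
```

A JSP or servlet would write the same string to the response with the content type set to an XML media type.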
Not using Maven & Spring
Download the release package (http://ws1.grid.sara.nl:21501/artifactory/libs-releases/org/erasmusmc/data-mining/release/), which contains all necessary libraries. After adding the dependencies you need to your project, you can program a simple scenario:
import java.util.List;

import org.erasmusmc.data_mining.ontology.api.Language;
import org.erasmusmc.data_mining.ontology.api.Ontology;
import org.erasmusmc.data_mining.ontology.impl.file.SingleFileOntologyImpl;
import org.erasmusmc.data_mining.peregrine.api.IndexingResult;
import org.erasmusmc.data_mining.peregrine.api.Peregrine;
import org.erasmusmc.data_mining.peregrine.disambiguator.api.DisambiguationDecisionMaker;
import org.erasmusmc.data_mining.peregrine.disambiguator.api.Disambiguator;
import org.erasmusmc.data_mining.peregrine.disambiguator.api.RuleDisambiguator;
import org.erasmusmc.data_mining.peregrine.disambiguator.impl.DisambiguatorImpl;
import org.erasmusmc.data_mining.peregrine.disambiguator.impl.LooseDisambiguator;
import org.erasmusmc.data_mining.peregrine.disambiguator.impl.StrictDisambiguator;
import org.erasmusmc.data_mining.peregrine.disambiguator.impl.ThresholdDisambiguationDecisionMakerImpl;
import org.erasmusmc.data_mining.peregrine.impl.hash.PeregrineImpl;
import org.erasmusmc.data_mining.peregrine.normalizer.api.NormalizerFactory;
import org.erasmusmc.data_mining.peregrine.normalizer.impl.LVGNormalizer;
import org.erasmusmc.data_mining.peregrine.normalizer.impl.NormalizerFactoryImpl;
import org.erasmusmc.data_mining.peregrine.tokenizer.api.TokenizerFactory;
import org.erasmusmc.data_mining.peregrine.tokenizer.impl.SBDtokenizer;
import org.erasmusmc.data_mining.peregrine.tokenizer.impl.TokenizerFactoryImpl;

public class Main {

    public static void main(String[] args) {
        // Load the ontology (thesaurus) from a single file.
        Ontology ontology = new SingleFileOntologyImpl("/home/public/thesauri/my.ontology");

        // Set up tokenization, normalization (LVG) and disambiguation.
        TokenizerFactory tokenizerFactory = TokenizerFactoryImpl.createDefaultTokenizerFactory(new SBDtokenizer());
        NormalizerFactory normalizerFactory = NormalizerFactoryImpl.createDefaultNormalizerFactory(new LVGNormalizer(
                "/home/public/LVG/lvg2006lite/data/config/lvg.properties"));
        Disambiguator disambiguator = new DisambiguatorImpl(new RuleDisambiguator[] { new StrictDisambiguator(),
                new LooseDisambiguator() });
        DisambiguationDecisionMaker disambiguationDecisionMaker = new ThresholdDisambiguationDecisionMakerImpl();

        // Wire everything into the hash-based Peregrine implementation.
        Peregrine peregrine = new PeregrineImpl(ontology, tokenizerFactory, normalizerFactory, disambiguator,
                disambiguationDecisionMaker);

        String text = "I have super text with words like malaria and water";

        List<IndexingResult> indexingResults = peregrine.index(text, Language.EN);

        // Print the concept id and the matched text of every result.
        for (IndexingResult indexingResult : indexingResults) {
            System.out.println("Found termId: " + indexingResult.getTermId().getConceptId() + ", matched text: "
                    + text.substring(indexingResult.getStartPos(), indexingResult.getEndPos()));
        }
    }
}
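The sample program passes getStartPos() and getEndPos() straight to String.substring, which treats its second argument as exclusive. This page does not say whether Peregrine's end position is inclusive or exclusive, so the following self-contained sketch shows both readings. The class OffsetSketch and its methods are hypothetical helpers, not part of the Peregrine API.

```java
final class OffsetSketch {

    /** Extracts a match when endPos is exclusive, which is what String.substring expects. */
    static String matchedTextExclusiveEnd(String text, int startPos, int endPos) {
        return text.substring(startPos, endPos);
    }

    /** Extracts a match when endPos is the index of the last matched character. */
    static String matchedTextInclusiveEnd(String text, int startPos, int endPos) {
        return text.substring(startPos, endPos + 1);
    }

    public static void main(String[] args) {
        String text = "I have super text with words like malaria and water";
        int start = text.indexOf("malaria");
        int exclusiveEnd = start + "malaria".length();

        // Both calls print the same word when given the matching convention.
        System.out.println(matchedTextExclusiveEnd(text, start, exclusiveEnd));
        System.out.println(matchedTextInclusiveEnd(text, start, exclusiveEnd - 1));
    }
}
```

If the matched text printed by the sample program is missing its last character, the end position is most likely inclusive and the second variant applies.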
Profiling the memory usage of Peregrine
Peregrine has built-in support for memory profiling using the wicket library. Because this library is listed as optional in the Maven dependencies of the org.erasmusmc.data-mining.peregrine.peregrine-impl-hash project, users of that project must explicitly add it as a runtime dependency, or add it to WEB-INF/lib manually (if the target deliverable is a WAR application).
<dependency>
  <groupId>wicket</groupId>
  <artifactId>wicket</artifactId>
  <version>1.1</version>
  <scope>runtime</scope>
</dependency>
After that, the memory information is available via the PeregrineImpl.toString() method.
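As a rough cross-check that needs no extra dependency, overall JVM heap usage can be compared before and after Peregrine is constructed. This is a minimal sketch and not what PeregrineImpl.toString() reports: it measures the whole heap, is affected by garbage collection timing, and uses a plain array allocation as a stand-in for loading an ontology. The class HeapUsageSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class HeapUsageSketch {

    /** Returns the heap currently in use by this JVM, in bytes. */
    static long usedHeapBytes() {
        Runtime runtime = Runtime.getRuntime();
        return runtime.totalMemory() - runtime.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();

        // Stand-in for constructing PeregrineImpl with a loaded ontology:
        // allocate roughly 16 MiB and keep it reachable.
        List<int[]> payload = new ArrayList<>();
        for (int i = 0; i < 16; i++) {
            payload.add(new int[256 * 1024]); // 256Ki ints of 4 bytes each = 1 MiB
        }

        long after = usedHeapBytes();
        System.out.println("Chunks held: " + payload.size());
        System.out.println("Approximate heap growth in MiB: " + (after - before) / (1024 * 1024));
    }
}
```

The reported growth is only indicative; for per-object figures the built-in profiling described above remains the intended route.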
Deploying Peregrine Service
(This page is incomplete; it will be updated when the Peregrine installer is implemented.)
- Download and run the install script, then deploy the resulting .war file to an application server.
Reference List
- [1] For the complete list of ontology backend providers, see ontology backends.