Peregrine Usage

From BioAssist
Revision as of 14:43, 26 March 2010 by Dmitry Katsubo (Talk | contribs)


Peregrine usage examples

Using Peregrine as a library via the Java API

  • Add this Maven repository so that Maven can resolve the Peregrine-related artifacts:
<repositories>
	<repository>
		<id>sara-artifactory-server-id</id>
		<name>SARA Artifactory - Peregrine Maven releases</name>
		<url>http://ws1.grid.sara.nl:21501/artifactory/libs-releases/</url>
	</repository>
</repositories>
  • The following packages are mandatory and must be included in your project:
    • peregrine-api (includes ontology-api and common-utils)
    • peregrine-normalizer
    • peregrine-tokenizer
  • Decide which ontology provider to use. There are several options[1]:
    • File source ontology (ontology-impl-file)
    • DB source ontology (ontology-impl-db)
  • Decide whether you need to disambiguate the text indexing results (usually you do). If so, include the peregrine-disambiguator project. The disambiguator layer can be used out of the box, without special configuration.

  • Decide which implementation of the Peregrine interface to use. At the time of writing, only one implementation is available: peregrine-impl-hash.

Sample pom.xml configuration

[Figure: Peregrine client dependency graph (Peregrine client dependency graph.png)]
<project
	xmlns="http://maven.apache.org/POM/4.0.0"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

	<modelVersion>4.0.0</modelVersion>

	<groupId>mygroup</groupId>
	<artifactId>peregrine-client</artifactId>
	<packaging>jar</packaging>
	<version>0.2-SNAPSHOT</version>

	<name>Peregrine Sample Client</name>
	<inceptionYear>2009</inceptionYear>

	<repositories>
		<repository>
			<id>sara-artifactory-server-id</id>
			<url>http://ws1.grid.sara.nl:21501/artifactory/libs-releases</url>
		</repository>
	</repositories>

	<dependencies>
		<dependency>
			<groupId>org.erasmusmc.data-mining.peregrine</groupId>
			<artifactId>peregrine-api</artifactId>
			<version>0.2-SNAPSHOT</version>
		</dependency>
		<dependency>
			<groupId>org.erasmusmc.data-mining.peregrine</groupId>
			<artifactId>peregrine-normalizer</artifactId>
			<version>0.2-SNAPSHOT</version>
			<scope>runtime</scope>
		</dependency>
		<dependency>
			<groupId>org.erasmusmc.data-mining.peregrine</groupId>
			<artifactId>peregrine-tokenizer</artifactId>
			<version>0.2-SNAPSHOT</version>
			<scope>runtime</scope>
		</dependency>
		<dependency>
			<groupId>org.erasmusmc.data-mining.peregrine</groupId>
			<artifactId>peregrine-disambiguator</artifactId>
			<version>0.2-SNAPSHOT</version>
			<scope>runtime</scope>
		</dependency>
		<dependency>
			<groupId>org.erasmusmc.data-mining.peregrine</groupId>
			<artifactId>peregrine-impl-hash</artifactId>
			<version>0.2-SNAPSHOT</version>
			<scope>runtime</scope>
		</dependency>
		<dependency>
			<groupId>org.erasmusmc.data-mining.ontology</groupId>
			<artifactId>ontology-impl-db</artifactId>
			<version>0.2-SNAPSHOT</version>
			<scope>runtime</scope>
		</dependency>
	</dependencies>
</project>

Sample ontology-impl-file configuration

<?xml version="1.0"?>
<beans
	xmlns="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://www.springframework.org/schema/beans	http://www.springframework.org/schema/beans/spring-beans-2.5.xsd">

	<bean name="ontology" class="org.erasmusmc.data_mining.ontology.impl.file.SingleFileOntologyImpl">
		<constructor-arg value="file:/home/user/ontology_data.txt" />
	</bean>
</beans>

Sample ontology-impl-db configuration

<?xml version="1.0"?>
<beans
	xmlns="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns:tx="http://www.springframework.org/schema/tx"
	xsi:schemaLocation="
		http://www.springframework.org/schema/beans	http://www.springframework.org/schema/beans/spring-beans-2.5.xsd
		http://www.springframework.org/schema/tx http://www.springframework.org/schema/tx/spring-tx-2.5.xsd">

	<bean id="ontology" class="org.erasmusmc.data_mining.ontology.impl.db.DBOntologyImpl" lazy-init="true">
		<constructor-arg>
			<bean class="org.springframework.jdbc.core.simple.SimpleJdbcTemplate">
				<constructor-arg ref="ontologyDataSource" />
			</bean>
		</constructor-arg>
	</bean>

	<bean id="ontologyDataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close" scope="singleton" lazy-init="true">
		<property name="driverClassName" value="com.mysql.jdbc.Driver" />
		<property name="url" value="jdbc:mysql://myserver:3306/mydatabase?autoReconnect=true" />
		<property name="username" value="dbuser" />
		<property name="password" value="dbpass" />
		<property name="validationQuery" value="select 1" />
	</bean>

	<bean id="txManager" class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
		<property name="dataSource" ref="ontologyDataSource"/>
	</bean>

	<tx:annotation-driven transaction-manager="txManager" />
</beans>

Sample peregrine-impl-hash configuration

<?xml version="1.0"?>
<beans
	xmlns="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://www.springframework.org/schema/beans	http://www.springframework.org/schema/beans/spring-beans-2.5.xsd">

	<bean name="thresholdDisambiguationDecisionMaker" class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.ThresholdDisambiguationDecisionMakerImpl">
		<property name="disambiguationMinimalWeight" value="50" />
		<property name="disambiguationAlwaysAcceptedWeight" value="80" />
	</bean>

	<bean name="ruleDisambiguator" class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.DisambiguatorImpl">
		<constructor-arg>
			<list>
				<ref local="looseDisambiguator" />
				<ref local="strictDisambiguator" />
			</list>
		</constructor-arg>
	</bean>

	<bean id="looseDisambiguator" class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.LooseDisambiguator">
		<constructor-arg>
			<bean class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule.IsHomonymRule" />
		</constructor-arg>
		<constructor-arg>
			<bean class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule.IsPreferredTermRule" />
		</constructor-arg>
		<constructor-arg>
			<bean class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule.HasSynonymRule">
				<property name="maxSynonymDistance" value="40" />
				<property name="minSynonymWeight" value="75" />
				<property name="maxSynonymWeight" value="80" />
			</bean>
		</constructor-arg>
	</bean>

	<bean id="strictDisambiguator" class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.StrictDisambiguator">
		<constructor-arg>
			<bean class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule.IsHomonymRule" />
		</constructor-arg>
		<constructor-arg>
			<bean class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule.IsPreferredTermRule" />
		</constructor-arg>
		<constructor-arg>
			<bean class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule.IsComplexRule">
				<property name="maxTermLength" value="6" />
				<property name="minTermLength" value="3" />
				<property name="minTermNumbers" value="1" />
				<property name="minTermLetters" value="1" />
			</bean>
		</constructor-arg>
		<constructor-arg>
			<bean class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule.HasSynonymRule">
				<property name="maxSynonymDistance" value="40" />
				<property name="minSynonymWeight" value="75" />
				<property name="maxSynonymWeight" value="80" />
			</bean>
		</constructor-arg>
		<constructor-arg>
			<bean class="org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule.HasKeywordRule">
				<property name="maxKeywordDistance" value="300" />
				<property name="minKeywordWeight" value="75" />
				<property name="maxKeywordWeight" value="80" />
			</bean>
		</constructor-arg>
	</bean>

	<bean name="peregrine" class="org.erasmusmc.data_mining.peregrine.impl.hash.PeregrineImpl">
		<constructor-arg ref="ontology" />
		<constructor-arg>
			<bean class="org.erasmusmc.data_mining.peregrine.tokenizer.impl.SubSentenceTokenizer" />
		</constructor-arg>
		<constructor-arg>
			<bean class="org.erasmusmc.data_mining.peregrine.normalizer.impl.LVGNormalizer" />
		</constructor-arg>
		<constructor-arg ref="ruleDisambiguator" />
		<constructor-arg ref="thresholdDisambiguationDecisionMaker" />
	</bean>
</beans>

Using one of the existing Peregrine projects

peregrine-ws is an example of exposing the Peregrine interface as a web service. This is done using JAX-WS, which provides a built-in servlet that serves the registered HTTP endpoints.

peregrine-rmi is a good example of how the Peregrine interface can be exposed over RMI using the built-in facilities of the Spring framework. If all the RMI-related code is removed, what remains is an example of a command-line Peregrine utility that can, for example, take a file as an argument, index it, and write the results to another file.
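The command-line scenario described above can be sketched roughly as follows. This is a minimal illustration rather than code from peregrine-rmi: the class name FileIndexerCli, the tab-separated output format, and the findWord placeholder (a plain substring search standing in for the Peregrine call) are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical command-line wrapper: reads a text file, indexes it, writes one line per match.
class FileIndexerCli {

	// Runs the given indexer over the input file and writes its output lines to the output file.
	static void indexFile(Path input, Path output, Function<String, List<String>> indexer) throws IOException {
		String text = new String(Files.readAllBytes(input), StandardCharsets.UTF_8);
		Files.write(output, indexer.apply(text), StandardCharsets.UTF_8);
	}

	// Placeholder indexer (NOT Peregrine): reports start and end offsets of each occurrence of a fixed word.
	static List<String> findWord(String text, String word) {
		List<String> lines = new ArrayList<>();
		for (int i = text.indexOf(word); i >= 0; i = text.indexOf(word, i + 1)) {
			lines.add(word + "\t" + i + "\t" + (i + word.length()));
		}
		return lines;
	}

	public static void main(String[] args) throws IOException {
		if (args.length != 2) {
			System.err.println("Usage: FileIndexerCli <input file> <output file>");
			return;
		}
		// In a real utility the lambda would call the Peregrine index method instead of findWord.
		indexFile(Paths.get(args[0]), Paths.get(args[1]), text -> findWord(text, "malaria"));
	}
}
```

Passing the indexer as a function keeps the file handling separate from the indexing, so the placeholder can be swapped for a real Peregrine instance without touching the I/O code.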

peregrine-client is an example of using Peregrine as a library in a web application. It provides a simple JSP presentation of the Peregrine indexing results. As a JSP is compiled into a servlet, it can produce XML instead of HTML and so act as a REST service.
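To illustrate the idea of returning XML instead of HTML, here is a minimal sketch that renders indexing results as an XML string. The Match class is a hypothetical, simplified stand-in for an indexing result (concept id plus start and end offsets), and the element and attribute names are invented for the example; peregrine-client may structure its output differently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical renderer that turns indexing results into a small XML document.
class XmlResultWriter {

	// Simplified stand-in for an indexing result: concept id and match offsets (end is exclusive).
	static final class Match {
		final String conceptId;
		final int startPos;
		final int endPos;

		Match(String conceptId, int startPos, int endPos) {
			this.conceptId = conceptId;
			this.startPos = startPos;
			this.endPos = endPos;
		}
	}

	// Escapes the characters that are not allowed verbatim in XML text and attribute values.
	static String escape(String s) {
		return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace("\"", "&quot;");
	}

	// Builds one <match> element per result, containing the matched fragment of the text.
	static String toXml(String text, List<Match> matches) {
		StringBuilder xml = new StringBuilder("<results>\n");
		for (Match m : matches) {
			xml.append("\t<match conceptId=\"").append(escape(m.conceptId))
					.append("\" start=\"").append(m.startPos)
					.append("\" end=\"").append(m.endPos).append("\">")
					.append(escape(text.substring(m.startPos, m.endPos)))
					.append("</match>\n");
		}
		return xml.append("</results>\n").toString();
	}

	public static void main(String[] args) {
		String text = "I have super text with words like malaria and water";
		int start = text.indexOf("malaria");
		// "C1" is a made-up concept id used only for this demonstration.
		System.out.print(toXml(text, Arrays.asList(new Match("C1", start, start + "malaria".length()))));
	}
}
```

In a servlet or JSP, the same string would be written to the response with the content type set to an XML media type.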

Without Maven and Spring

Download the release package, which contains all the necessary libraries. After adding the dependencies you need to your project, you can program a simple scenario:

import java.io.Serializable;
import java.util.List;

import org.erasmusmc.data_mining.ontology.api.Concept;
import org.erasmusmc.data_mining.ontology.api.Language;
import org.erasmusmc.data_mining.ontology.api.Ontology;
import org.erasmusmc.data_mining.ontology.common.LabelTypeComparator;
import org.erasmusmc.data_mining.ontology.impl.file.SingleFileOntologyImpl;
import org.erasmusmc.data_mining.peregrine.api.IndexingResult;
import org.erasmusmc.data_mining.peregrine.api.Peregrine;
import org.erasmusmc.data_mining.peregrine.disambiguator.api.DisambiguationDecisionMaker;
import org.erasmusmc.data_mining.peregrine.disambiguator.api.Disambiguator;
import org.erasmusmc.data_mining.peregrine.disambiguator.api.RuleDisambiguator;
import org.erasmusmc.data_mining.peregrine.disambiguator.impl.ThresholdDisambiguationDecisionMakerImpl;
import org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule_based.LooseDisambiguator;
import org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule_based.StrictDisambiguator;
import org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule_based.TypeDisambiguatorImpl;
import org.erasmusmc.data_mining.peregrine.impl.hash.PeregrineImpl;
import org.erasmusmc.data_mining.peregrine.normalizer.api.NormalizerFactory;
import org.erasmusmc.data_mining.peregrine.normalizer.impl.LVGNormalizer;
import org.erasmusmc.data_mining.peregrine.normalizer.impl.NormalizerFactoryImpl;
import org.erasmusmc.data_mining.peregrine.tokenizer.api.TokenizerFactory;
import org.erasmusmc.data_mining.peregrine.tokenizer.impl.SBDtokenizer;
import org.erasmusmc.data_mining.peregrine.tokenizer.impl.TokenizerFactoryImpl;

public class Main {

	public static void main(String[] args) {
		// FlyweightProcessingOntology ontology = new FileFlyweightProcessingOntologyImpl("C:/my.ontology.bz2");
		Ontology ontology = new SingleFileOntologyImpl("/home/file/my.ontology");

		TokenizerFactory tokenizerFactory = TokenizerFactoryImpl.createDefaultTokenizerFactory(new SBDtokenizer());
		NormalizerFactory normalizerFactory = NormalizerFactoryImpl.createDefaultNormalizerFactory(new LVGNormalizer(
				"/home/public/LVG/lvg2006lite/data/config/lvg.properties"));
		Disambiguator disambiguator = new TypeDisambiguatorImpl(new RuleDisambiguator[] { new StrictDisambiguator(),
				new LooseDisambiguator() });
		DisambiguationDecisionMaker disambiguationDecisionMaker = new ThresholdDisambiguationDecisionMakerImpl();

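		// Wire the ontology, tokenizer, normalizer and disambiguation components together.
		Peregrine peregrine = new PeregrineImpl(ontology, tokenizerFactory, normalizerFactory, disambiguator,
				disambiguationDecisionMaker);

		String text = "I have super text with words like malaria and water";

		// Index the text; each result identifies a matched term and its position in the text.
		List<IndexingResult> indexingResults = peregrine.index(text, Language.EN);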

		for (IndexingResult indexingResult : indexingResults) {
			Serializable conceptId = indexingResult.getTermId().getConceptId();

			System.out.println("Found conceptId: " + conceptId + ", matched text: "
					+ text.substring(indexingResult.getStartPos(), indexingResult.getEndPos()));

			Concept concept = ontology.getConcept(conceptId);

			System.out.println("Concept label is: " + LabelTypeComparator.getPreferredLabel(concept.getLabels()));
		}
	}
}

Profiling the memory usage of Peregrine

Peregrine has built-in support for memory profiling using the wicket library. This library is listed as optional in the Maven dependencies of the org.erasmusmc.data-mining.peregrine.peregrine-impl-hash project, so users of that project must explicitly add it as a runtime dependency, or add it to WEB-INF/lib manually (if the target deliverable is a WAR application).

<dependency>
	<groupId>wicket</groupId>
	<artifactId>wicket</artifactId>
	<version>1.1</version>
	<scope>runtime</scope>
</dependency>

After that, the memory information is available via the PeregrineImpl.toString() method.
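As a rough cross-check that does not depend on wicket, the overall heap growth caused by loading Peregrine can be estimated with the standard JVM Runtime API. The sketch below is an assumption about how one might do this and is not part of Peregrine; the HeapUsage class is hypothetical and the array allocation is only a placeholder for loading the ontology and constructing the Peregrine instance.

```java
// Hypothetical helper for a coarse before/after heap measurement.
class HeapUsage {

	// Returns the heap currently in use, in bytes (total allocated heap minus free heap).
	static long usedHeapBytes() {
		Runtime runtime = Runtime.getRuntime();
		return runtime.totalMemory() - runtime.freeMemory();
	}

	public static void main(String[] args) {
		System.gc(); // only a hint; the JVM may ignore it, so the figures are approximate
		long before = usedHeapBytes();

		// Placeholder for loading the ontology and creating the Peregrine instance.
		int[] placeholder = new int[4000000];

		long after = usedHeapBytes();
		System.out.println("Approximate heap growth: " + (after - before) / (1024 * 1024) + " MB");

		// Use the array after the measurement so that it stays reachable while measuring.
		System.out.println("Placeholder length: " + placeholder.length);
	}
}
```

Such a measurement only gives a total figure; the per-structure details remain available through the toString() output described above.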

Deploying Peregrine Service

You can download the installer, which will guide you through several steps to create a final package:

  • After the information panel, you need to accept the software license.
  • On the package selection panel, you choose the kind of final deliverable (a web service as a WAR file or an RMI service as a ZIP file) and the packages to include in it.
  • The installer then downloads the package from the repository and extracts the configurable parameters (properties) from it.
  • You are prompted to change these parameters and to specify the destination name for the package.
  • Finally, the installer injects the parameters into the package and saves it under the specified name.
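The parameter step above amounts to overlaying user-supplied values on the defaults shipped in the package. A minimal sketch of that idea with java.util.Properties follows; the ParameterInjection class and the property names db.url and db.username are hypothetical and chosen only for illustration, and the installer's actual mechanism is not documented here.

```java
import java.util.Properties;

// Hypothetical illustration of overlaying user-edited parameters on packaged defaults.
class ParameterInjection {

	// Returns a new Properties object: the defaults with the overrides applied on top.
	static Properties applyOverrides(Properties defaults, Properties overrides) {
		Properties merged = new Properties();
		merged.putAll(defaults);
		merged.putAll(overrides);
		return merged;
	}

	public static void main(String[] args) {
		// Defaults as they might be extracted from the downloaded package (made-up keys).
		Properties defaults = new Properties();
		defaults.setProperty("db.url", "jdbc:mysql://myserver:3306/mydatabase");
		defaults.setProperty("db.username", "dbuser");

		// Values changed by the user in the installer.
		Properties overrides = new Properties();
		overrides.setProperty("db.username", "alice");

		Properties merged = applyOverrides(defaults, overrides);
		System.out.println("db.url = " + merged.getProperty("db.url"));
		System.out.println("db.username = " + merged.getProperty("db.username"));
	}
}
```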

Reference List

  1. For the complete list of ontology backend providers, see ontology backends