From BioAssist

This webpage is no longer maintained and is only here for archive purposes. Please refer to https://trac.nbic.nl/gscf and http://dbnp.org for up-to-date information on this project.


This page aims to collect initiatives which (partly) share our aims, and serves as a basis to discuss how we relate to them. If you are reading this and something comes to mind, please feel free to add it.

Biological study metadata storage

In dbNP, we want to store the metadata about biological studies: the study subjects, study design and study samples, in short all the information that specifies where, when and how each sample was taken and which 'omics assays' were performed on it.


MIBBI (http://mibbi.org) is a project which lists many working groups that are writing specifications of the minimal required information for specific kinds of biological experiments. The projects are not all in the same stage. For example, the MIAME guidelines (for microarray experiments) are quite mature, whereas CIMR (the metabolomics experiment guidelines from the MSI workgroup) seems less widely implemented. Whenever we define templates to store biological information, we should definitely use the documents listed there as a resource. A lot of them are PDF documents, but there are also frequent links to ontologies.


FuGE (Functional Genomics Experiment, http://fuge.sourceforge.net) is a standard for specifying details about functional genomics experiments. It is a comprehensive and quite complex object model, because it aims at capturing complete laboratory workflows. See also DbNPFuGEOM.


The biological data area where data storage and querying are in the most advanced stage is probably transcriptomics. MAGE-TAB (http://www.mged.org/mage-tab) is a format for storing biological study metadata in a MIAME (http://www.mged.org/Workgroups/MIAME/miame.html) compliant way, and it is widely used. For example, many transcriptomics studies stored in ArrayExpress (http://www.ebi.ac.uk/microarray-as/ae/) have their biological metadata stored in MAGE-TAB (see http://www.ebi.ac.uk/microarray/doc/help/MAGE-TAB.html). A tool to create and edit MAGE-TAB files (written in Adobe AIR) is Annotare (http://code.google.com/p/annotare). Other examples of online transcriptomics databases which have study design descriptions are GEO (http://www.ncbi.nlm.nih.gov/geo) and caArray (https://array.nci.nih.gov/caarray/home.action).

See also DbNPMageTabOM.


The ISATAB (Investigation - Study - Assay tabular) format is a quite general format (it is deliberately kept simple) which is intended to specify metadata (study subjects, design etc.) about biological investigations in a structured and consistent way. Tools to write and store ISATAB data are being developed at the EBI (http://isatab.sourceforge.net). The source code of these efforts is announced on the website, but is not available yet; binaries are provided.

The ISATAB format is an important source of inspiration for this project. We aim to be able to export to ISATAB from our metadata storage. However, since the ISATAB tools are not open source (and the tools were not sufficient for our current users), we cannot work with them.

The ISATAB format is almost the same as the IDF part of the MAGETAB format, with the difference that ISATAB adds a new Investigation layer which allows grouping of different Studies into an Investigation package.
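As a sketch, that extra layer means an ISA-Tab investigation file can declare several STUDY blocks under one INVESTIGATION block. The field names below are a small subset of what the specification defines, and the identifiers and file names are made up for illustration:

```
INVESTIGATION
Investigation Identifier	INV-1
Investigation Title	Example investigation grouping two studies
STUDY
Study Identifier	STUDY-1
Study Title	First study
Study File Name	s_study1.txt
STUDY
Study Identifier	STUDY-2
Study Title	Second study
Study File Name	s_study2.txt
```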


Molgenis (http://www.molgenis.org) is a code generator which can generate a full biological database application (database backend, web user interface, web services, R interface) from just an XML description of the data model. It can be used to quickly set up a database for a specific domain, and a number of examples have already been implemented and are available online (see http://www.molgenis.org/wiki/MolgenisSystemList). It also generates a UML schema from your data model; see for example the implementation of the MAGE-TAB standard (http://wwwdev.ebi.ac.uk/microarray-srv/magetab/doc/objectmodel.html#__figure_of_complete_schema), which has an importer for MAGETAB IDF and SDRF files.


XGAP is a genotype/phenotype database (mainly focused on genetics/genomics: QTL analysis, GWAS etc.), running on top of Molgenis. A sample implementation can be found here. The XGAP project also has its own tab-delimited format, specified at http://www.xgap.org/wiki/XgapFormatReference. It is also able to import any type of data that is structured as a matrix, which gives it a simple, general way to store (clean) data, such as genetic markers or metabolite concentrations.


The Pheno-OM data model captures phenotypes and the protocols used to collect them, with the purpose of being a reusable module in various systems. It was conceived because within the GEN2PHEN project there was a realization that many projects, ranging from biobanks to high-throughput experiments, need to capture phenotypes, but that there is no community consensus on how to structure this data. Core structures in this model are based on common concepts from MAGE-TAB, FuGE, XGAP and PaGE-OM, as documented at http://wwwdev.ebi.ac.uk/microarray-srv/pheno/doc/objectmodel.html. A MOLGENIS-based reference implementation exists and can be found at http://wwwdev.ebi.ac.uk/microarray-srv/pheno/molgenis.do. It is expected that this model will be added to existing projects to enrich phenotype capturing, in particular in light of biobanking efforts, and it may end up as an add-on in XGAP, MAGE-TAB, and HGVbaseG2P. The GEN2PHEN standards portal can be found at http://www.gen2phen.org/wiki/standards.


Ibidas (https://wiki.nbic.nl/index.php/Ibidas) is a storage system (which, as I understand it, can use database servers but also flat files as a backend) that aims at storing different kinds of biological data in an ontology-aware manner. At the moment, it seems to target mainly genomic data.


SysMO-DB (http://www.sysmo-db.org) is a project that was initially aimed at data exchange in Systems Biology for Micro-organisms, "but the principles and methods employed are equally applicable to other multi-site Systems Biology projects". Indeed their demo (http://demo.sysmo-db.org) shows among other things how to incorporate MAGE-TAB studies in their database.


Ontocat (Ontology Common API Tasks) is a Java API for common ontology tasks, just as the name says. Its main advantage is that it gives a common interface to access any of BioPortal, OLS (the EBI Ontology Lookup Service, the European counterpart) and local OWL/OBO Foundry files. Joining this effort might be a very good way to cooperate with other open source initiatives. For example, if BioPortal were to change their output format again, we could join with those people (who are also involved with Gen2Phen) to get the system up again. The project page is http://ontocat.sourceforge.net/index.html, the API is at http://ontocat.sourceforge.net/doc/index.html and for required libraries see https://ontocat.svn.sourceforge.net/svnroot/ontocat/trunk/ontoCAT/lib/.


In the ISATAB and MAGETAB models, Experimental Factors play an important role in describing study design. The EBI has been developing an experimental factor ontology called EFO, which can be browsed via the OLS at http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EFO.

RDF formats

There are numerous Semantic Web initiatives for life sciences on different levels, such as the ConceptWiki, several myGrid projects, or, on a smaller scale, SNPedia. Many are gathered by this W3C SIG; see also an older page. However, most of them seem concerned with the storage of knowledge, not so much actual data or study descriptions. The following is an inventory of RDF formats/projects that do focus on describing biological studies (study aims, goals, persons, publications, but also: subjects, events, samples, assays). Our goal is to find an RDF or even XMI exchange format that could act as a candidate for exchanging data with e.g. several of the above-mentioned MOLGENIS-based study capture tools. (Using a format like ISATAB would require a lot of extension, because with the current standard we would lose too much detail.) For a nice description of the relation between XML, UML and XMI, see this page.


It might be interesting to note that the just-mentioned ConceptWiki project, initiated by the Concept Web Alliance, doesn't use a conventional RDBMS as a backend. Instead, it uses a graph database engine called neo4j. An overview of NoSQL databases can be found here.


EXPO is an example of an ontology (in this case in OWL) for describing scientific experiments. However, this ontology is clearly too general for our purposes: we would like to describe biological subjects, omics assays etc.


The Semantic Web Applications in Neuromedicine project has ontologies for the Neuromedicine domain. Of the different ontologies, the Scientific discourse ontologies might be of interest. However, like EXPO, this one is probably too general for our purposes.


The MORFEO project is an open-source software community concerned with Service-Oriented Architectures. There is a subproject about an ontology for units of measurement, with a very nice explanation of the status quo of how to do this in RDF here.


Annotare (http://code.google.com/p/annotare) is a tool for annotating biomedical investigations and resulting data. It is a stand-alone desktop application that features 1) a set of intuitive editor forms to create and modify annotations, 2) support for easy incorporation of terms from biomedical ontologies, 3) standard templates for common experiment types, 4) a design wizard to help create a new document, and 5) a validator that checks for syntactic and semantic violations. Annotare will help a bench biologist construct a MIAME-compliant annotation file based on the MAGE-TAB format. (Figure: Annotare components, http://annotare.googlecode.com/files/AnnotareComponents.jpg)

myGrid e-Labs Research Objects (?)

See this page. Research Objects seem at least partially concerned with storing study information, but of course also with research methods and results (which makes sense within the myGrid context). One example is the Obesity e-Lab startup; however, this example does not seem very developed, and I could not find any concrete data model.


OpenBIS is a metadata and data storage system for biological experiments with a server/client architecture. OpenBIS has Dataspaces, Projects, Experiments, Samples and Datasets. Metadata can be added at any of these levels, using Property Types, which are very similar to Template fields in GSCF. Compared with the data structure of GSCF, the entities of GSCF would mainly fall in between OpenBIS Experiments and Samples. There are no Events or Subjects in OpenBIS (but you can define relations between Samples). It seems to be mainly targeted at managing and storing larger (possibly multi-omics-platform) datasets. Screencasts can be watched here.


LabKey is a (mature) open source platform for storage of biological studies. There are targeted versions for, among others, proteomics. It is released under the Apache 2.0 license, written in Java, and it can be run in Tomcat. It can store different types of assay data, and it also has an integrated R server for data viewing purposes.


SetupX is a platform for study capturing developed by the Metabolomics Fiehn Lab at UC Davis. A running instance can be found at http://fiehnlab.ucdavis.edu:8080/m1/. SetupX integrates with BinBase, a tool aimed at metabolite identification in metabolomics data.

Biological study query

In dbNP, we want to be able to pose many different biological queries to the database, relating to both the metadata and the clean omics data, which are probably stored in different data warehouses.


Again, for transcriptomics, a lot of nice tools have already been developed. For example, ATLAS (http://www.ebi.ac.uk/gxa) is a tool that lets you interactively query gene expression in a subset of studies from ArrayExpress (see MAGE-TAB above). The nice thing is that it is also coupled to the biological metadata, which is used in the graphs that are produced when you take a detailed look at the query results.


Galaxy (http://bitbucket.org/galaxy/galaxy-central/wiki/Home) is a web-based query tool which is able to load genomic data from a lot of public data sources and perform all kinds of basic data operations on them. It also allows you to store the operations you used as a workflow for future use.


Taverna (http://taverna.sourceforge.net) is a workflow system which uses web services as building blocks to generate a data manipulation pipeline. MyExperiment (http://www.myexperiment.org) is a server on which you can share Taverna workflows with other users. We could certainly use the idea of storing and sharing workflows. Whether we should use web services to organize this is still to be decided.


BII is the ISATAB database from EBI. A public instance is hosted at http://www.ebi.ac.uk/bioinvindex, and it contains some example studies. It links to several other EBI data stores such as ArrayExpress and PRIDE. Its query options are limited: you can only filter on organism, measurement, technology and platform.


UniProt is an interesting project for us. UniProt is a widely used database for protein sequence and functional information. The query interface is very sophisticated (for a demo, see http://www.uniprot.org/demos/diabetes), starting from full-text querying but providing the user with options to narrow down and structure the search. Technically, the project is also interesting. UniProt contains a lot of references to external databases, and therefore it was difficult to find a solution which gave a good abstraction from data storage details on the one hand, while still providing good large-scale query performance on the other. In the end, the team chose to describe data relations in RDF, and built their own RDF inference/query engine called expasy4j. Read the full story here.

Querying and searching

For querying and searching, we can probably learn from the Apache Lucene project and the Grails Searchable plugin. An example Grails project called GATEWiki, a semantic wiki using Nutch and Solr (both built on Lucene), can be found at http://gatewiki.sourceforge.net.


An example of a mature data interchange protocol, in this case for sequence annotations, is BioDAS. It is used by, among others, Ensembl and GBrowse to exchange sequence annotations.


The OpenTox 1.1 API specification might be of interest because it connects to multiple (tox/bioinformatic) query components via a uniform REST interface. OpenTox modules also use RDF to annotate the data that is provided. See this site (under development) for an online running demo instance of Ambit, an OpenTox module that implements this API.

Omics data processing

In dbNP, to answer specific omics questions, data manipulation sometimes needs to be performed on the clean or even the raw data. We can only incorporate those types of queries when we have very specific requirements (see DbNPFeatures). The following list serves as a summary of candidate tools which we could link into dbNP to perform those analyses.


GenePattern is a web application for performing data analysis and data processing tasks, actively built and maintained by the Broad Institute of MIT and Harvard. It is also possible to create pipelines which link multiple processing and analysis steps together. GenePattern has a built-in job management system which keeps track of all computations and their results. A GenePattern module is little more than a shell around a piece of Java, R or MATLAB code, which keeps track of the submitted parameters and files. A repository containing about 130 modules for genomics, transcriptomics and metabolomics analysis/processing is maintained at the Broad Institute. The obvious advantage of GenePattern over e.g. plain R packages is that modules can also contain Java and MATLAB code, and that it provides file and job management. It is also possible to host a private GenePattern module repository. Whenever analysis methods are available as e.g. BioConductor packages, they can of course be repackaged as GenePattern modules; NuGO members have done so. GenePattern has a SOAP interface. It can be downloaded and installed locally, and it is licensed under the MIT license.


MetaboAnalyst is a web service which allows users to upload (raw) metabolomics data and process and analyze them. Like GenePattern, it uses R/Bioconductor packages to perform these analyses. It uses JSF and Rserve to provide the integration (see this article). At this moment I cannot find any details about source code or licensing.


caBIG is a large platform for cancer research which integrates many aspects that we also cover. It has comprehensive tools for managing data of patients in clinical trials (clinical data, adverse events, study participation etc.). It also has a number of subprojects which might be of interest, for example the caCORE SDK (which seems more or less comparable to Molgenis) and the accompanying Workbench, but also a number of tools for genetic data (geWorkbench, and caGWAS for GWAS) and RNA microarrays (caArray and also geWorkbench). We can probably also learn from the way they handle grid processing and security. TODO: a more comprehensive comparison, especially to drill down into the details of study capture and to see whether the caBIG tools offer a possibility to query omics data in the way we intend to in dbNP - by linking to treatments, compounds, genes etc.


Bioclipse is an open source workbench (based on the Eclipse RCP) for chemo- and bioinformatics. It particularly has support for querying and viewing compound structures, building on the Chemistry Development Kit and leveraging structure viewers such as JMol. It might especially be interesting for the metabolomics submodule.

Linking to ontologies

We currently use both a self-written ontology widget (see DbNP Technical Documentation#Ontology Chooser) and the Ontocat library (http://ontocat.sourceforge.net, see above) for contacting BioPortal ontologies. From numerous tests and discussions with users, it became clear that the most convenient way to use ontologies to describe studies was to have a short 'cached' list of terms that are used in the database, plus the possibility to add more terms on the fly when needed. For example, the species list starts out with a limited number of well-known model species:


But if a user wants to choose a species that is not yet in the list (but should be in the ontology), he or she can click 'Add more' and add the term:


For this screen, the application connects to the BioPortal ontology (in this case the NCBI Species Ontology) via the Ontology Chooser widget to search for terms that match or have a matching synonym to the text the user is typing. If the user clicks 'add term', the term is stored locally in the database (all required properties are already passed via the BioPortal REST interface, so we don't need an additional fetch for that). To actually add a new ontology, we currently rely on the Ontocat library to fetch information about the ontology.
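To make the 'no additional fetch' point above concrete: everything needed to store a term locally can be pulled out of a single search hit. The XML shape below is made up for illustration only (the real BioPortal search response uses its own element names and carries more fields); the sketch just shows that one parse of one hit yields the label and accession we store:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class TermSearchSketch {
    // Hypothetical shape of a single search hit; field names are illustrative.
    static final String SAMPLE_RESPONSE =
        "<searchResultList>" +
        "  <searchBean>" +
        "    <ontologyVersionId>38802</ontologyVersionId>" +
        "    <conceptId>9606</conceptId>" +
        "    <preferredName>Homo sapiens</preferredName>" +
        "  </searchBean>" +
        "</searchResultList>";

    // Extract the preferred name and accession of the first hit.
    public static String[] firstHit(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element bean = (Element) doc.getElementsByTagName("searchBean").item(0);
        // All properties we need to store the term locally are already in the
        // hit, so no second fetch against the service is required.
        return new String[] {
            bean.getElementsByTagName("preferredName").item(0).getTextContent(),
            bean.getElementsByTagName("conceptId").item(0).getTextContent()
        };
    }

    public static void main(String[] args) throws Exception {
        String[] hit = firstHit(SAMPLE_RESPONSE);
        System.out.println(hit[0] + " (" + hit[1] + ")");
    }
}
```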

Features to-do

Get database-specific ontology IDs

NCBO ontologies have two IDs: the ID which identifies the ontology in general (e.g. 1132 for the NCBI species ontology) and a version-specific ID (e.g. 38802 for version 1.2 of that ontology). The ontology chooser widget needs them both, so we need to store both in the database.

Via the Ontocat library, we had problems fetching the versioned ID, but it turned out that you can get the versioned ID just by asking for the 'id' property:

static Ontology getBioPortalOntology(String ncboId) {
	// Get ontology from BioPortal via Ontocat
	// TODO: maybe make a static OntologyService instance to be more efficient, and decorate it with caching?
	uk.ac.ebi.ontocat.OntologyService os = new uk.ac.ebi.ontocat.bioportal.BioportalOntologyService()
	uk.ac.ebi.ontocat.Ontology o = os.getOntology(ncboId)
	// Instantiate and return Ontology object
	new dbnp.data.Ontology(
		name: o.label,
		description: o.description,
		url: o.properties['homepage'] ?: "http://bioportal.bioontology.org/ontologies/${o.id}",
		versionNumber: o.versionNumber,
		ncboId: o.ontologyAccession,
		ncboVersionedId: o.id
	)
}

Also, for this particular use case the advantage of using Ontocat is not very clear, because we can also do it easily directly via the NCBO REST service:

// use the NCBO REST service to fetch ontology information
def url = "http://rest.bioontology.org/bioportal/ontologies/" + ncboVersionedId
def xml = new URL(url).getText()
def data = new XmlParser().parseText(xml)
def bean = data.data.ontologyBean
// instantiate Ontology with the proper values
def ontology = new dbnp.data.Ontology(
	name: bean.displayLabel.text(),
	description: bean.description.text(),
	url: bean.homepage.text(),
	versionNumber: bean.versionNumber.text(),
	ncboId: bean.ontologyId.text() as int,
	ncboVersionedId: bean.id.text() as int
)

This way, we do not have to include Ontocat, which improves startup time because we no longer have to load a number of JARs. The run-time performance difference between the two approaches is hard to measure, but both appear acceptable for a user (a few seconds, depending almost entirely on the time the BioPortal REST call takes to return). Maybe if we used the caching decorator for Ontocat, that approach would be faster, but because of the way we use ontologies, fetching information about an ontology is a rare event anyway (it only happens when the user changes the definition of an ontology-related template field).

Another advantage of the second approach is that you can directly use the 'ncboVersionedId' to get the ontology, which happens to be the id you also get back when using the BioPortal REST service to query for terms. With Ontocat, on the other hand, you have to specify the 'ncboId' to get the ontology, so the versioned id first has to be converted.

Of course, this brings up another question: whether it is wise to use the ncboId or the ncboVersionedId to define the ontologies for ontology template fields. The first option implies that you always get the latest version; the latter means that you stay versioned, so that terms that are used in the database cannot get deleted. We have not yet seen a concrete example of an ontology update which involves our project, so we do not have the experience to make an educated choice here. We currently stick to using the versioned ontology id. Which is already bugging us...

Use relationships to filter on terms

It would be very nice if we could attach criteria to template fields, to specify not only the source ontology, but also e.g. the level of the term. We have two concrete user wishes for that, described in the following paragraphs. We also asked Tomasz Adamusiak for an expert opinion on how to do this; see the talk page.

Have an ontology indicate the measurement unit

If we could indicate the unit of each (numerical) template field using an ontology, that would be really great. Especially so if there were a chance to use RDF or OWL to convert units automatically, Wolfram Alpha style, when you query, whenever they are comparable (see e.g. this project). However, that may be a little too far-fetched.

But at least if we stick to a good units of measurement ontology now, we know when units are of the same type. The Units of measurement ontology in BioPortal looks like a good candidate for this.

There is one problem with this ontology, though. Some much-used derived units are not in there. For example, mg/dL is not a term in the ontology, only mg/mL. But mg/dL is a widely used unit for e.g. cholesterol concentration in humans, and the users do not want to convert the units into something more SI-compliant, because mg/dL gives them the ranges they are used to looking at for that particular compound. More or less the same goes for nmol/L, which is equivalent to nM (nanomolar).
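To make the 'same type' point concrete: once two units share a dimension, conversion is just a ratio of factors to a common base unit (1 dL = 100 mL, so 1 mg/dL = 0.01 mg/mL). The unit names and factors below are hard-coded for illustration only, not taken from any ontology:

```java
import java.util.Map;

public class UnitConversion {
    // Conversion factors to a base unit for one dimension (mass concentration,
    // base unit mg/mL). Units sharing an entry in this table are comparable.
    static final Map<String, Double> TO_MG_PER_ML = Map.of(
        "mg/mL", 1.0,
        "mg/dL", 0.01,   // 1 dL = 100 mL
        "g/L",   1.0     // 1 g / 1 L = 1000 mg / 1000 mL
    );

    // Convert a value between two comparable units via the base unit.
    static double convert(double value, String from, String to) {
        Double f = TO_MG_PER_ML.get(from), t = TO_MG_PER_ML.get(to);
        if (f == null || t == null)
            throw new IllegalArgumentException("units are not comparable");
        return value * f / t;
    }

    public static void main(String[] args) {
        // 180 mg/dL cholesterol, expressed in mg/mL
        System.out.println(convert(180, "mg/dL", "mg/mL"));
    }
}
```

A real implementation would derive the table from a units ontology rather than hard-code it, which is exactly where the gaps described above (mg/dL missing) become a problem.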

Most widely used measurement units are in the NCI Thesaurus, as children of the 'Unit of measurement' term. (At least that is how it appears in the browser; is this defined by a parent-child relationship or by an 'is-a' relationship?) So it would be nice if we could specify to Ontocat that only terms below a certain term should be returned as the result of a query.
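A minimal sketch of the 'only terms below a certain term' idea, assuming the subtree accessions under 'Unit of measurement' can be fetched once (e.g. by recursively walking the ontology service's child relations) and cached. The accession codes below are made up for illustration:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SubtreeFilter {
    // Keep only the search hits whose accession falls inside the cached
    // subtree of the chosen parent term.
    static List<String> filterToSubtree(List<String> hits, Set<String> subtree) {
        return hits.stream().filter(subtree::contains).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up accessions for terms below 'Unit of measurement'
        Set<String> unitsSubtree = Set.of("C25613", "C28253", "C28254");
        // Raw query results: one unit term, one unrelated term
        List<String> hits = List.of("C28253", "C12345");
        System.out.println(filterToSubtree(hits, unitsSubtree)); // [C28253]
    }
}
```

This is client-side post-filtering; it would be nicer if the query service itself accepted a 'root term' parameter, which is what we would like to ask of Ontocat.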

Show only actual species in the species ontology

In the NCBI organismal classification, you basically have the whole Linnaean tree, including kingdom, phylum, class etc. down to species, and sometimes even subspecies. It would be nice if you could filter the terms to only contain species, not any other level. However, this might prove difficult, because at least in NCBO there are no properties which indicate at which level the terms are. Also, the problem is somewhat alleviated by our ontology field strategy, where we have a local cache with the ontology terms that were actually added by the users.