From BioAssist
Revision as of 20:20, 2 June 2010 by Mswertz (Talk | contribs)



This page collects initiatives that (partly) share our aims, and serves as a basis to discuss how we relate to them. If you are reading this and something comes to mind, please feel free to add it.

Biological study metadata storage

In dbNP, we want to store the metadata about biological studies: the study subjects, the study design, the study samples; all the information that ultimately specifies where, when and how the samples on which the 'omics assays' were performed were taken.


MIBBI (http://mibbi.org) is a project which lists many working groups that are each writing a specification of the minimal information required for a specific kind of biological experiment. The projects are not all in the same stage: for example, the MIAME guidelines (for microarray experiments) are quite mature, whereas CIMR (the metabolomics experiment guidelines from the MSI working group) seems less widely implemented. Whenever we are defining templates to store biological information, we should definitely use the documents listed here as a resource. Many of them are PDF documents, but there are also frequent links to ontologies.


FuGE (Functional Genomics Experiment, http://fuge.sourceforge.net) is a standard for specifying details about functional genomics experiments. It is a comprehensive and quite complex object model, because it aims at capturing complete laboratory workflows. See also DbNPFuGEOM.


The biological data area where data storage and querying are most advanced is probably transcriptomics. MAGE-TAB (http://www.mged.org/mage-tab) is a format for storing biological study metadata in a MIAME-compliant (http://www.mged.org/Workgroups/MIAME/miame.html) way, and it is widely used. For example, many transcriptomics studies stored in ArrayExpress (http://www.ebi.ac.uk/microarray-as/ae/) have their biological metadata stored in MAGE-TAB (see http://www.ebi.ac.uk/microarray/doc/help/MAGE-TAB.html). A tool to create and edit MAGE-TAB files (written in Adobe AIR) is Annotare (http://code.google.com/p/annotare). Other examples of online transcriptomics databases with study design descriptions are GEO (http://www.ncbi.nlm.nih.gov/geo) and caArray (https://array.nci.nih.gov/caarray/home.action).

See also DbNPMageTabOM.
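To give a feel for the format: an SDRF file is essentially a tab-delimited table in which each row traces one sample from source material to assay. A minimal sketch of reading such a file (the file content and column set here are made up for illustration; real SDRF files have many more columns):

```python
import csv

def read_sdrf(path):
    """Read a MAGE-TAB SDRF file into a list of row dictionaries.

    An SDRF is a plain tab-delimited table whose first line holds
    column headers such as 'Source Name' or 'Characteristics[organism]'.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))

# Write a tiny made-up SDRF so the sketch is self-contained.
with open("example.sdrf.txt", "w") as out:
    out.write("Source Name\tCharacteristics[organism]\tAssay Name\n")
    out.write("sample1\tHomo sapiens\tassay1\n")

rows = read_sdrf("example.sdrf.txt")
print(rows[0]["Source Name"])  # sample1
```

Because the format is just tab-delimited text, this kind of lightweight parsing is enough for experiments; a real importer would of course validate the header vocabulary against the specification.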


The ISATAB (Investigation - Study - Assay tabular) format is a quite general format (it is deliberately kept simple) which is intended to specify metadata (study subjects, design etc.) about biological investigations in a structured and consistent way. Tools to write and store ISATAB data are being developed at EBI (http://isatab.sourceforge.net). The source code of these efforts is announced on the website, but is not available yet; binaries are provided.

The ISATAB format is an important source of inspiration for this project. We aim to be able to export to ISATAB from our metadata storage. However, since the ISATAB tools are not open source, we cannot build on them.
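An ISATAB export does not need the EBI tools to be feasible: the investigation file is, roughly, a set of tab-delimited label/value rows grouped into sections. The field labels and values below are only a sketch of that layout, not a faithful rendering of the specification:

```python
def write_investigation(path, study):
    """Write a (heavily simplified) ISATAB-style investigation file.

    The real specification defines many more sections and fields;
    this only illustrates the tab-delimited 'label<TAB>value' layout.
    """
    lines = [
        "STUDY",
        "Study Identifier\t%s" % study["identifier"],
        "Study Title\t%s" % study["title"],
        "Study Description\t%s" % study["description"],
    ]
    with open(path, "w") as out:
        out.write("\n".join(lines) + "\n")

write_investigation("i_example.txt", {
    "identifier": "STUDY-001",  # made-up identifier
    "title": "Example nutrigenomics study",
    "description": "Illustration only",
})
```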


Molgenis (http://www.molgenis.org) is a code generator which can generate a full biological database application (database backend, web user interface, web services, R interface) from just an XML description of the data model. It can be used to quickly set up a database for a specific domain, and a number of examples have already been implemented and are available online (see http://www.molgenis.org/wiki/MolgenisSystemList). It also generates a UML schema from your data model; see for example the implementation of the MAGE-TAB standard (http://wwwdev.ebi.ac.uk/microarray-srv/magetab/doc/objectmodel.html#__figure_of_complete_schema), which has an importer for MAGE-TAB IDF and SDRF files.


XGAP is a genotype/phenotype database (mainly focused on genetics/genomics: QTL analysis, GWAS studies etc.), running on top of Molgenis. A sample implementation can be found here. The XGAP project also has its own tab-delimited format, specified at http://www.xgap.org/wiki/XgapFormatReference. It can also import any type of data that is structured as a matrix, which gives it a simple, general way to store (clean) data such as genetic markers or metabolite concentrations.
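The matrix idea is simple enough to sketch: any clean data set is a table with row identifiers (e.g. metabolites), column identifiers (e.g. samples) and numeric values. A minimal, illustrative loader (the file layout here is an assumption, not the exact XGAP format):

```python
import csv

def load_matrix(path):
    """Load a tab-delimited matrix: first row = sample names,
    first column = feature names (e.g. markers or metabolites)."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)[1:]  # sample identifiers
        data = {}
        for row in reader:
            feature, values = row[0], [float(v) for v in row[1:]]
            data[feature] = dict(zip(header, values))
        return data

# Self-contained example with made-up metabolite concentrations.
with open("matrix.txt", "w") as out:
    out.write("\tsampleA\tsampleB\n")
    out.write("glucose\t1.2\t0.9\n")
    out.write("lactate\t0.4\t0.7\n")

matrix = load_matrix("matrix.txt")
print(matrix["glucose"]["sampleB"])  # 0.9
```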


The Pheno-OM data model is a data model to capture phenotypes and the protocols used to collect them, with the purpose of being a reusable module in various systems. It was conceived because within the GEN2PHEN project there was a realization that many projects, ranging from biobanks to high-throughput studies, need to capture phenotypes, but that there is no community consensus on how to structure this data. Core structures in this model are based on common concepts from MAGE-TAB, FuGE, XGAP and PaGE-OM, as documented at http://wwwdev.ebi.ac.uk/microarray-srv/pheno/doc/objectmodel.html. A MOLGENIS-based reference implementation exists and can be found at http://wwwdev.ebi.ac.uk/microarray-srv/pheno/molgenis.do. It is expected that this model will be added to existing projects to enrich phenotype capturing, in particular in light of biobanking efforts, and it may end up as an add-on in XGAP, MAGE-TAB, and HGVbaseG2P.


Ibidas (https://wiki.nbic.nl/index.php/Ibidas) is a storage system (which, in my understanding, can use database servers but also flat files as a backend) that aims at storing different kinds of biological data in an ontology-aware manner. At the moment, it seems to target mainly genomic data.


SysMO-DB (http://www.sysmo-db.org) is a project that was initially aimed at data exchange in Systems Biology for Micro-organisms, "but the principles and methods employed are equally applicable to other multi-site Systems Biology projects". Indeed their demo (http://demo.sysmo-db.org) shows among other things how to incorporate MAGE-TAB studies in their database.


Ontocat (Ontology Common API Tasks) is a Java API for common ontology tasks, just as the name says. Its main advantage is that it gives a common interface to access any of BioPortal, OLS (the EBI Ontology Lookup Service, the European counterpart) and local OWL/OBO Foundry files. Joining this effort might be a very good way to cooperate with other open source initiatives. For example, if BioPortal were to change their output format again, we could join with those people (who are also involved with Gen2Phen) to get the system up again. The project page is http://ontocat.sourceforge.net/index.html, the API is at http://ontocat.sourceforge.net/doc/index.html and for required libraries see https://ontocat.svn.sourceforge.net/svnroot/ontocat/trunk/ontoCAT/lib/.


In the ISATAB and MAGE-TAB models, Experimental Factors play an important role in describing study design. The EBI has been developing an experimental factor ontology called EFO, which can be browsed via the OLS at http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EFO.

Biological study query

In dbNP, we want to be able to pose many different biological queries to the database, relating to both the metadata and the omics clean data, which will probably be stored in different data warehouses.


Again, for transcriptomics, a lot of nice tools have already been developed. For example, ATLAS (http://www.ebi.ac.uk/gxa) is a tool that lets you interactively query gene expression in a subset of studies from ArrayExpress (see MAGE-TAB above). The nice thing about this is that it is also coupled to the biological metadata, which is used in the graphs that are produced when you take a detailed look at the query results.


Galaxy (http://bitbucket.org/galaxy/galaxy-central/wiki/Home) is a web-based query tool which is able to load genomic data from a lot of public data sources and perform all kinds of basic data operations on them. It also allows you to store the operations you used as a workflow for future use.


Taverna (http://taverna.sourceforge.net) is a workflow system which uses web services as building blocks to generate a data manipulation pipeline. MyExperiment (http://www.myexperiment.org) is a server on which you can share Taverna workflows with other users. We could certainly use the idea of storing and sharing workflows; whether we should use web services to organize this is still to be decided.


BII is the ISATAB database from EBI. A public instance is hosted at http://www.ebi.ac.uk/bioinvindex, and it contains some example studies. It links to several other EBI data stores such as ArrayExpress and PRIDE. Its query options are limited: you can only filter on organism, measurement, technology and platform.


UniProt is an interesting project for us. UniProt is a widely used database for protein sequence and functional information. The query interface is very sophisticated (for a demo, see http://www.uniprot.org/demos/diabetes), starting from full-text querying but providing the user with options to narrow down and structure the search. Technically, the project is also interesting: UniProt contains a lot of references to external databases, and therefore it was difficult to find a solution which gave a good abstraction from data storage details on the one hand, while still providing good large-scale query performance on the other. In the end, the team chose to describe data relations in RDF, and built their own RDF inference/query engine called expasy4j. Read the full story here.

Querying and searching

For querying and searching, we can probably learn from the Apache Lucene project, and the Grails Searchable plugin. An example Grails project called GATEWiki, a semantic wiki using Nutch and Solr (both building on Lucene), can be found at http://gatewiki.sourceforge.net.
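The core idea behind Lucene (and hence Searchable, Nutch and Solr, which build on it) is an inverted index: a map from each term to the documents containing it, so that free-text queries become fast set operations. A toy illustration of that principle (this is the concept only, not the Lucene API):

```python
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND search)."""
    sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*sets) if sets else set()

# Made-up study descriptions as the document collection.
docs = {
    "study1": "fasting glucose response in mice",
    "study2": "glucose tolerance test in humans",
}
index = build_index(docs)
print(search(index, "glucose mice"))  # {'study1'}
```

Lucene adds tokenization, stemming, ranking and on-disk index structures on top of this; the Grails Searchable plugin wraps all of that behind domain-class annotations.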

Omics data processing

In dbNP, to answer specific omics questions, sometimes data manipulation on the clean or even the raw data needs to be performed. We can only incorporate those types of queries when we have very specific requirements (see DbNPFeatures). The following list serves as a summary of candidate tools which we could link into dbNP to perform those analyses.


GenePattern is a web application for performing data analysis and data processing tasks, built and actively maintained by the Broad Institute of MIT and Harvard. It is also possible to create pipelines which link multiple processing and analysis steps together. GenePattern has a built-in job management system which keeps track of all computations and their results. A GenePattern module is little more than a shell around a piece of Java, R or MATLAB code, which keeps track of the submitted parameters and files. A repository containing about 130 modules for genomics, transcriptomics and metabolomics analysis/processing is maintained at the Broad Institute. The obvious advantage of GenePattern over e.g. plain R packages is that modules can also contain Java and MATLAB code, and that it provides file and job management. It is also possible to host a private GenePattern module repository. Whenever analysis methods are available as e.g. BioConductor packages, they can of course be repackaged as GenePattern modules; NuGO members have done so. GenePattern has a SOAP interface, and it can be downloaded and installed locally; it is licensed under the MIT license.


MetaboAnalyst is a web service which allows users to upload (raw) metabolomics data and process and analyze them. Like GenePattern, it uses R/Bioconductor packages to perform these analyses. It uses JSF and Rserve to provide the integration (see this article). At this moment I cannot find any details about source code or licensing.


caBIG is a large platform for cancer research which integrates many aspects that we also cover. It has comprehensive tools for managing data of patients in clinical trials (clinical data, adverse events, study participation etc.). It also has a number of subprojects which might be of interest, for example the caCORE SDK (which seems more or less comparable to Molgenis) and the accompanying Workbench, but also a number of tools for genetic data (geWorkbench and caGWAS for GWAS) and RNA microarrays (caArray and again geWorkbench). We can probably also learn from the way they handle grid processing and security. TODO: a more comprehensive comparison, especially to drill down into the details of study capture and to see whether the caBIG tools offer a way to query omics data as we intend to in dbNP - by linking to treatments, compounds, genes etc.

Linking to ontologies

We currently use both a self-written ontology widget (see DbNP Technical Documentation#Ontology Chooser) and the Ontocat library (http://ontocat.sourceforge.net, see above) for contacting BioPortal ontologies. From numerous tests and discussions with users, it became clear that the most convenient way to use ontologies to describe studies was to have a short 'cached' list of terms that are used in the database, plus the possibility to add more terms on the fly when needed. For example, the species list starts out with a limited number of well-known model species:


But if a user wants to choose a species that is not yet in the list (but should be in the ontology), he or she can click 'Add more' and add the term:


For this screen, the application connects to the BioPortal ontology (in this case the NCBI Species Ontology) via the Ontology Chooser widget to search for terms that match or have a matching synonym to the text the user is typing. If the user clicks 'add term', the term is stored locally in the database (all required properties are already passed via the BioPortal REST interface, so we don't need an additional fetch for that). To actually add a new ontology, we currently rely on the Ontocat library to fetch information about the ontology.
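The caching pattern described above is straightforward: look the term up locally first, and only on a miss fetch it from the remote service and store it. A sketch of that logic (the fetch function is a stub standing in for the BioPortal REST call, so the sketch stays self-contained; it is not the real NCBO interface):

```python
# Local term cache: in the real application this is a database table.
term_cache = {}

def fetch_remote(ontology, name):
    """Placeholder for a BioPortal REST lookup returning term properties.

    In dbNP this call would go over HTTP to the NCBO REST interface;
    here it is stubbed with made-up values."""
    return {"name": name, "ontology": ontology, "accession": "stub:0001"}

def get_term(ontology, name):
    """Return a term, preferring the local cache over a remote fetch."""
    key = (ontology, name)
    if key not in term_cache:
        term_cache[key] = fetch_remote(ontology, name)  # cache on first use
    return term_cache[key]

first = get_term("NCBI species", "Danio rerio")   # remote fetch, then cached
second = get_term("NCBI species", "Danio rerio")  # served from the cache
print(first is second)  # True
```

The same structure covers the 'Add more' flow: a cache miss triggers one remote lookup, after which the term behaves like any locally stored one.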

Features to-do

Get database-specific ontology IDs

NCBO ontologies have two IDs: the ID which identifies the ontology in general (e.g. 1132 for the NCBI species ontology) and a version-specific ID (e.g. 38802 for version 1.2 of that ontology). The ontology chooser widget needs them both, so we need to store both in the database.

Via the Ontocat library, it is currently quite hard to fetch the versioned ID. What we do now is access the codingScheme property, which has the ID somewhere in its string value, and extract it from there with the regular expression "/(\\d{5})/". However, this is not a very robust approach, so we should probably collaborate with the Ontocat project and see if we can get something like a version-specific ontology ID getter in there.
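For reference, the extraction currently amounts to something like the following (the codingScheme string below is a made-up example; the real property value may look different):

```python
import re

def extract_versioned_id(coding_scheme):
    """Pull a 5-digit versioned ontology ID out of the codingScheme
    string, mirroring the /(\\d{5})/ regular expression used now.

    Returns None when no run of exactly-or-more 5 digits is present,
    which is exactly why this approach is fragile."""
    match = re.search(r"(\d{5})", coding_scheme)
    return match.group(1) if match else None

# Made-up codingScheme value containing the versioned ID 38802.
print(extract_versioned_id("NCBI organismal classification|38802"))  # 38802
```

Any change in the codingScheme layout, or a versioned ID that is not five digits long, silently breaks this, which is the argument for pushing a proper getter upstream into Ontocat.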