Ibidas aims to be an integration platform for various kinds of biological data. By providing a minimal but flexible database schema, we hope to be able to store any kind of biological data while keeping the learning curve for using the system low. The data can be browsed using Cytoscape, and a web-service is currently under development.
The system consists of a PostgreSQL back-end with a Python layer on top. The Python code functions as an ORM and resembles, but does not rely upon, SQLAlchemy.
The database structure builds on top of the BioSQL schema, but provides a higher level of abstraction.
The entire database schema is built around entries, relations and qualifiers, as explained in the next sections.
Because Ibidas does not have separate tables for every type of element, the type of an element needs to be stored in the database itself; this is done using the term table. A type can describe a single entry (gene, protein, SNP, up-stream region), a relation (protein - protein interaction, co-occurrence, co-expression) or a qualifier (p-value, reference, note).
The Ibidas schema was designed to be as generic as possible while still being 'workable' in terms of performance. For this reason a distinction is made between different types of single entities, such as bioconcepts, (data)sets and terms.
Every major entry table has an associated relationship table to describe different types of relations. These relationship tables contain a unique identifier, two id fields linking the two entries, a field specifying whether the relationship is directed, and a relationship type describing what kind of relationship is being stored.
Qualifiers are used to store the 'actual' information: where entries and relations are just abstractions, the qualifiers specify in more detail what is being stored.
The qualifier structure is built using PostgreSQL multiple inheritance: each 'major' table receives a number of qualifiers to store specific information. Currently the following data types are available as qualifiers:
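As a rough illustration of the entry, term and relationship tables described above, the sketch below builds a tiny version of the structure. It is not the actual Ibidas schema: all table and column names (term, bioconcept, bioconcept_relation, subject_id, object_id and so on) are assumptions made for this example, and SQLite is used in place of PostgreSQL only so that the example runs without a database server.

```python
import sqlite3

# Hypothetical table and column names; the real Ibidas schema may differ.
SCHEMA = """
CREATE TABLE term (
    term_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE bioconcept (
    bioconcept_id INTEGER PRIMARY KEY,
    type_term_id  INTEGER NOT NULL REFERENCES term(term_id)
);
CREATE TABLE bioconcept_relation (
    relation_id  INTEGER PRIMARY KEY,   -- unique identifier
    subject_id   INTEGER NOT NULL REFERENCES bioconcept(bioconcept_id),
    object_id    INTEGER NOT NULL REFERENCES bioconcept(bioconcept_id),
    directed     INTEGER NOT NULL DEFAULT 0,
    type_term_id INTEGER NOT NULL REFERENCES term(term_id)
);
"""


def build_demo():
    """Create an in-memory database with two proteins and one interaction."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    terms = {}
    for name in ("protein", "protein - protein interaction"):
        cur = conn.execute("INSERT INTO term (name) VALUES (?)", (name,))
        terms[name] = cur.lastrowid
    entry_ids = []
    for _ in range(2):
        cur = conn.execute(
            "INSERT INTO bioconcept (type_term_id) VALUES (?)",
            (terms["protein"],),
        )
        entry_ids.append(cur.lastrowid)
    conn.execute(
        "INSERT INTO bioconcept_relation "
        "(subject_id, object_id, directed, type_term_id) VALUES (?, ?, ?, ?)",
        (entry_ids[0], entry_ids[1], 0, terms["protein - protein interaction"]),
    )
    conn.commit()
    return conn


def relations_of_type(conn, type_name):
    """Return (subject_id, object_id, directed) rows for one relation type."""
    return conn.execute(
        "SELECT r.subject_id, r.object_id, r.directed "
        "FROM bioconcept_relation r "
        "JOIN term t ON t.term_id = r.type_term_id "
        "WHERE t.name = ?",
        (type_name,),
    ).fetchall()
```

The point of the sketch is that the kind of entry or relation is a row in the term table rather than a separate table per type, so new types can be added without changing the schema.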
- value: for numbers
- txt: for storing comments
- data: an array of values
- reference: for storing references to papers
- taxon: species information
- accession: synonyms and different identifiers
- term: for specifying additional types
- (data)set: for specifying arbitrary sets
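One way to picture how each 'major' table could receive its qualifiers through PostgreSQL inheritance is sketched below. This is an assumption about the layout, not the actual Ibidas DDL: the parent table names (`<major>_qualifier` and `qualifier_<kind>`) and the child naming scheme are hypothetical. The helper only generates the statements as strings and does not connect to a database.

```python
# Qualifier kinds taken from the list above; "set" stands for "(data)set".
QUALIFIER_KINDS = [
    "value", "txt", "data", "reference",
    "taxon", "accession", "term", "set",
]


def qualifier_ddl(major_table, kind):
    """Return a CREATE TABLE statement for one (major table, kind) pair.

    Assumes two hypothetical parent tables already exist:
    '<major_table>_qualifier' (link to the owning row) and
    'qualifier_<kind>' (the typed payload column).
    """
    child = f"{major_table}_{kind}"
    parents = f"{major_table}_qualifier, qualifier_{kind}"
    return f"CREATE TABLE {child} () INHERITS ({parents});"


def all_qualifier_ddl(major_table):
    """Return the statements for every qualifier kind of one major table."""
    return [qualifier_ddl(major_table, kind) for kind in QUALIFIER_KINDS]
```

For example, `qualifier_ddl("bioconcept", "value")` yields a child table that inherits its columns from both assumed parents, which is what PostgreSQL's `INHERITS` clause with more than one parent provides.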
Python is used to shield the user from the database back-end, and it should be possible to browse all data without ever writing SQL. To achieve this, a multi-layer system was created which basically contains an ORM (Object Relational Mapping) with some command-line facilities.
The database abstraction layer makes sure that, as the name implies, Ibidas can be installed using a number of different databases. This layer manages the dropping and resetting of constraints, loads foreign key relationships, makes sure that ample serials are available etc. This is the lowest Python layer and thus the most difficult to program.
The SQL abstraction layer provides functionality to automatically generate and execute SQL statements. Object relations can be easily traversed and provide the foundations on which the command-line layer is built. The SQL and database abstraction layers together make up the ORM.
The command-line layer provides an easy interface into the database back-end and should be the layer on which applications are written.
The data-import layer consists of a number of scripts that parse input files and extract information from them. This information is inserted into the database using the command-line layer.
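To give an idea of what automatically generating SQL statements can look like, here is a minimal sketch of a statement builder. It is not the Ibidas SQL abstraction layer: the function name `build_select` and the table and column names in the usage note are hypothetical. The `%s` placeholders follow the DB-API 'format' parameter style used by PostgreSQL drivers such as psycopg2.

```python
def build_select(table, columns, filters=None):
    """Return (sql, params) for a simple SELECT with equality filters.

    Identifiers are checked before being placed in the statement;
    values are never interpolated and are returned as parameters.
    """
    filters = filters or {}
    for name in [table, *columns, *filters]:
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    params = []
    if filters:
        clauses = []
        for column, value in filters.items():
            clauses.append(f"{column} = %s")
            params.append(value)
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```

Calling `build_select("bioconcept", ["bioconcept_id"], {"type_term_id": 3})` returns the statement text together with the parameter list `[3]`, which can then be handed to a DB-API cursor's `execute` method.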
Currently the Cytoscape link is implemented using XML-RPC, with a server and a client on both the Ibidas and Cytoscape ends to facilitate bi-directional communication. Data can be pushed into Cytoscape and the selected nodes can be retrieved. This is all done from the command line. We are planning to update the Cytoscape plug-in so that this can all be done graphically.
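The push-and-retrieve pattern can be sketched with Python's standard XML-RPC modules. This is only an illustration of one direction of the link: the class `NetworkEndpoint` and its method names are hypothetical stand-ins for the Cytoscape side, not the actual plug-in interface, and both ends run in the same process here.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class NetworkEndpoint:
    """Hypothetical stand-in for the Cytoscape side of the link."""

    def __init__(self):
        self.nodes = []
        self.selected = []

    def push_nodes(self, node_ids):
        """Receive nodes pushed from the other end; return the node count."""
        self.nodes.extend(node_ids)
        return len(self.nodes)

    def select_nodes(self, node_ids):
        """Mark known nodes as selected (done via the GUI in a real client)."""
        self.selected = [n for n in node_ids if n in self.nodes]
        return True

    def get_selected_nodes(self):
        return self.selected


def start_endpoint(host="127.0.0.1", port=0):
    """Start the XML-RPC server in a background thread (port 0 = any free port)."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_instance(NetworkEndpoint())
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def demo():
    """Push three nodes, select some, and read the selection back."""
    server = start_endpoint()
    host, port = server.server_address
    proxy = ServerProxy(f"http://{host}:{port}")
    try:
        proxy.push_nodes(["geneA", "geneB", "geneC"])
        proxy.select_nodes(["geneB", "geneX"])
        return proxy.get_selected_nodes()
    finally:
        server.shutdown()
        server.server_close()
```

For bi-directional communication as described above, each side would run both a server like `start_endpoint` and a `ServerProxy` pointing at the other side.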
To meet the demands of the BioAssist projects some web-services will need to be implemented. Currently we are investigating different possibilities and scenarios. These web-services will in all likelihood be implemented using something like SOAP or BioMoby. For an overview of Python SOAP libraries see: Python_SOAP_libraries.
The data needs to contain information to link against; this is currently the only available option when integrating large amounts of data. We have looked at two different methods for filling the database, listed below.
This method entails writing a parser for a particular data source and inserting the data into the database.
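A minimal sketch of such a parser is shown below, assuming a hypothetical tab-delimited input with accession, entry type and taxon id columns. The file layout and the `insert` callback are assumptions for illustration; in Ibidas the insertion would go through the command-line layer, whose API is not shown here.

```python
import csv
import io


def parse_records(handle):
    """Yield (accession, entry_type, taxon_id) tuples from a tab-delimited file.

    Blank lines and lines starting with '#' are skipped.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        accession, entry_type, taxon_id = row[:3]
        yield accession, entry_type, int(taxon_id)


def import_records(handle, insert):
    """Parse the file and pass every record to the insert callback."""
    count = 0
    for record in parse_records(handle):
        insert(*record)
        count += 1
    return count


# Usage with an in-memory file and a list standing in for the database.
sample = io.StringIO(
    "# accession\ttype\ttaxon\n"
    "P12345\tprotein\t9606\n"
    "\n"
    "Q67890\tprotein\t10090\n"
)
stored = []
imported = import_records(sample, lambda *record: stored.append(record))
```

Keeping the parsing separate from the insertion means that only `parse_records` has to be rewritten for each new data source.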
We'd like to use MRS as a possible method of inserting data, but currently the output of its web-service is not structured. This means that we would have to write a parser for each data source that is available through MRS, which is exactly what we would like to avoid by using such a service.
Currently the database contains:
- NCBI taxons
- Data extracted from Integromics
Additionally, parsers have been written for:
- Marc Hulsman: original design and implementation
- Jan Bot: current maintainer & lead developer
Currently we are updating the SQL abstraction layer to deal more efficiently with the underlying database. See ibidas_design_ideas.