Metabolomics Communication standards
Interfacing with data
A metabolomics data warehouse will require the storage of data on several different levels:
- Raw data (directly from the mass spectrometer)
- Raw data converted to a uniform format (netCDF?)
- Preprocessed data (observed peaks)
- Identified metabolites
- Metadata on all the previous steps: experiment setup and patient data, but also details of how the (automated) analysis was performed.
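The levels listed above could be modelled roughly as follows. This is only a minimal sketch, not a proposed schema: the names `DataLevel`, `StoredItem`, `derived_from` and `lineage` are hypothetical, and the idea that each item records which item it was derived from is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class DataLevel(Enum):
    """The storage levels from the list above (names are hypothetical)."""
    RAW_VENDOR = "raw data directly from the mass spectrometer"
    RAW_UNIFORM = "raw data converted to a uniform format"
    PEAKS = "preprocessed data (observed peaks)"
    METABOLITES = "identified metabolites"


@dataclass
class StoredItem:
    """One stored data set together with its metadata."""
    experiment_id: str
    level: DataLevel
    location: str  # where the payload lives, e.g. a file path or URL
    metadata: dict = field(default_factory=dict)
    derived_from: "StoredItem | None" = None  # assumption: items keep a link to their source

    def lineage(self):
        """Return the levels from the original raw data up to this item."""
        chain = []
        item = self
        while item is not None:
            chain.append(item.level)
            item = item.derived_from
        return list(reversed(chain))


# Example: a peak list derived directly from a vendor raw file.
raw = StoredItem("exp-001", DataLevel.RAW_VENDOR, "raw/exp-001.vendor")
peaks = StoredItem(
    "exp-001",
    DataLevel.PEAKS,
    "peaks/exp-001.tsv",
    metadata={"tool": "hypothetical-peak-picker"},
    derived_from=raw,
)
```

Keeping the metadata and the link to the source item next to each data set is one way to record how an automated analysis step was performed.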
Interaction with these components is required from several different perspectives, manual and automated. Manual interaction could be required in the following cases:
- Data upload: experiments, but also metadata, peak identifications, etc.
- Searching and retrieving experiments
- Progress monitoring
However, due to the large amounts of information typically involved in metabolomics analysis, automated interaction must be possible, for example in the following cases:
- Automated data upload
  - For example, batch upload of a backlog, or automated upload from a mass spectrometer.
- Data retrieval
  - For example, into a statistical package or a visualization tool.
- Data analysis
  - For example, automated preprocessing, peak identification, statistical analysis, etc. This requires interaction with a workflow management tool.
Automated interaction requires a well-defined interface. Many choices are possible, but to conform to the NBIC BioAssist project, BioMOBY must be used.
BioMOBY is a SOAP-based web service definition standard that allows service discovery through a central service registry. For more information, see the BioMOBY website.
Using BioMOBY requires:
- A BioMOBY interface around the data warehouse
- Implementation of analysis tools as BioMOBY services
BioMOBY is XML-based and uses HTTP for transport. This combination is not well suited to the communication of large amounts of data. Since neither BioMOBY nor Taverna has a solution (yet) to this problem, large amounts of data will be transported by reference. Transport by reference has the advantage that the data itself can be moved with a protocol better suited to large amounts of data, such as FTP.
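Transport by reference can be sketched as follows: the service message carries only a small description of where the data is and what it looks like, and the receiver fetches the payload itself over a bulk protocol. This is a sketch under assumptions, not the BioMOBY message format: the `DataReference` fields, the JSON encoding, the checksum and the example URL are all hypothetical choices made for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from urllib.parse import urlparse

# Assumption: these are the protocols the warehouse would accept for bulk transfer.
ALLOWED_SCHEMES = {"ftp", "http", "https"}


@dataclass
class DataReference:
    """A small message that points to a large data set instead of containing it."""
    url: str          # where the payload can be fetched
    data_format: str  # e.g. "netCDF"; the payload keeps its original format
    size_bytes: int
    sha256: str       # lets the receiver check the transfer


def make_reference(url, data_format, payload):
    """Describe a payload (bytes) that has been made available at the given URL."""
    return DataReference(url, data_format, len(payload), hashlib.sha256(payload).hexdigest())


def to_message(ref):
    """Serialize the reference; JSON is used here only for brevity."""
    return json.dumps(asdict(ref))


def from_message(message):
    """Parse a reference and reject URLs with an unexpected scheme."""
    ref = DataReference(**json.loads(message))
    if urlparse(ref.url).scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported transport scheme in {ref.url!r}")
    return ref


def verify(ref, payload):
    """Check that fetched bytes match the size and checksum in the reference."""
    return (
        len(payload) == ref.size_bytes
        and hashlib.sha256(payload).hexdigest() == ref.sha256
    )
```

The actual download is left out on purpose; the receiver would fetch `ref.url` with whatever client suits the scheme and then call `verify` on the result.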
Using a tool like BioMOBY does not solve the problem of having to define intermediate data formats! In particular, when data is transported by reference, it stays in the format in which it was stored. A data format therefore still has to be chosen for each of the steps in the analysis.
Raw data storage
Raw data storage involves many different, often proprietary, formats. The original files will need to be stored for future reference. It is, however, desirable to also have a common format for raw data. This would allow uniform development and implementation of tools to analyse the raw data. NetCDF seems to be a commonly used format.
(This is about LCMS and GCMS, needs expansion)
After deconvolution, a list of identified peaks, each with an m/z and a retention time, is returned. This list is much more condensed than the original data. Many formats exist. A suitable format must be able to store peak information, multiple experiments, and metadata on the experiments and on the peak identification. A possible format is CMLSpect.
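The requirements above (peak information, multiple experiments, and metadata) can be illustrated with a small in-memory model. This is not CMLSpect and not a proposed file format: the class names, the `intensity` field, the choice of seconds for retention time and the tolerance defaults are all assumptions made for the sake of the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Peak:
    mz: float              # mass-to-charge ratio
    retention_time: float  # assumed to be in seconds
    intensity: float       # assumed extra field, not named in the text


@dataclass
class ExperimentPeaks:
    """Peaks of one experiment plus metadata on the experiment and peak picking."""
    experiment_id: str
    metadata: dict
    peaks: list = field(default_factory=list)


@dataclass
class PeakCollection:
    """Holds the peak lists of multiple experiments."""
    experiments: dict = field(default_factory=dict)

    def add(self, experiment):
        self.experiments[experiment.experiment_id] = experiment

    def find(self, mz, retention_time, mz_tol=0.01, rt_tol=5.0):
        """Return (experiment_id, peak) pairs within the given tolerances."""
        hits = []
        for exp in self.experiments.values():
            for peak in exp.peaks:
                if (
                    abs(peak.mz - mz) <= mz_tol
                    and abs(peak.retention_time - retention_time) <= rt_tol
                ):
                    hits.append((exp.experiment_id, peak))
        return hits
```

A lookup such as `collection.find(180.06, 312.0)` would then return matching peaks from every experiment in the collection, which is the kind of query a visualization or statistics tool might need.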
If all this data is stored in a single format, that format can then serve as an input to visualization, statistical analysis and many other tools required.