Proteomics:Data exchange format

Corra workflow
NBIC IBM workflow

This page collects information to support the definition of a format for data exchange between the different modules of the new NBIC proteomics pipeline (to be developed).

Existing data formats:

New standard formats are defined by HUPO PSI (Proteomics Standards Initiative) [1].

mzML: standard format describing (LC)MS/MS spectra data, independent of the MS machine format. Description can be found here [2].

mzIdentML: exchange format for peptides and proteins identified from mass spectra. Description can be found here [3].

At the ISB/SPC the following open formats were created for sharing and processing proteomics data. These formats are used in the TPP (Trans-Proteomic Pipeline) [4].

mzXML: mzXML is used for storing spectra-level data.[5]

mzdata.xml: another standard format, more recent than mzXML, describing (LC)MS/MS spectra data independent of the MS machine format. Description can be found here [6].

pepXML: pepXML is used for storing information about peptide sequence assignment to MS/MS spectra. [7]

protXML: protXML contains proteins and their statistics, as a result of processing peptide level data. [8]
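For a pipeline module it helps to know which formats carry which kind of data. The sketch below simply restates the descriptions on this page as a small lookup table; the names `ExchangeFormat`, `FORMATS` and `formats_for` and the level labels "spectra", "peptide" and "protein" are made up for this illustration and are not part of any of the formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExchangeFormat:
    name: str
    levels: tuple  # kinds of data carried, using labels chosen for this sketch


# Contents taken from the format descriptions on this page.
FORMATS = [
    ExchangeFormat("mzML", ("spectra",)),
    ExchangeFormat("mzIdentML", ("peptide", "protein")),
    ExchangeFormat("mzXML", ("spectra",)),
    ExchangeFormat("mzdata.xml", ("spectra",)),
    ExchangeFormat("pepXML", ("peptide",)),
    ExchangeFormat("protXML", ("protein",)),
]


def formats_for(level):
    """Return the names of all listed formats that carry the given data level."""
    return [f.name for f in FORMATS if level in f.levels]
```

For example, `formats_for("spectra")` returns the three spectra-level formats, which are the candidates for the input side of a peak detection module.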

APML (Annotated Putative peptide Markup Language) was developed at ISB as the data exchange format in the Corra framework for integrated LC-MS-based quantitative proteomic data analysis. The current distribution of Corra contains APML-adapted open source versions of SpecArray and SuperHirn [9]. Corra and its biological application were published in BMC Bioinformatics 2008, 9:542, accessible at [10]. There is an active discussion group for Corra users at [11].

Example APML files (sample_alignment.apml, sample_alignment_large.apml, sample_peak_lists.apml) can be found at [12], along with the APML schema .xsd files and HTML documentation.
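A quick way to get a first impression of the structure of one of the example files is to count which XML elements occur in it. The following is a minimal sketch using only the Python standard library; it makes no assumptions about the APML schema and works on any XML file. The function names `local_name` and `tag_inventory` are chosen for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip a namespace prefix of the form '{uri}name' from an element tag."""
    return tag.rsplit("}", 1)[-1]


def tag_inventory(source):
    """Count element names in an XML document.

    `source` is a file name or a binary file object. The document is
    streamed, so large files do not have to fit in memory as one tree.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[local_name(elem.tag)] += 1
        elem.clear()  # drop children and text that were already processed
    return counts
```

Calling `tag_inventory("sample_peak_lists.apml").most_common()` on a downloaded example file would list the element names with their counts, most frequent first, which can then be compared against the .xsd schema.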

According to a forum message (12 Feb 2010), the Corra developers are also at an early stage of evaluating OpenMS. After evaluating its performance on their dataset, they will consider plugging OpenMS into Corra.

And they say: "if you write code for converting OpenMS output to APML, we would highly appreciate if you could contribute the code for future Corra release".

So the developers are really open to new implementations.

As for the set-up of the NBIC Proteomics workflow, I have studied the APML format. Although we have noticed some limitations of the current APML v2 format, I would propose to start off by investigating the possibility of implementing the Corra framework on DAF. Corra has been designed to be used on a computing cluster. It still needs to be checked and tested which additions or modifications are required to implement it on the GRID.

Also, as described in their paper, Corra is designed to be extended with additional modules that complement the existing modules for peak detection, alignment, statistics and viewing results.

I think it will be valuable to implement and test Corra, in order to gain experience with implementing a complete pipeline. Then we can complement the existing framework with our own modules, and possibly also improve the data structure, arriving at a new version of APML (let's say DPML, for Dutch Proteome Markup Language ;-])

Also, I think it is better to join an existing project of this size and origin (ISB Seattle and ETH Zürich), instead of reinventing the wheel once again and in the end trying to compete in getting it published. Collaboration will provide a better drive, I guess.

For your information, I have made a workflow scheme of Corra. In addition, I have made a similar scheme of a proposed workflow to be developed with our own tools (as far as I know them so far: Groningen & Wageningen, so more input is requested here).

I made the diagrams with Creately. They are accessible via the following links: Corra [13] and NBIC [14], and are attached as PNG images.

I will also try to put the workflow scheme and the APML screenshots (in PowerPoint) on the Wiki.