Proteomics:Data Format Meeting on 2010-02-19
Date: 19 February 2010
Location: University of Groningen, Analytical Biochemistry, Antonius Deusinglaan 1, Groningen.
What is the definition of a peak list? For Richard, it should contain the extracted ion chromatograms of the peaks. Richard's peak lists contain the raw data for each peak (its extracted ion chromatogram). He uses centroided data and would like to add profile data. He has an index table for quick access to the extracted ion chromatograms of the peaks.
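The idea of a peak list that carries extracted ion chromatograms and an index table can be sketched as follows. This is only an illustration of the concept discussed at the meeting, not Richard's actual implementation; the class names, field names and units (`Peak`, `PeakList`, retention time in minutes) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Peak:
    """One quantified peak plus its extracted ion chromatogram (XIC)."""
    peak_id: int
    mz: float          # centroid m/z of the peak
    rt: float          # retention time at the apex (assumed to be in minutes)
    intensity: float   # apex intensity
    # XIC stored as (retention time, intensity) pairs
    xic: List[Tuple[float, float]] = field(default_factory=list)


class PeakList:
    """Peak list with an index table for direct lookup of a peak by its id."""

    def __init__(self) -> None:
        self._peaks: List[Peak] = []
        self._index: Dict[int, int] = {}  # peak_id -> position in _peaks

    def add(self, peak: Peak) -> None:
        # Record the position before appending so the index stays in sync.
        self._index[peak.peak_id] = len(self._peaks)
        self._peaks.append(peak)

    def xic(self, peak_id: int) -> List[Tuple[float, float]]:
        # Lookup through the index table avoids scanning the whole list.
        return self._peaks[self._index[peak_id]].xic
```

In a file-based format the index table would more likely map a peak id to a byte offset than to a list position, but the access pattern is the same.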
Richard suggested using Corra as a starting point, as it is an open source program and has nearly all the components that the platform needs. Corra integrates two open source quantitative data processing programs, SpecArray and SuperHirn, with subsequent Bioconductor-based statistical analysis modules implemented in R. Corra provides a user-friendly web interface and executes the integrated modules on a local cluster.

We agreed to use APML as the format for peak lists and aligned peak matrices, and to convert the formats of all other tools to it. We should develop the APML format further so that it covers all the properties we need, preferably in collaboration with the authors of the format. We should also convert the output format of all tools integrated in DAF into APML. For that we need to delegate the development task to a task force that is able to write parsers and converters between the different output formats of the integrated tools and APML. This task force will also be responsible for investigating how APML can be extended and developed further, e.g. by adding the extracted ion chromatograms of peaks as used by Richard. George and Ishtiaq should also investigate the Corra project and its source code to see whether parts or modules can be used in DAF, e.g. for workflow execution or for running and monitoring jobs on local clusters.
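One possible way to organise the parsers and converters is a small registry in which each integrated tool contributes one parser that produces a common in-memory representation; a single writer would then serialise that representation to APML. The sketch below shows only the parser side. The `Feature` class, the `example_tsv` tool name and its three-column tab-separated layout are hypothetical and do not describe the real output of any tool mentioned here, and no APML element names are assumed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Feature:
    """Tool-neutral representation of one quantified peak."""
    mz: float
    rt: float
    intensity: float


Parser = Callable[[Iterable[str]], List[Feature]]
PARSERS: Dict[str, Parser] = {}  # tool name -> parser function


def register(tool_name: str) -> Callable[[Parser], Parser]:
    """Decorator that adds a parser to the registry under a tool name."""
    def decorator(func: Parser) -> Parser:
        PARSERS[tool_name] = func
        return func
    return decorator


@register("example_tsv")  # hypothetical tool, used for illustration only
def parse_example_tsv(lines: Iterable[str]) -> List[Feature]:
    """Parse lines of the form: m/z <TAB> retention time <TAB> intensity."""
    features = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        mz, rt, intensity = line.split("\t")
        features.append(Feature(float(mz), float(rt), float(intensity)))
    return features


def load_features(tool_name: str, lines: Iterable[str]) -> List[Feature]:
    """Dispatch to the parser registered for the given tool."""
    try:
        parser = PARSERS[tool_name]
    except KeyError:
        raise ValueError(f"no parser registered for {tool_name!r}") from None
    return parser(lines)
```

With this layout, supporting a new tool means writing one parser function; the APML writer and any later extension of the format would only need to change in one place.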
We have designed the first part of the generic proteomics workflow with concrete tools (see the chart in Figure 1). This workflow will only process MS/MS raw data acquired with data-dependent acquisition, with the goal of providing a list of peptide/protein quantities in different samples. This will be the first pipeline, which can later be extended to a pipeline where the MS/MS data are in separate files from the single-stage MS data used for quantification.
For that we have the TAPP workflow with 4 modules, and the open source OMSSA as the identification program. We need to write a module that integrates the identification with the quantification information. Twan's programmer has experience with that and has a similar module for integrating MSE experiments. A task force is required to make a module that parses the output of the TAPP workflow into APML. Péter has to provide, on the FTP server, a description of the tools together with the input and output files and parameters for the 4 modules of the TAPP workflow, without providing the tools themselves (based on the agreement with IBM). Twan will make a more detailed scheme of the first and second workflows that we intend to implement.
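The module that integrates identification with quantification essentially has to assign each identified MS/MS spectrum to a quantified MS1 peak. A minimal sketch of one common approach, matching on precursor m/z and retention time within tolerances, is given below. It is not the method used by Twan's group; the class names, the tolerance defaults (10 ppm, 0.5 min) and the tie-breaking rule are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Identification:
    """A peptide identification, e.g. as reported by a search engine."""
    peptide: str
    precursor_mz: float
    rt: float  # retention time of the MS/MS scan (assumed minutes)


@dataclass
class QuantPeak:
    """A quantified MS1 peak from the quantification workflow."""
    peak_id: int
    mz: float
    rt: float
    intensity: float


def match(
    ids: List[Identification],
    peaks: List[QuantPeak],
    ppm_tol: float = 10.0,  # assumed m/z tolerance in ppm
    rt_tol: float = 0.5,    # assumed retention time tolerance in minutes
) -> List[Tuple[Identification, Optional[QuantPeak]]]:
    """Pair each identification with the closest peak in retention time
    among the peaks that fall within both tolerances (None if no peak fits)."""
    annotated = []
    for ident in ids:
        best: Optional[QuantPeak] = None
        best_rt_diff = 0.0
        for peak in peaks:
            ppm = abs(peak.mz - ident.precursor_mz) / ident.precursor_mz * 1e6
            rt_diff = abs(peak.rt - ident.rt)
            if ppm <= ppm_tol and rt_diff <= rt_tol:
                if best is None or rt_diff < best_rt_diff:
                    best, best_rt_diff = peak, rt_diff
        annotated.append((ident, best))
    return annotated
```

The nested loop is quadratic in the number of peaks and identifications; for real data sets the peaks would typically be sorted by m/z first so that candidates can be found by binary search.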
Figure 1: Schematic representation of the modules of the first workflow, which provides an annotated quantitative peak matrix in APML from MS/MS data acquired in data-dependent mode.
- Twan: provide a detailed scheme of the first quantitative workflow with dedicated programs
- Péter: provide the input, output and parameters, using a test file, for the 4 modules of the TAPP pipeline
- Twan and Joos have experience in matching peptide and protein identifications to MS1 quantitative data. They will develop a module for matching OMSSA results with the TAPP quantitative output
1. Brusniak, M. Y.; Bodenmiller, B.; Campbell, D.; Cooke, K.; Eddes, J.; Garbutt, A.; Lau, H.; Letarte, S.; Mueller, L. N.; Sharma, V.; Vitek, O.; Zhang, N.; Aebersold, R.; Watts, J. D. Corra: Computational framework and tools for LC-MS discovery and targeted mass spectrometry-based proteomics. BMC Bioinformatics 2008, 9, 542.