NMC DSP Requirements

Versioning

17-8-2010 Version by M. Jansen (SARA)

These are long-term requirements. For short-term requirements, see NMC_DSP_short_term_requirements.

Introduction

The Netherlands Metabolomics Centre (NMC) will be an organization for collaboration, in which the existing knowledge and expertise of NGI consortia, industry, research institutes, and individual research groups in the medical, plant science, and microbial fields are combined in a common strategy for future technology development and application. In this coordinated effort, the NMC will focus on the development of standardised and validated technological tools and instruments that can be applied in new strategies by Dutch industry, academia, and clinical centres.

Metabolomics is a complex research field involving expensive machinery, the processing of huge amounts of data, complex statistical analyses, and collaboration between research groups. At the time of writing, nearly all metabolomics research is conducted by means of spreadsheets and stand-alone software tools.

The support platform that will be developed will provide a means of communication between the partners of the NMC for the exchange of software, tools and expertise. Moreover, it will enable researchers to collaborate in joint research projects, making it essential that the support platform includes a flexible database or data warehouse infrastructure for experimental metabolomics data and provides an instrument for data transport. This also implies that the platform will provide web services, support software development, and set standards.

Aim and Scope

Aim

The Metabolomics Data Support Platform has two objectives:

  1. It is a means to support research projects within the NMC, by providing a data storage, retrieval and query infrastructure, and a data processing infrastructure.
  2. It is a means to make software and other tools or protocols developed within the NMC available to the outside (metabolomics) world. Providing one portal where these tools can be accessed makes the efforts of the NMC more visible.

In slightly more detail, the aim of the Metabolomics Data Support Platform is to:

  • Create an infrastructure that will facilitate and standardize metabolomics research, from biological question to clean-data interpretation
  • Provide a metabolomics data warehouse: a large repository of valuable metabolomics data which can (eventually) be researched as a whole, including analytical (raw) data, processed data (e.g. metabolite concentrations), study design and other metadata
  • Connect to a study capture tool in order to graphically define a study and its experiments in time
  • Store metabolomics data centrally and securely
  • Provide a metabolomics data processing tool chain which allows the user to process raw data into 'clean' data in an easy and configurable way
  • Offer interoperability with external systems (e.g. ISA-TAB output to integrate with dbNP; see the sketch after this list) and adopt the terminology and standards set by important initiatives such as the Metabolomics Standards Initiative (MSI), Architecture for Metabolomics (ArMet) and others
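
To make the interoperability aim concrete, the Python sketch below writes a study sample table in an ISA-TAB-like, tab-separated layout. The column names and sample records are hypothetical and chosen only to show the shape of such an export; a real export must follow the ISA-TAB specification.

  import csv

  # Minimal sketch: write a study sample table in an ISA-TAB-like,
  # tab-separated layout. Column names and sample records are
  # hypothetical; a real export must follow the ISA-TAB specification.
  samples = [
      {"Source Name": "subject-01", "Sample Name": "S01-T0"},
      {"Source Name": "subject-01", "Sample Name": "S01-T1"},
  ]

  with open("s_study.txt", "w", newline="") as handle:
      writer = csv.DictWriter(handle, fieldnames=list(samples[0]), delimiter="\t")
      writer.writeheader()
      writer.writerows(samples)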

Scope

It may happen that questions raised are not answered by this functional specification, or that changes to the functional specification are required. If so, all parties that signed off (or have to sign off) must be involved in the change request and need to sign off again on the new specifications. The change procedure should always be managed by the project lead (Margriet Hendriks), and the project lead should always discuss the proposed change with all involved parties. The project lead is also responsible for maintaining this functional specification document. When agreed upon, the change can be introduced in the functional requirements document, which would require the technical specification to be updated accordingly (see below).

This functional specification document is the input for the technical specification document. While this document describes in detail how the DSP should work, who does what, and what is to be expected from the system, the technical specification describes the technical details of how this should be accomplished (underlying database software, programming language, platform, cron jobs, SOAP requests, OO model, web servers, load balancers, storage units, grid techniques, etcetera).

This functional specification document (and hence the scope of the Metabolomics Data Support Platform in development) is signed off and agreed upon by the following parties:

Organization              Name                  Role
UMC                       Margriet Hendriks     Project lead
TNO                       Jildau Bouwman        PI
University of Leiden      Theo Reijmers         PI
PRI Wageningen            Roeland van Ham       PI
University of Amsterdam   Gooitzen Zwanenburg   PI
SARA                      Machiel Jansen        Technical lead

What it will not be

As stated above, the data support platform will facilitate and standardize metabolomics research from biological question to clean data. This means the platform will only support biological studies; it is not a tool for developing technologies. Dummy studies (containing dummy, fake or empty data) intended to circumvent this limitation are not allowed in the central data warehouse. The data should always have a context, in order to be able to perform system-wide studies on public (i.e. published) data. A local installation at the DCL may be considered for technology optimization purposes, but this is not a focus of the DSP.

Context

The context diagram below shows the interaction of the system with the main external actors. The system can interact with external tools such as web services or external data storage. It will also be connected to external metabolite databases, as well as to the Spectral Tree database that is being developed in Leiden. Project leaders will submit research data on behalf of researchers and will query the system for relevant data.

[Context diagram: ContextNMC.jpg]

Technicians will upload extra information on samples, and possibly data after a processing step. Technicians also need study data, in order to download information on the relevant samples which have to be processed in the laboratory. These users may also communicate directly with external tools, but this is not shown in the diagram; only communication with the DSP is relevant here.


NMC-DSP Components

The NMC DSP consists of two generic components: the datawarehouse and the toolsuite. The first captures study data and stores the data related to a metabolomics study. The toolsuite contains tools and offers the user the possibility to process data using pipelines, tools and external databases.

The datawarehouse consists of study data, storage for measurement designs, and a store for processed data. Raw data is not stored directly; instead, the datawarehouse can contain pointers to external storage where the raw data resides. Most data files will be linked to samples. The datawarehouse will facilitate the easy uploading of data files for individual samples.

Datawarehouse

The components of the datawarehouse are described below in more detail.

Study data

Store study (meta) information. To be able to interpret the measurement, intermediate or clean data, you need the meta information of a sample and the related study. The sample meta information is stored in the study capturing part of the NMC-DSP. Features of the study capturing are:

  • Study (owner, name, description, start/end dates, etc)
  • Assay (name, description, etc)
  • Sample (name, reference, customer reference, volume, weight etc)

This part is taken care of by the General Study Capturing Framework (GSCF). It contains the biological study information (Study, Assay, Samples). This part is not metabolomics specific and its requirements are not described here.
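
For orientation only, below is a minimal Python sketch of the Study/Assay/Sample hierarchy listed above. The actual model is defined by the GSCF, so any field not named in the list above is a hypothetical placeholder.

  from dataclasses import dataclass, field
  from typing import List

  # Sketch of the study hierarchy; the GSCF defines the real model,
  # so these fields are illustrative only.
  @dataclass
  class Sample:
      name: str
      reference: str
      volume: float  # one of the sample attributes listed above

  @dataclass
  class Assay:
      name: str
      description: str
      samples: List[Sample] = field(default_factory=list)

  @dataclass
  class Study:
      owner: str
      name: str
      description: str
      assays: List[Assay] = field(default_factory=list)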

General data store

Store measurement, intermediate or clean data, grouped by sample. The data warehouse stores links/references to the data, not the actual data. The actual data should be stored in institute-specific data repositories such as an FTP or HTTP(S) server.

This component should facilitate the easy sharing of data between research groups, and protect the data against unwanted access.
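
A minimal Python sketch of the reference-not-copy design described above: the warehouse keeps a pointer (repository plus path) per data file, while the bytes themselves stay on an institute's FTP or HTTP(S) server. All names here are hypothetical.

  from dataclasses import dataclass

  # Sketch: the warehouse stores where a file lives, not the file itself.
  @dataclass
  class Repository:
      name: str      # e.g. "institute-ftp" (hypothetical)
      base_url: str  # e.g. "ftp://data.example.org/metabolomics"

  @dataclass
  class DataFileReference:
      sample_id: str          # links the file to a sample in the warehouse
      repository: Repository
      path: str               # location relative to the repository base URL

      def url(self) -> str:
          return self.repository.base_url.rstrip("/") + "/" + self.path.lstrip("/")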

Features of the data warehouse are:

  • Repository manager: define external repositories such as an FTP or HTTP(S) server
  • Workspace manager: set up and access workspaces
  • Share data files between users via interactive access lists
  • Add data files to sample data

Measurement design

Measurement designs are created prior to the measurements of samples. The datawarehouse facilitates storing measurement designs and linking the resulting measurements to samples. Peak tables will be stored and linked to samples.

How data is measured influences how it should be interpreted. For example, knowing the order and repetition of sample measurements allows you to correct measurement errors (see the sketch after this list). Features of the data measuring are:

  • Create sample batches (name, reference, etc.)
  • Measurement design: create and edit study- or assay-specific measurement designs by combining sample batches
  • Link measurements to sample information
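
To make the order-and-repetition point concrete, the Python sketch below records a hypothetical injection order per sample batch; this is exactly the kind of design information a later correction step would need. All batch and sample names are illustrative.

  # Sketch: a measurement design as an explicit injection order per batch.
  # Knowing this order is what makes correcting measurement errors possible.
  batches = {
      "batch-1": ["QC", "S01", "S02", "QC", "S03", "S04", "QC"],
      "batch-2": ["QC", "S05", "S06", "QC", "S07", "S08", "QC"],
  }

  # Link each measurement back to its sample, with its position in the run.
  design = [
      {"batch": batch, "injection_order": i + 1, "sample": sample}
      for batch, order in batches.items()
      for i, sample in enumerate(order)
  ]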

Workspaces

Workspaces are described as mockups on the NMC DSP Workspaces page.

Authentication, authorization and accounting (AAA)

Toolsuite

The toolsuite contains a number of tools which are part of the DSP. Apart from that, it also offers access to functionality which is external to the DSP. It is composed of two parts: a pipeline editor, or simple workflow engine, which allows for the creation of linear workflows, called pipelines; and access to external tools.

Pipeline creation and execution

Features of the pipeline editor are:

  • Tool overview including tool documentation
  • Create and tweak pipelines: a concatenation of two or more tools that behaves as if it were one tool
  • Run tools/pipelines and monitor progress

For example, think of tools used in the pre-processing phase of a study, like file/data conversion tools, noise filters and image generation.
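
As a sketch of such a linear pipeline, the Python snippet below chains command-line tools by feeding the output file of each step to the next. The tool names and flags are placeholders, not actual DSP tools.

  import subprocess

  # Sketch: a linear pipeline as an ordered list of command templates;
  # "{input}" and "{output}" are filled in per step. Tool names and
  # flags are placeholders.
  pipeline = [
      "convert-tool --in {input} --out {output}",
      "noise-filter --in {input} --out {output}",
  ]

  def run_pipeline(steps, first_input):
      current = first_input
      for i, template in enumerate(steps):
          out = "step-%d.out" % i
          cmd = template.format(input=current, output=out)
          subprocess.run(cmd.split(), check=True)  # abort if a tool fails
          current = out  # the output of this step feeds the next step
      return current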


External tools

The DSP will also facilitate access to functionality that is external to the platform, such as web services and external metabolite databases.

User Scenario: Adding peak information to samples

We describe a user scenario for adding peak information to samples. We assume the sample data and information about treatment is stored in the GSCF and accessible by the NMC DSP.

The following steps should be taken:

Register study

The biologist sets up the study by registering it in the GSCF application. Based on the experimental design, the subjects are treated. Before, during and after the treatment, samples are taken at several moments in time. After sampling is done and all sample metadata is registered in the GSCF, the biologist contacts the head of the DCL: can you analyse my samples? Together with the head of the DCL, the biologist then decides how the samples are distributed and analysed.

Define measurement design/strategy

Based on the sample type, compound(s) of interest, measurement platform and other additional/required meta information, the research scientist at the lab works out the best possible way to analyse the samples (the measurement design). When the biologist, the head of the DCL and the research scientist agree on the measurement design, a schedule is made with the technician. When needed, the measurements are split into several batches.

Preparing the samples for analysis

Before the samples can be injected into the machine, the technician prepares them for analysis (sample prep). Based on a protocol, samples (aliquots) are created. In addition to the "real" samples, other samples such as QCs, blanks, etc. are added to the set of samples to be analysed.

Analyzing the samples

When the samples are prepped, the technician puts them in a sample tray. The position in the tray is determined by the technician, often based on the measurement design. The technician then registers the samples in the machine before any measurements are done. The sample tray is then put in the auto-sampler and the samples are injected into the machine one by one to be analyzed. Based on the information provided by the technician, sample data is stored on the acquisition machine. Sample data at the DCL is stored on a network drive and automatically backed up every 24 hours.

Export the data to Excel

When all samples from a tray have been analyzed, the technician checks the data for "obvious" errors that could have occurred during the measurements. Where possible, corrections are made to the data (peak picking). All corrections are stored as modifications on the actual acquired data; no data is modified or deleted. When this is done, the technician exports the data to Excel in a (semi-)standardized way. Depending on the platform/machine vendor, this Excel file contains one row per measurement, with a sample reference, some sample metadata and, per compound of interest, the retention time and peak intensity/area. Before this data is communicated back to the biologist, a quality control is done by a dedicated employee of the lab.
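
A hedged Python sketch of reading such an export, assuming it has been saved as a tab-separated text file; the column names below are hypothetical, since the actual layout is vendor-specific as noted above.

  import csv

  # Sketch: read one row per measurement from an export saved as
  # tab-separated text. Column names are hypothetical; the real
  # layout depends on the platform/machine vendor.
  with open("export.txt", newline="") as handle:
      for row in csv.DictReader(handle, delimiter="\t"):
          sample_ref = row["sample_reference"]
          retention_time = float(row["retention_time"])
          peak_area = float(row["peak_area"])
          # ...here the row would be mapped onto a sample in the DSP...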

Upload Export to DSP

The lab is now finished and the data should be reported back to the biologist. The technician goes to the data upload section within the study and uploads the exported file to the DSP via a form. When an importer is available, the technician can import the measurements into the DSP.

Retrieve measurements and meta information

When the measurements are imported into the DSP, the biologist can use the filter and search options of the DSP to retrieve a subset of the data. This could be based on, for example: retention time, peak area/intensity, sample type, acquisition time or other sample metadata.
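
A minimal Python sketch of such a filter, over hypothetical imported records: select measurements within a retention-time window and above a peak-area threshold. All field names and values are illustrative.

  # Sketch: filter imported measurements; field names and values
  # are hypothetical.
  measurements = [
      {"sample": "S01", "retention_time": 4.2, "peak_area": 1.5e6},
      {"sample": "S02", "retention_time": 9.8, "peak_area": 3.2e5},
  ]

  subset = [
      m for m in measurements
      if 4.0 <= m["retention_time"] <= 6.0 and m["peak_area"] >= 1.0e6
  ]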

Spectral Trees

The spectral trees requirements are put forward by the group in Leiden. People from Wageningen and TNO are also interested in this functionality.

See Requirements_DSP_Spectral_trees.

Security

See NMC_security_requirements.

The NMC Toolbox

The toolbox is the component of the NMC DSP which will contain a number of tools used by metabolomics researchers. These will support the metabolomics research workflow, from data (pre)processing and statistical analysis to the identification of metabolites.

Tools should function independently of each other, but it should be possible to string them together in a pipeline. These pipelines (linear workflows) can then be stored by users and shared with other members of the NMC. The tools themselves will be accessible over the internet by all users (possibly including users not related to the NMC). This means that tools should offer a SOAP, REST or web-based interface. This proves to be difficult for some tools.
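
As a sketch of the REST option named above, the Python snippet below puts a minimal HTTP endpoint in front of a placeholder tool list using the standard library; the path, port and payload are hypothetical, not a real DSP interface.

  import json
  from http.server import BaseHTTPRequestHandler, HTTPServer

  # Sketch: a minimal REST-style endpoint listing available tools.
  # Path, port and tool names are placeholders.
  class ToolHandler(BaseHTTPRequestHandler):
      def do_GET(self):
          if self.path == "/tools":
              body = json.dumps(["convert-tool", "noise-filter"]).encode()
              self.send_response(200)
              self.send_header("Content-Type", "application/json")
              self.end_headers()
              self.wfile.write(body)
          else:
              self.send_response(404)
              self.end_headers()

  if __name__ == "__main__":
      HTTPServer(("localhost", 8000), ToolHandler).serve_forever()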

The possibility of stringing together command line tools in linear pipelines has already been prototyped.

The main question here is which tools users really want to use in this way.

TNO Deco

TNO Deco was one of the first candidates to be incorporated in the toolbox of the NMC DSP. However, TNO Deco is written in Matlab, is closed source, and its graphical user interface is not separated from its functional parts. Developing a web-based client for TNO Deco is therefore very difficult. In addition, we found that without serious refactoring TNO Deco will be difficult to maintain.

TNO will make a decision on whether to open the source code of TNO Deco.

mzMatch

mzMatch is a collection of small Java command-line tools specific to metabolomics MS data analysis. The tools are built on top of the PeakML core library, which provides mass spectrometry specific functionality and access to the PeakML file format. It was developed by Richard Scheltema in Groningen.

This tool was tested and incorporated in the first prototype of the Toolbox. The different components can be selected, executed and stored in a pipeline.

PIs do not seem very enthusiastic about mzMatch. Its use therefore seems very limited.


Workflow tools PRI Wageningen

Wageningen users (at PRI) most often use MetAlign. This tool is strongly tied to the MaxMax software and is also closed source. Negotiations with the Wageningen group that owns MetAlign have been difficult. In addition, Wageningen users make use of XCMS, METOT and MSClust. This workflow, however, depends on an internally developed MetAlign-based file format. As far as we know, these tools are not used outside Wageningen.


Processing tool   Major function
MetAlign          Analysis, alignment and comparison of mass spectrometry datasets.
XCMS              Removal of experimental artifacts, cleaning up the data for further analysis, and peak alignment.
METOT             Pre-processing of MetAlign outputs: removing irrelevant noisy information and formatting the data for the next step.
MSClust           Deriving compound mass spectra by ion clustering; data reduction from a few thousand ions to a few hundred metabolites.