From BioAssist
Revision as of 16:15, 7 December 2009 by Keesvb (Talk | contribs)

dbNP Features

The end goal of dbNP is to capture and query all types of biological and omics data that are relevant for nutritional studies. This implies three distinct features: capturing study design information, capturing the different types of omics data, and providing an intuitive query mechanism that can answer the questions biological researchers commonly have. The following serves as a versioned document describing the functional specifications for the dbNP software deliverable (an implementation of which has been started in the open source project Generic_Study_Capture_Framework). A rough overview of the resulting features from a developer perspective can be found at DbNPFeaturesImplementation.

Capturing study metadata

There are many different paradigms for describing study metadata (study subjects, events, study design information etc.). For some studies, the study design is only described in Ethical Committee forms and in the Methods section of a published paper. Other studies have extensive descriptions in a standardized format such as MAGE-TAB or ISA-TAB (see DbNPInspiration). dbNP needs a study design module that can handle all these differences by using a study metadata model that is flexible and user-adjustable from the very start. We use templating to accomplish this: studies that have the same focus share the same metadata template. The end user interface reflects this: when describing e.g. a mammalian nutrigenomics study, the user is presented with the right data model to describe such a study. Template administration (e.g. adding fields) can also be done by the user. In fact, this is a cornerstone of the dbNP philosophy, since it is not possible to describe beforehand the specifics of all types of study metadata the end user may want to store.

Storage of study metadata entities

For each study, dbNP should store information about the following (metadata) entities:

  • study (study itself, including the used template)
  • subjects
  • groups
  • protocols
  • events
  • samples
  • assays

The requirements for each of these entities are described in the following paragraphs. See DbNPFeaturesTemplates for a tabular overview of entities, fields and templates.

Study research area

dbNP should employ templates to store metadata about different kinds of biological studies. A template can be defined as a set of information fields for each of the entities in the previous paragraph. 'Study research area' is the term used in the user interface for template. Templates are used in dbNP in the following ways:

  • When entering a new study, the user is first presented with the choice of a particular research area, and after that with a wizard that helps the user enter study metadata according to this template (see User Interface to create and edit study information)
  • The user should be able to create and modify templates, possibly deriving from existing templates (see User Interface to create and edit templates)

Some templates should be pre-defined in dbNP. Those are the templates for:

  • mammalian studies
  • studies about cell cultures
  • studies with micro-organisms
  • studies with plants

See DbNPFeaturesTemplates for the exact content of the templates.
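
The template concept described above (a set of information fields for each entity, with the option to derive a new template from an existing one) could be sketched as follows. This is only an illustration: the class names, the `derive` method and the example field names are hypothetical and are not taken from DbNPFeaturesTemplates.

```python
from dataclasses import dataclass, field

# The metadata entities listed in this specification.
ENTITIES = ("study", "subjects", "groups", "protocols", "events", "samples", "assays")


@dataclass(frozen=True)
class TemplateField:
    """One extra information field (hypothetical structure)."""
    name: str
    mandatory: bool = False


@dataclass
class Template:
    """A named set of extra fields per entity; the set may be empty."""
    name: str
    fields: dict = field(default_factory=dict)  # entity -> list of TemplateField

    def fields_for(self, entity):
        if entity not in ENTITIES:
            raise ValueError(f"unknown entity: {entity}")
        return list(self.fields.get(entity, []))

    def derive(self, name, extra):
        """Return a new template that extends this one; the original is unchanged."""
        merged = {entity: list(fs) for entity, fs in self.fields.items()}
        for entity, fs in extra.items():
            if entity not in ENTITIES:
                raise ValueError(f"unknown entity: {entity}")
            merged.setdefault(entity, []).extend(fs)
        return Template(name, merged)
```

A derived template keeps the fields of its parent and adds its own, so a user-defined research area can start from one of the pre-defined templates.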

Study information

General information that should be stored about each study is (all fields are mandatory, unless specified otherwise):

  • Study owner (one user)
  • Study editors (list of users, can be empty)
  • Study research area (which is the current template for this study)
  • Study code (text field describing internal study code, not mandatory)
  • Study research question (text field)
  • Study description (text field)
  • Study start date (date)
  • Study ethical committee code (text field, not mandatory)

Study subjects

General information about the subjects (biological organisms, such as mice, plants, cell cultures etc.) that should be stored for each study:

Study groups

It should be possible to define groups on the study subjects. There can be an arbitrary number of groups, and a group is a set of one or more study subjects. For each group, the following should be defined:

  • The member subjects
  • A human readable identifier (string)

Study protocols

Study protocols are protocols that are followed while the study was performed. This can be anything from a certain type of DNA extraction to a protocol for animal care. The following information should be stored about protocols:

Study events

Study events are time-bound applications of a certain protocol to your subjects, such as treatment with a medicine or a glucose challenge. The following information should be stored about study events:

  • Event name (unique string identifier within the study)
  • Event time (date)
  • Event duration (time)
  • Event classification (optional ontology reference)
  • Event protocol(s) (optional references to event protocols)
  • The subject group or individual subject on which the event occurred.

The taking of a sample is a special case of an event, a sampling event, which links the subject(s) on which it was performed to the resulting samples. In most cases, every subject will have exactly one resulting sample, but sometimes sampling of one subject will result in several samples, e.g. when the sample is further separated into subsamples (such as the separation of RBCs and plasma from blood).
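
The relation between events, sampling events and samples described above could be modelled roughly as below. The class and attribute names are hypothetical; the only point of the sketch is that a sampling event is an event that additionally maps each subject to one or more samples.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List


@dataclass
class Event:
    name: str                 # unique within the study
    time: date
    duration: timedelta
    subjects: List[str]       # subjects (or members of a group) the event applies to


@dataclass
class SamplingEvent(Event):
    # subject name -> names of the samples that resulted from this event
    samples: Dict[str, List[str]] = field(default_factory=dict)

    def add_sample(self, subject, sample_name):
        if subject not in self.subjects:
            raise ValueError(f"{subject} is not a subject of event {self.name}")
        self.samples.setdefault(subject, []).append(sample_name)
```

With this structure, a blood sample that is split into RBCs and plasma simply yields two entries for the same subject.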

Study samples

A sample is a piece of biological material that is extracted from a subject (as described in the corresponding sampling event) and on which assays can be performed. The following should be stored about a study sample:

  • Sample name (unique string identifier within the study)
  • Subject from which the sample was taken
  • Sampling event describing the taking of the sample
  • Biological material type (optional ontology reference)

Study assays

A study assay is an assay that is performed on certain study samples. The following information should be stored about the assays:

  • Sample(s) on which it was performed
  • Assay type (ontology reference)
  • Assay platform (ontology reference)
  • Assay data (optional reference to stored clean transcriptomics or metabolomics data in the omics storage part of dbNP)

Storing omics data

The second main task of dbNP is to store 'clean data'. It is a fairly straightforward task to save arbitrary data files (or even data matrices) and link them to specific assays described in the study design information. However, biologists often wish to ask very specific questions of the stored data. We can only deal with these types of questions when we know exactly what type of data we are storing, and how to do statistics on it. At this moment, we are focusing on storage of transcriptomics and metabolomics data. To be able to compare these data, we need to convert them into 'clean' data that is cross-comparable among different studies. Clean data can be defined as data that is standardized (such that it can be compared with the same type of data from other studies without a need for further statistical correction or normalization) and properly annotated (such that it refers to standardized biological properties, e.g. genes instead of probe descriptors in the case of transcriptomics).

Storing transcriptomics data

For transcriptomics data, we have two main requirements: we have to store the clean transcriptomics data, and it should be possible to run some specific transcriptomics queries via the user interface. Data storage is covered in this section; the interfaces for data upload and for transcriptomics queries are covered in the User Interface section. Data from transcriptomics assays should be linked to the study design via (sub)samples: for each assay that is added, it should be specified on which (sub)sample it was performed. Furthermore, specific transcriptomics information ('technology metadata') and the data itself should be stored.

Transcriptomics assay information

The following information should be stored about uploaded transcriptomics assays:

Transcriptomics raw data storage

Any uploaded raw transcriptomics data (e.g. CEL files) should be stored in a dedicated folder on the file server, named after the experiment identifier. This folder can optionally be served by an external FTP or HTTP webserver later on. For now, the stored raw data is only used for data (re)processing.
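
As a small illustration of the folder convention above, the helper below derives the raw data folder from the experiment identifier. The function name, the root path used in the example and the rule that rejects identifiers containing path separators are assumptions, not part of this specification.

```python
from pathlib import PurePosixPath


def raw_data_dir(raw_root, experiment_id):
    """Return the folder in which raw files of one experiment are stored."""
    # Refuse identifiers that could escape the raw data root (assumed rule).
    if (experiment_id in ("", ".", "..")
            or "/" in experiment_id or "\\" in experiment_id):
        raise ValueError("experiment identifier must be a plain folder name")
    return PurePosixPath(raw_root) / experiment_id
```

For example, `raw_data_dir("/srv/dbnp/raw", "EXP-001")` yields the path `/srv/dbnp/raw/EXP-001`, where `/srv/dbnp/raw` is a made-up location for the shared raw data folder.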

Transcriptomics data processing

The project should have a dedicated GenePattern installation. Furthermore, it should be possible to issue jobs to convert uploaded raw transcriptomics data into clean transcriptomics data, and get the results back into the dbNP database for querying.

GenePattern installation for data conversion

Each dbNP installation should have an accompanying GenePattern installation, which dbNP can access programmatically through the GenePattern SOAP web services interface. dbNP and the GenePattern installation should share two folders on the file server. The first is the folder for raw data, described above. The second is an intermediate folder for storing the results of GenePattern jobs, from which the results are loaded back into dbNP.

Transcriptomics data cleaning

dbNP should provide users with the option of processing uploaded raw transcriptomics data files to clean data via the ExpressionFileCreator module in GenePattern. The output of this module, GCT files, should be stored in a database table specifically designed and indexed to store clean transcriptomics data. It should be possible to track the progress of these jobs (carried out in the background by GenePattern, and also the background job to load the GCT data) in the user interface.
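
The background job that loads GCT output into the database first has to read the file. A minimal sketch of such a reader is given below, assuming the common GCT version 1.2 layout (a `#1.2` version line, a line with row and column counts, a header line starting with Name and Description, then one tab-delimited row per feature). The function name and return structure are hypothetical, and the database insert itself is left out.

```python
def parse_gct(text):
    """Parse GCT v1.2 text into (sample names, {feature name: [values]})."""
    lines = text.splitlines()
    if len(lines) < 3 or lines[0].strip() != "#1.2":
        raise ValueError("not a GCT v1.2 file")
    counts = lines[1].split()
    n_rows, n_cols = int(counts[0]), int(counts[1])
    header = lines[2].split("\t")
    samples = header[2:]            # columns after Name and Description
    if len(samples) != n_cols:
        raise ValueError("sample count does not match the header")
    rows = {}
    for line in lines[3:3 + n_rows]:
        cells = line.split("\t")
        values = cells[2:]
        if len(values) != n_cols:
            raise ValueError(f"wrong number of values for {cells[0]}")
        rows[cells[0]] = [float(v) for v in values]
    if len(rows) != n_rows:
        raise ValueError("row count does not match the header")
    return samples, rows
```

Each (feature, sample, value) combination from the result could then be written as one record to the clean transcriptomics table.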

Transcriptomics data adjustment for querying

For some queries, it is necessary to adjust the data. dbNP should be able to do this for one type of query: meta-analysis of gene expression. When transcriptomics experiments are done in different labs, any unsupervised clustering method will tend to group the experiments by lab rather than by treatment and control. To correct for this so-called 'batch effect', a statistical correction should be carried out before the pooled analysis is done. N.B. Is implementation of this in the scope of dbNP? Is there a GenePattern module to do this?
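
To make the idea of a batch correction concrete, the sketch below applies the simplest possible variant: per-lab mean-centering of the expression values of one gene. This is not a method prescribed by this specification, and it is much cruder than dedicated batch correction methods; the function name and input layout are assumptions.

```python
from collections import defaultdict


def mean_center_by_batch(values, batches):
    """Subtract the per-batch mean from each value.

    values:  expression values of one gene, one per assay
    batches: the lab (batch) label of each assay, in the same order
    """
    if len(values) != len(batches):
        raise ValueError("values and batches must have the same length")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, batch in zip(values, batches):
        sums[batch] += value
        counts[batch] += 1
    means = {batch: sums[batch] / counts[batch] for batch in sums}
    return [value - means[batch] for value, batch in zip(values, batches)]
```

After centering, a systematic offset between two labs no longer dominates the comparison, while differences within each lab (e.g. treatment versus control) are preserved.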

Storing metabolomics data

See Metabolomics Datawarehouse Functional Requirements Inventory for requirements on metabolomics data storage. We work together with the programmers team of the NMC DSP to accomplish metabolomics data storage.

Storing clinical chemistry

TODO: make an inventory of the clinical chemistry data from TNO and NuGO studies that need to be persisted into the database.

User Interface of dbNP

dbNP should have a web frontend that enables users to log in and to access, modify and query the study information in dbNP to which they have access rights. In the start screen of the interface, the user should be provided with the following possibilities:

  • login
  • create an account
  • a full text query for studies in the database that are marked as public
  • informational statistics describing the total number of studies, users and groups in the database

After login, the user should additionally be presented with the following:

  • a list of (the top 10 of) studies that the user recently accessed, with a link to edit and view study information
  • a link to an interface for study browsing
  • a link to the create new study wizard
  • if the user has the Administrator role, a link to an interface for user management
  • if the user has the Template Administrator role, a link to an interface for template management

User management


Users should always have a username and password. They can be assigned to groups or roles by users with the Administrator role.

User groups

User groups should represent the different organizations that submit data to the dbNP installation. In principle, they only have a name, and they are used to assign specific roles (such as viewing private study data) to all group members, which is done by the group administrator (see roles below).

User roles

Roles can be described as permissions: they define which rights a user has, and the software program should take care that only users who have a specific role can perform the tasks which are targeted by this role. Roles can be assigned to both users and groups. Whenever a user is a member of a group, the user automatically also gets the roles that are associated with this group.

Users can have the following roles and tasks:

  • User (the default role): login, query database for public studies, create and change own studies
  • Administrator (system wide): change users, groups and roles
  • Template administrator: change templates

Additionally, the following roles should be generated for each group (called X here):

  • GroupXAdministrator: add/remove persons to group X
  • GroupXStudyAdministrator: is able to change any study that is owned by users in group X

Finally, some roles are defined implicitly via the Study information.

  • A study owner can edit and delete his/her own study
  • Study editors can also edit the studies of which they are editor
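
The role rules above combine three sources of permissions: roles assigned directly to a user, roles inherited from groups, and the implicit owner and editor rights on a study. A sketch of how an edit check could combine them is shown below. The function names and argument layout are hypothetical; only the role name pattern GroupXStudyAdministrator comes from this document, and the system-wide Administrator role is deliberately not given study edit rights here because the text does not state that.

```python
def effective_roles(user_roles, user_groups, group_roles):
    """Roles of a user: own roles plus the roles of every group the user is in."""
    roles = set(user_roles)
    for group in user_groups:
        roles |= set(group_roles.get(group, ()))
    return roles


def can_edit_study(username, roles, study_owner, study_editors, owner_groups):
    """Decide whether a user may edit a study.

    owner_groups: names of the groups the study owner belongs to.
    """
    # Implicit roles defined via the study information.
    if username == study_owner or username in study_editors:
        return True
    # Generated per-group role: GroupXStudyAdministrator.
    return any(f"Group{group}StudyAdministrator" in roles for group in owner_groups)
```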

User Interface to view study information

There are two main user interfaces to view study information: the study browser, which lists studies and links each of them to the second interface, the study overview. Furthermore, there are dedicated views for study groups and sampling events.

Browse studies

The study browser screen should show all studies in a tabular format, with the following fields:

  • study owner (username)
  • study title
  • study description
  • study assays

It should have sorting capabilities, and a smooth paging mechanism (show a limited number of results per page). Furthermore, at the top there should be a simple query box which activates the full text query described below. The study titles should be clickable, and link to the study overview of the selected study.
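
The sorting and paging behaviour of the study browser can be illustrated with a few lines of code. This is a sketch on in-memory records only; in a real installation the sorting and paging would more likely be delegated to the database. The function name, the dictionary keys and the default page size are assumptions.

```python
def page_of(studies, sort_key, page, page_size=10):
    """Return one page of studies, sorted by the given field.

    studies:  list of dicts with keys such as "owner" or "title"
    page:     1-based page number
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    ordered = sorted(studies, key=lambda study: study[sort_key])
    start = (page - 1) * page_size
    return ordered[start:start + page_size]
```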

Study overview

The study overview should give an overview of all information in one study, with edit links for each part. The overview is mainly structured by the main study metadata entities (using e.g. an accordion widget). The different parts of the overview are:

  • Study information
  • Study subjects
  • Study protocols
  • Study timeline, showing the relation of study subjects, groups, and events
  • Study assays, showing study samples, subsamples and assays

The study timeline should display a timeline spanning all event dates in the study. The timeline consists of all groups in the study, with clickable labels on the left of the diagram, or, if there are no groups defined, all subjects in the study. All sampling events should be shown below the timeline (clickable; with a link to the sampling event view of that sampling event). All other events should be displayed above the timeline, and when the mouse is hovered over an event, a 'tooltip' should appear showing the protocol (with web link to the protocol URI) that was used in that event.

Sampling event view

In the sampling event view, the group or subject on which the sampling event occurred is mentioned, and general event information (such as protocol links) is shown. Also, all relations between subjects and samples that are described by this event are shown in a structured table view, with the subjects in the first column, the samples for each subject in the second column, and the assays performed on each sample in the third column. These columns should be divided further into subcolumns to show the additional fields besides the identifier strings (such as the sample type in the case of a sample).

Group view

A group view should list all subjects in the group, as a table with subject names in rows and properties in columns. Also, all events should be viewed record-wise below that table. The events should be reported with their names and protocols. Sampling events should be reported with their identifier, give statistics on the associated samples and assays, and provide a link to the sampling event view described above for that sampling event.

User Interface to import/export existing study information and omics data

MAGE-TAB import

In order to be able to import transcriptomics studies from the GEO and ArrayExpress databases (and to accommodate users who have already uploaded their data there), dbNP needs to implement a MAGE-TAB import function. MAGE-TAB commonly employs two tab-delimited files: an .idf file, describing the investigation, and an .sdrf file, describing the relation between the experimental setup and the actual transcriptomics assays. Sometimes a third file is also supplied, an .adf file describing the technical array setup (such as loop design, reference design) in MAGE-ML. The import function should read the MAGE-TAB IDF and SDRF files, and convert them into the dbNP study metadata tables on the one hand, and dbNP transcriptomics assay information on the other hand. Furthermore, it should also read the associated data files into the transcriptomics data storage module.
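
The first step of such an import, reading the two tab-delimited files, could look like the sketch below. It assumes the usual MAGE-TAB layout: the IDF has one tag per row followed by its values, and the SDRF has a header row followed by one row per sample-to-data path. The function names are hypothetical, and the mapping to dbNP entities is not shown.

```python
import csv
import io


def parse_idf(text):
    """Read IDF text into {tag: [values]}; empty trailing cells are dropped."""
    idf = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0].strip():
            continue
        idf[row[0].strip()] = [cell for cell in row[1:] if cell != ""]
    return idf


def parse_sdrf(text):
    """Read SDRF text into (header, rows).

    Rows are kept as lists instead of dicts because SDRF files may repeat
    column names (e.g. several protocol reference columns).
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t")
            if any(cell.strip() for cell in row)]
    if not rows:
        raise ValueError("empty SDRF")
    return rows[0], rows[1:]
```

The SDRF header can then be used to decide which columns map to subjects, samples and assays, and which column names the raw data files to load.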

User Interface to create and edit study information

To facilitate easy entering of new studies, a wizard should be implemented to create new and edit existing studies. This wizard is triggered with the link 'create a new study' on the user start screen, but also when edit links on the whole study (in the study browser, and in the recently used studies list on the start screen) or on parts of the study (in the study overview) are accessed. The wizard should guide the user stepwise through the process of creating/editing a study. The steps are described in the following sections.

Study input

In the first screen, the study research area (template) and the study information should be entered, as specified in the data section. The study information fields should be updated (dynamically) according to the chosen template.

Subjects input

In the second screen, the subjects should be defined, again using the fields specified in the data section plus the additional fields that derive from the template. TODO: in next version, to facilitate cohorts, add an import function for subjects.

Groups input

In the third screen, the user should be able to define groups on the subjects. It should be easy to select multiple subjects and define a group on them. The interface may assume that every subject is a member of only one group.

Protocols input

In the fourth screen, protocols can be defined, again according to the data schema. These can be protocols that describe events (and thereby also samples), or assays.

Events input

In the fifth screen, the user should be provided with a timeline, featuring the different groups. The user has the opportunity to define general events such as treatments and challenges (which will show up above the timeline) and sampling events (which will show up below the timeline), just as in the study overview timeline. A note should be provided hinting that samples for the sampling events can be defined in the next step.

Samples input

In the sixth screen, it should be possible to link the target subjects to resulting samples for all sampling events defined in the previous step. It should be easy to generate target samples with a structured name (such as the subject name followed by a postfix string) for all target subjects in one go, and set their properties (such as sample type) to the same value with one action.
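
Generating samples with a structured name for all target subjects in one go could work as sketched below. The function name, the record layout and the duplicate check (sample names must be unique within the study) are illustrative assumptions.

```python
def generate_samples(subject_names, postfix, sample_type=None, existing_names=()):
    """Create one sample record per subject, named subject name + postfix."""
    taken = set(existing_names)
    samples = []
    for subject in subject_names:
        name = f"{subject}{postfix}"
        # Sample names must stay unique within the study.
        if name in taken:
            raise ValueError(f"sample name already in use: {name}")
        taken.add(name)
        samples.append({"name": name, "subject": subject, "sample_type": sample_type})
    return samples
```

Calling `generate_samples(["mouse1", "mouse2"], "_plasma", "plasma")` would create the samples mouse1_plasma and mouse2_plasma with the same sample type in a single action.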

Assays input

In the seventh screen, it should be possible to define assays that were performed on the samples (which should be displayed grouped by their parent sampling event). It should be very easy to select all samples resulting from one sampling event at once and define a certain assay for all those samples.

Omics data input

In the last screen, it should be possible to upload additional omics data, if some of the assay properties imply that there is associated omics data (in other words, when assay data is defined). All data-containing assays should be listed record-wise, and a link to the target omics data upload procedure (see below) should be provided in each record.

User Interface to add and edit omics data

User Interface to create and edit templates

User Interface to query both metadata and omics data

Full text query on metadata

In several places in the web application, a simple query textbox is shown:

  • in the start screen of the web interface
  • in the study browser

When the user types in a search text and presses enter, a full text query on all available metadata text fields of all studies should be carried out, and the results (all studies with matching metadata fields) should be shown in the study browser interface.
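
The intended behaviour of the full text query can be illustrated with a simple in-memory version: a case-insensitive substring match over all text fields of each study. A production implementation would more likely rely on the full text search facilities of the database; the function name and the flat record layout are assumptions.

```python
def full_text_query(studies, search_text):
    """Return the studies in which any text field contains the search text."""
    needle = search_text.strip().lower()
    if not needle:
        return []
    return [
        study for study in studies
        if any(isinstance(value, str) and needle in value.lower()
               for value in study.values())
    ]
```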

Transcriptomics queries

To respond to the querying needs of biologists, we maintain a list of queries that should be implemented at dbNPQueries.


Term: Definition
Ontology: "A formal representation of a set of concepts within a domain and the relationships between those concepts" (Wikipedia). Technically, a set of terms (indexed by accession numbers) that are described in a standardized format, such as OWL (http://www.w3.org/TR/owl-features).
Template: Describes a specific biological study type, and which metadata can be stored for that type of study. Technically, for each of the study metadata entities, a (possibly empty) set of information fields that should or can be specified in addition to the general fields for that entity.