- 1 dbNP Features
- 1.1 Document versioning
- 1.2 Capturing study metadata
- 1.3 User Interface of dbNP
- 1.3.1 User management
- 1.3.2 User Interface to view study information
- 1.3.3 User Interface to create and edit study information
- 1.3.4 User Interface to add and edit omics data
- 1.3.5 User Interface to create and edit templates
- 1.3.6 User Interface to query both metadata and omics data
- 1.4 Omics submodules
- 1.4.1 Modules overview
- 1.4.2 Transcriptomics module
- 1.4.3 Metabolomics module
- 1.4.4 Clinical chemistry module
- 1.4.5 User Interface to import/export existing study information and omics data
- 1.5 Definitions
The end goal of dbNP is to capture, and be able to query, all types of biological and omics data that are relevant for nutritional studies. This implies three distinct features: capturing study design information, capturing the different types of omics data, and providing an intuitive query mechanism that can answer the questions biological researchers commonly have. The following serves as a versioned document describing the functional specifications for the dbNP software deliverable (an implementation of which has been started in the open source project Generic_Study_Capture_Framework). A rough overview of the resulting features from a developer perspective can be found at DbNPFeaturesImplementation.
Document versioning
|Version||Revision||Changes by||Change log|
|0.1||2379||User:Keesvb||First draft of the dbNP functional specifications|
|0.1.1||2385||User:Keesvb||Deleted the concept of subsamples and introduced sampling events linking subjects to samples|
|0.2||2472||Jildau||Reviewed document and changed assay definition|
|1.0||2631||User:Keesvb||Restructured document, factoring out modules, added protocol parameters, updated transcriptomics and clinical chemistry requirements, added mockups and schemas|
Capturing study metadata
There are many different paradigms for describing study metadata (study subjects, events, study design information etc.). For some studies, the study design is only described in Ethical Committee forms and in the Methods section of a published paper; other studies have extensive descriptions in a standardized format such as MAGE-TAB or ISA-TAB (see DbNPInspiration). dbNP has to accommodate all these differences with a study metadata model that is flexible and user-adjustable from the very start. We use templating to accomplish this: studies that have the same focus share the same metadata template. The end user interface reflects this: when describing, for example, a mammalian nutrigenomics study, the user is presented with the right data model to describe such a study. Template administration (e.g. adding fields) can also be done by the user; in fact, this is a cornerstone of the dbNP philosophy, since it is impossible to anticipate beforehand the specifics of all types of study metadata the end user may want to store. The metadata module covers all information about the study up to the sample level; from there on, information about the omics assays that were performed on the samples and the resulting data is described in specific omics submodules. There is one intermediate step in between, the assay descriptions, which are still part of the metadata and serve as the link between the samples and the omics measurements performed on them.
Storage of study metadata entities
For each study, dbNP should store information about the following (metadata) entities:
- study (the study itself, including the template used)
- subjects
- groups (of subjects)
- protocols (and their parameters)
- events (including sampling events)
- samples
- assays
The requirements for each of these entities are described in the following paragraphs. See DbNPFeaturesTemplates for a tabular overview of entities, fields and templates.
Study research area
dbNP should employ templates to store metadata about different kinds of biological studies. A template can be defined as a set of information fields for each of the entities in the previous paragraph, except for groups. 'Study research area' is the term used in the user interface for template. Templates are used in dbNP in the following ways:
- When entering a new study, the user is first presented with the choice of a particular research area, and after that with a wizard that helps the user enter study metadata according to this template (see User Interface to create and edit study information)
- The user should be able to create and modify templates, possibly deriving from existing templates (see User Interface to create and edit templates).
Some templates should be pre-defined in dbNP. Those are the templates for:
- mammalian studies
- studies about cell cultures
- studies with micro-organisms
- studies with plants
See DbNPFeaturesTemplates for the exact content of the templates.
A template field has the following information:
- its parent entity (study, subject etc.)
- field name
- field type (one of string, number, list and ontology reference)
- field unit
- field description
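As an illustration of how templates and their fields could be modelled, consider the minimal sketch below. All class, field and enum names here are assumptions made for this document, not the actual dbNP implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class FieldType(Enum):
    """The four field types named above."""
    STRING = "string"
    NUMBER = "number"
    LIST = "list"          # fixed list of items
    ONTOLOGY = "ontology"  # reference to an ontology term

@dataclass
class TemplateField:
    entity: str                    # parent entity: "study", "subject", ...
    name: str
    type: FieldType
    unit: Optional[str] = None
    description: Optional[str] = None
    list_items: list[str] = field(default_factory=list)  # only used for LIST fields

@dataclass
class Template:
    name: str                                            # the 'study research area'
    fields: list[TemplateField] = field(default_factory=list)
```

A 'mammalian studies' template would then simply be a named collection of such fields, e.g. a subject field 'body weight' of type NUMBER with unit 'g'.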
Study
General information that should be stored about each study (all fields are mandatory, unless specified otherwise):
- Study owner (one user)
- Study editors (list of users, can be empty)
- Study research area (which is the current template for this study)
- Study code (text field describing internal study code, not mandatory)
- Study research question (text field)
- Study description (text field)
- Study start date (date)
- Study ethical committee code (text field, not mandatory)
- Study subjects, groups, protocols, events, samples and assays (described in the following paragraphs)
Study subjects
General information about the subjects (biological organisms, such as mice, plants, cell cultures etc.) that should be stored for each study:
- Identifier (text field)
- Species (specified as an NCBI Taxonomy ontology term, see http://www.obofoundry.org/cgi-bin/detail.cgi?id=ncbi_taxonomy, or as a NEWT ontology term, see http://www.ebi.ac.uk/newt)
Study groups
It should be possible to define groups on the study subjects. There can be an arbitrary number of groups, and a group is a set of one or more study subjects. For each group, the following should be defined:
- The member subjects
- A human readable identifier (string)
Study protocols
Study protocols are protocols that were followed while the study was performed. This can be anything from a certain type of tissue extraction to a protocol for animal care. Only protocols that apply to the study design should be stored with the study; protocols describing sample handling and assay technology belong in the corresponding omics submodules. The following information should be stored about protocols:
- Protocol name
- Protocol reference (reference to a protocol ontology term in ArrayExpress, http://www.ebi.ac.uk/microarray-as/aer/result?queryFor=Protocol&pAccession=...)
- Protocol parameters
Study protocol parameters
Often certain specific information needs to be stored about a protocol application. For example, when a sample is taken, the amount of resulting biomaterial should be stored as a parameter of the sampling event protocol. The following information should be stored about protocol parameters:
- Protocol parameter name
- Protocol parameter type (string, number, fixed list of items)
- Protocol parameter unit
- Protocol parameter description
- Protocol parameter reference (reference to a protocol ontology term in ArrayExpress, http://www.ebi.ac.uk/microarray-as/aer/result?queryFor=Protocol&pAccession=...)
Study events
Study events are time-bound applications of a certain protocol to the study subjects, such as treatment with a medicine or a glucose challenge. The following information should be stored about study events:
- Event name (unique string identifier within the study)
- Event time (date)
- Event duration (time)
- Event classification/ type (optional ontology reference)
- Event protocol(s) (optional references to event protocols)
- The subject group or individual subject/culture on which the event occurred
Study sampling events
The taking of a sample is a special case of an event, a sampling event, which links the subject(s) on which it was performed to the resulting samples. In most cases, every subject will yield exactly one sample, but sometimes sampling one subject will result in several samples, e.g. when the sample is further separated into subsamples (such as the separation of RBCs and plasma from blood).
Study samples
A sample is a piece of biological material that is extracted from a subject/culture (as described in the corresponding sampling event) and on which assays can be performed. The following should be stored about a study sample:
- Sample name (unique string identifier within the study)
- Subject/ culture from which the sample was taken
- Sampling event describing the taking of the sample
- Biological material type (optional ontology reference)
Study assays
A study assay is an assay that is performed on certain study samples. The following information should be stored about the assays:
- Sample(s) on which it was performed
- Assay type (ontology reference)
- Assay platform (ontology reference)
- Assay data (optional reference to stored clean data in the corresponding omics module of dbNP)
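To summarize the metadata model up to this point, the sketch below shows how the entities could hang together in code. All class and attribute names are illustrative assumptions; template-defined fields are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Subject:
    identifier: str
    species: str                    # NCBI Taxonomy / NEWT ontology term

@dataclass
class SamplingEvent:
    name: str
    subjects: list[Subject]         # subjects sampled by this event
    samples: list["Sample"] = field(default_factory=list)  # usually one per subject

@dataclass
class Sample:
    name: str                       # unique within the study
    subject: Subject                # subject/culture the sample was taken from
    sampling_event: SamplingEvent   # event describing the taking of the sample
    material_type: Optional[str] = None  # optional ontology reference

@dataclass
class Assay:
    samples: list[Sample]           # sample(s) the assay was performed on
    assay_type: str                 # ontology reference
    platform: str                   # ontology reference
    data_ref: Optional[str] = None  # link into the corresponding omics module
```

The sampling event is the pivot: it records both which subjects were sampled and which samples resulted, so the one-subject-to-many-samples case (e.g. plasma and RBCs from one blood draw) falls out naturally.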
User Interface of dbNP
dbNP should have a web frontend that enables users to log in and to access, modify and query the study information in dbNP to which they have access rights. On the start screen of the interface, the user should be provided with the following possibilities:
- create an account
- a full text query for studies in the database that are marked as public
- informational statistics describing the total number of studies, users and groups in the database
After login, the user should additionally be presented with the following:
- a list of (the top 10 of) studies that the user recently accessed, with links to view and edit study information
- a link to an interface for study browsing
- a link to the create new study wizard
- if the user has the Administrator role, a link to an interface for user management
- if the user has the Template Administrator role, a link to an interface for template management
User management
Users should always have a username and password. They can be assigned to groups or roles by users with the Administrator role.
User groups should represent the different organizations that submit data to the dbNP installation. In principle, they only have a name, and they are used to assign specific roles (such as viewing private study data) to all group members, which is done by the group administrator (see roles below).
Roles can be described as permissions: they define which rights a user has, and the software should ensure that only users who have a specific role can perform the tasks covered by that role. Roles can be assigned to both users and groups. Whenever a user is a member of a group, the user automatically also gets the roles that are associated with that group.
Users can have the following roles and tasks:
- User (the default role): login, query database for public studies, create and change own studies
- Administrator (system wide): change users, groups and roles
- Template administrator: change templates
Additionally, the following roles should be generated for each group (called X here):
- GroupXAdministrator: add/remove persons to group X
- GroupXStudyAdministrator: is able to change any study that is owned by users in group X
Finally, some roles are defined implicitly via the Study information.
- A study owner can edit and delete his/her own study
- Study editors can also edit the studies of which they are editor
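Putting the explicit roles and the implicit, study-derived roles together, an edit-permission check could look like the following sketch. The data shapes (a user object carrying a set of role names, a mapping from group names to member usernames) are assumptions made for illustration only.

```python
def can_edit_study(user, study, groups):
    """Return True if `user` may edit `study`, following the roles above.

    `groups` maps group name -> set of member usernames.
    """
    # Implicit roles derived from the study information itself
    if user.name == study.owner or user.name in study.editors:
        return True
    # GroupXStudyAdministrator may edit any study owned by a member of group X
    for group_name, members in groups.items():
        role = f"Group{group_name}StudyAdministrator"
        if role in user.roles and study.owner in members:
            return True
    return False
```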
User Interface to view study information
There are two different user interfaces to view study information: one to browse studies, which provides a link for each study to the second interface, the study overview. Furthermore, there are dedicated views for study groups and sampling events.
The study browser screen should show all studies in a tabular format, with the following fields:
- study owner (username)
- study title
- study description
- study assays
It should have sorting capabilities, and a smooth paging mechanism (show a limited number of results per page). Furthermore, at the top there should be a simple query box which activates the full text query described below. The study titles should be clickable, and link to the study overview of the selected study.
The study overview should give an overview of all information in one study, with edit links for each part. The overview is mainly structured by the main study metadata entities (using e.g. an accordion widget). The different parts of the overview are:
- Study information
- Study subjects
- Study protocols
- Study timeline, showing the relation of study subjects, groups, and events
- Study assays, showing study samples and assays
The study timeline should display a timeline spanning all event dates in the study. The timeline consists of all groups in the study, with clickable labels on the left of the diagram, or, if no groups are defined, all subjects in the study. All sampling events should be shown below the timeline (clickable, with a link to the sampling event view of that sampling event). All other events should be displayed above the timeline; when the user hovers over an event, a tooltip should appear showing the protocol (with a web link to the protocol URI) that was used in that event.
When comparing multiple studies from the browse studies window, the same study overview is used in a table layout, but without the timeline component.
Group view
A group view should list all subjects in the group, as a table with subject names in rows and properties in columns. Below that table, all events should be shown record-wise, reported with their names and protocols. Sampling events should be reported with their identifier, give statistics on the associated samples and assays, and provide a link to the sampling event view of that sampling event.
(Sampling) event view
In the sampling event view, the group or subject on which the sampling event occurred is mentioned, and general event information (such as protocol links) is shown. Also, all relations between subjects and samples that are described by this event are shown in a structured table view, with the subjects in the first column, the samples for each subject in the second column, and the assays performed on each sample in the third column. These columns should be divided further into subcolumns to show the additional fields besides the identifier strings (such as the sample type in the case of a sample).
User Interface to create and edit study information
To facilitate easy entering of new studies, a wizard should be implemented to create new and edit existing studies. This wizard is triggered by the 'create a new study' link on the user start screen, but also when edit links for a whole study (in the study browser, and in the recently accessed studies list on the start screen) or for parts of a study (in the study overview) are accessed. The wizard should guide the user stepwise through the process of creating or editing a study. The steps are described in the following sections.
In the first screen, the study research area (template) and the study information should be entered, as specified in the data section. The study information fields should be updated (dynamically) according to the chosen template.
In the second screen, the subjects should be defined, again using the fields specified in the data section plus the additional fields that derive from the template. TODO: in next version, to facilitate cohorts, add an import function for subjects.
In the third screen, the possibility should be given to define groups on the subjects. It should be easy to select multiple subjects and define a group on them. The interface may assume that every subject is a member of just one group.
In the fourth screen, the user should be provided with a timeline, featuring the different groups. The user has the opportunity to define general events such as treatments and challenges (which will show up above the timeline) and sampling events (which will show up below the timeline), just as in the study overview timeline. A note should be provided hinting that samples for the sampling events can be defined in the next step.
In the fifth screen, protocols can be defined, again according to the data schema. These can be protocols that describe events (and thereby also samples).
In the sixth screen, it should be possible to link the target subjects to resulting samples for all sampling events defined in the previous step. It should be easy to generate target samples with a structured name (such as the subject name followed by a postfix string) for all target subjects in one go, and set their properties (such as sample type) to the same value with one action.
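Generating the target samples 'in one go' could work along these lines: one postfix and one set of shared property values applied to every target subject of the sampling event. Names and shapes in this sketch are illustrative assumptions.

```python
def generate_samples(sampling_event, postfix, **shared_properties):
    """Create one sample per target subject, named <subject>_<postfix>,
    with the same property values (e.g. sample type) set on all of them."""
    samples = []
    for subject in sampling_event.subjects:
        samples.append({
            "name": f"{subject.identifier}_{postfix}",
            "subject": subject.identifier,
            "sampling_event": sampling_event.name,
            **shared_properties,              # e.g. material_type="plasma"
        })
    return samples
```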
Data input is done in a separate system, in the modules. Therefore, in the seventh screen the user can define the type of measurements performed on the samples (assays). From this screen the user can then link to a clinical chemistry (eurreca), transcriptomics (Wageningen) or metabolomics (NMC) import screen.
User Interface to add and edit omics data
Clinical chemistry input
In a separate screen, it should be possible to define assays that were performed on the samples (which should be displayed grouped by their parent sampling event). It should be very easy to select all samples resulting from one sampling event at once and define a certain assay for all those samples.
Omics data input
In other separate screens, it should be possible to upload additional omics data with the analytical metadata, if some of the assay properties imply that there is associated omics data (in other words, when assay data is defined). All data-containing assays should be listed record-wise, and a link to the target omics data upload procedure (see below) should be provided in each record.
User Interface to create and edit templates
Templates consist of extra information fields that can be added to entities such as subjects, protocols and samples (see also study research area above). The user interface to create and edit templates should give the user the possibility to create new templates (from scratch, or by copying an existing template) and to edit his or her existing templates. Because changing a template can affect all studies with that template, this can only be done when the user owns all those studies; otherwise, the user has to create his or her own subtemplate. The list of fields is shared among all templates, which gives the query module the possibility of comparing data of the same type, even if there are many private (sub)templates that are very similar. Changing existing fields should be handled with care too: this is only possible when a mapping exists from the current values in the database on these fields to the new situation. This means that changing the type of a field is not possible when there are already studies which have this field. Also, changing a list is limited: values can only be deleted when they are not in use in the database.
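The two constraints on changing existing fields reduce to simple 'is this field value in use' checks, as in the sketch below; `count_values` is a hypothetical helper standing in for the actual database query.

```python
def can_change_field_type(db, template_field):
    """A field's type may only change if no study stores a value for it yet."""
    return db.count_values(field_id=template_field.id) == 0

def can_remove_list_item(db, template_field, item):
    """A list item may only be removed when no stored value uses it."""
    return db.count_values(field_id=template_field.id, value=item) == 0
```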
The browse templates screen should give an overview of all templates in the database.
The compare templates screen should list the differences between templates, to give the user an idea of the similarity of two templates.
The edit template screen should give the user the opportunity to edit a template.
User Interface to query both metadata and omics data
The query module should give the user the possibility of combining study information from the study capture module with specific omics information for the biomarkers that are exposed for the performed assays. To respond to the needs of biologists regarding querying, we maintain a list of queries that should be implemented at dbNPQueries.
Full text query on metadata
In several places in the web application, a simple query textbox is shown:
- in the start screen of the web interface
- in the study browser
When the user types in a search text and presses enter, a full text query on all available metadata text fields of all studies should be carried out, and the results (all studies with matching metadata fields) should be shown in the study browser interface.
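Assuming the metadata text fields live in a relational database, a first version of this full text query could be a case-insensitive substring match over the relevant columns, as sketched below. Table and column names are assumptions, the SQL uses PostgreSQL-style ILIKE and DB-API placeholders, and a production version would rather use the database's full-text index.

```python
def full_text_query(cursor, text):
    """Return the ids of all studies with a metadata text field matching `text`."""
    pattern = f"%{text}%"
    cursor.execute(
        """
        SELECT DISTINCT s.id
        FROM study s
        LEFT JOIN study_field_value v ON v.study_id = s.id
        WHERE s.description ILIKE %s
           OR s.research_question ILIKE %s
           OR v.text_value ILIKE %s
        """,
        (pattern, pattern, pattern),
    )
    return [row[0] for row in cursor.fetchall()]
```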
Select studies view
In the first step of defining an advanced query, studies should be selected based on selection criteria.
Select samples view
In the second step of defining an advanced query, the samples in the found studies are viewed, and they can be selected for querying and also be grouped into multiple subgroups (to enable comparison among different axes resulting from the study design).
Select biomarkers view
In the third step of defining an advanced query, the resulting biomarkers that need to be computed can be selected.
Query results view
In the query results view, the resulting biomarker values are shown in a tabular format (condensed when the result set is large, at will by study, by subject or by assay). The results should also be downloadable as CSV for further processing in e.g. R.
Omics submodules
Modules overview
As described in the section about study metadata, the assays in the metadata link the different samples to the omics data that result from these assays. Also, metadata regarding these assays (e.g. the DNA labeling protocol in transcriptomics) is stored in the corresponding submodule. This enables each submodule to have its own metadata structure. Modules can extend the datatypes in the study capture part, such as protocols, and add their own fields.
For query purposes, the modules should expose 'biomarkers' to the query module, which allows the query module to combine information from the study capture module with specific omics information from the omics module.
The different modules should expose information about the biomarkers they supply, and also handle data requests.
The general biomarker information that should be exposed is:
- Biomarker name
- Biomarker type (quantitative, qualitative, paired or differential)
- Biomarker unit
- Biomarker description
- Optional species ontology reference
- Optional tissue ontology reference
- Optional gene ontology reference
- Optional metabolite ontology reference
The data requests differ slightly for the different types of biomarkers. They are described in the following paragraphs.
Quantitative biomarkers
These are the easiest to understand: they represent a quantitative value such as LDL cholesterol in mg/dL. The data request here is a simple request for the values of a selection of samples.
- Request: for assay X, give me the biomarker value for these specific samples Y
- Response: list of samples Y and their biomarker values (= numbers)
Qualitative biomarkers
These represent a qualitative value such as a weight category. The data request here is a simple request for the values of a selection of samples.
- Request: for assay X, give me the biomarker value for these specific samples Y
- Response: list of samples Y and their biomarker values (= categories)
Paired biomarkers
These represent a comparison of the same quantitative value within pairs of samples, such as the differential expression of the same gene in the same subject before and after an intervention. The data request here is a simple request for the differential values of a selection of sample pairs.
- Request: for assay X, give me the biomarker value for these specific sample pairs Y-Y'
- Response: list of sample pairs Y-Y' and their biomarker values (= numbers)
Differential biomarkers
These represent a comparison of the same quantitative value between groups of samples, such as the differential expression of the same gene in treatment versus control samples. The data request here is a request for the differential value between two groups of samples. This could for example be represented by the P-value of a t-test (which should be defined in the exposed biomarker information, in the biomarker unit field).
- Request: for assay X, give me the biomarker value for sample group Y versus sample group Z
- Response: the differential biomarker value (= number)
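Taken together, the four request types form a small module-facing contract. The sketch below expresses it as a typed interface; the method names and return shapes are assumptions chosen for this document, not a defined dbNP API.

```python
from typing import Protocol

class BiomarkerModule(Protocol):
    """Contract an omics submodule exposes to the query module (sketch)."""

    def quantitative(self, assay_id: str, biomarker: str,
                     samples: list[str]) -> dict[str, float]:
        """Sample id -> numeric biomarker value."""

    def qualitative(self, assay_id: str, biomarker: str,
                    samples: list[str]) -> dict[str, str]:
        """Sample id -> category."""

    def paired(self, assay_id: str, biomarker: str,
               pairs: list[tuple[str, str]]) -> dict[tuple[str, str], float]:
        """(sample, sample') pair -> differential biomarker value."""

    def differential(self, assay_id: str, biomarker: str,
                     group_y: list[str], group_z: list[str]) -> float:
        """One differential value (e.g. a t-test P-value) for group Y versus Z."""
```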
Storing omics data
The main task of the modules (and also an important aspect of dbNP) is to store 'clean data' in a structured, queryable way. It is a fairly straightforward task to save arbitrary data files (or even data matrices) and link them to specific assays described in the study design information. However, biologists often express the wish to ask very specific questions of the saved data. We can only deal with these types of questions when we know exactly what type of data we are storing, and how to do statistics on it. At this moment, we are focusing on three different modules: one for transcriptomics, one for metabolomics and one for clinical chemistry data. To be able to compare these types of data across studies, we need to convert them into 'clean' data that is cross-comparable among different studies. Clean data can be defined as data that is standardized (such that it can be compared with the same type of data from other studies without a need for further statistical correction or normalization) and properly annotated (such that it refers to standardized biological properties, e.g. genes instead of probe descriptors in the case of transcriptomics).
Transcriptomics module
For transcriptomics data, we have two main requirements: we have to store the clean transcriptomics data, and it should be possible to ask specific transcriptomics queries via the user interface (via biomarker exposure). Data storage is covered in this section; the interfaces for data upload and for transcriptomics queries are covered in the User Interface section. Data from transcriptomics assays should be linked to the study design via samples: for each assay that is added, it should be specified on which sample it was performed. Furthermore, specific transcriptomics information ('technology metadata') and the data itself should be stored.
Transcriptomics assay information
The following information should be stored about uploaded transcriptomics assays:
- Transcriptomics platform (for arrays as MGED ontology term; for qPCR as custom ontology)
- Extraction protocol
- Labeling protocol
- Hybridization protocol
- Scanning protocol
- Washing protocol
- Physical array design (as ArrayExpress Array Design ontology term, http://www.ebi.ac.uk/microarray-as/aer/result?queryFor=PhysicalArrayDesign&aAccession=...)
- Normalization method (as specified in data processing below)
All protocols should be stored as MGED ontology terms, as used in MAGE-TAB.
Transcriptomics raw data storage
Any uploaded raw transcriptomics data (e.g. CEL files) should be stored in a dedicated folder on the file server, named after the experiment identifier, which can optionally be served by an external FTP or HTTP server later on. For the moment, the stored raw data is only used for data (re)processing.
Transcriptomics data processing
The project should have a dedicated GenePattern installation. Furthermore, it should be possible to issue jobs that convert uploaded raw transcriptomics data into clean transcriptomics data, and to get the results back into the dbNP database for querying.
GenePattern installation for data conversion
Each dbNP installation should have an accompanying GenePattern installation, which it can access programmatically using the GenePattern SOAP web services interface. dbNP and the GenePattern installation should share two folders on the file server. The first is the folder for raw data, described above. The second is an intermediate folder for storing the results of GenePattern jobs, which is used to load the results back into dbNP.
Transcriptomics data cleaning
dbNP should provide users with the option of processing uploaded raw transcriptomics data files into clean data via the ExpressionFileCreator module in GenePattern. The output of this module, GCT files, should be stored in a database table specifically designed and indexed to store clean transcriptomics data. It should be possible to track the progress of these jobs (both the conversion carried out in the background by GenePattern and the background job that loads the GCT data) in the user interface.
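For reference, GCT is a simple tab-delimited format: a '#1.2' version line, a line with the row and column counts, then a header row ('Name', 'Description', one column per sample) followed by the data rows. The background job that loads GCT results into the clean data table could start from a parser like this sketch:

```python
def load_gct(path):
    """Parse a GCT file into (sample_names, {row_name: [values...]})."""
    with open(path) as f:
        assert f.readline().strip() == "#1.2"        # GCT version line
        n_rows, n_cols = map(int, f.readline().split())
        header = f.readline().rstrip("\n").split("\t")
        sample_names = header[2:]                    # after Name, Description
        data = {}
        for line in f:
            cells = line.rstrip("\n").split("\t")
            data[cells[0]] = [float(v) for v in cells[2:]]
        assert len(data) == n_rows and len(sample_names) == n_cols
        return sample_names, data
```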
Transcriptomics data adjustment for querying
For some queries, it is necessary to adjust the data. dbNP should be able to do this for one type of query: meta-analysis of gene expression. When transcriptomics experiments are done in different labs, any unsupervised clustering method will cluster the experiments by lab, instead of by treatment and control. To correct for this so-called 'batch effect', a statistical correction should be carried out before the pooled analysis is done. N.B. Is implementation of this in the scope of dbNP? Is there a GenePattern module to do this?
Metabolomics module
See Metabolomics Datawarehouse Functional Requirements Inventory for requirements on metabolomics data storage. We work together with the programming team of the NMC DSP to accomplish metabolomics data storage.
Clinical chemistry module
In mammalian studies, a number of body fluids are often measured routinely by different laboratories, to check for common disease conditions. The clinical chemistry module is aimed at storing the results of these tests in a uniform way. Although this is not covered by the name, anthropometric measurements can also be stored in this module, when they cannot properly be stored as subject information in template fields.
Clinical chemistry assay information
Often, the same range of measurements is done on all separate samples in a study. From the study capture perspective, this is seen as one specific clinical chemistry assay (with, of course, the possibility of defining a number of assays to reuse a group of measurements, for example a 'standard lipid assay' containing a number of lipoprotein measurements, plus an additional 'specific study 123 assay' with study-specific measurements).
The following is defined for an assay:
- Assay name
- Assay measurements
- Exposed biomarkers to query module
Clinical chemistry clean data
Measurements should be described as follows:
- Measurement name
- Measurement unit
- Metabolite name (ontology?)
- Measurement result (?)
- ID (HMDB)/ EC number
- Statistical significance
- Test info:
  - reference values
  - supplier/literature reference
  - test name
  - associated disease
  - detectable limit
  - approval status (EFSA)
  - applied method
  - correction method (standard curve)
- Present in serum
- Present in which organs
Measurement data should be saved per assay as data tables containing one data value for each sample for each measurement in the assay.
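In code, such a per-assay table is simply a samples × measurements matrix; the sketch below (with assumed input shapes) arranges individual results into one row per sample for storage or CSV export.

```python
def build_assay_table(samples, measurements, results):
    """Arrange results as one row per sample, one column per measurement.

    `results` maps (sample, measurement name) -> value; missing entries
    become None, marking measurements not performed on that sample."""
    table = []
    for sample in samples:
        row = {"sample": sample}
        for m in measurements:
            row[m] = results.get((sample, m))
        table.append(row)
    return table
```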
User Interface to import/export existing study information and omics data
In order to be able to import transcriptomics studies from the GEO and ArrayExpress databases (and to accommodate users who have already uploaded their data there), dbNP needs to implement a MAGE-TAB import function. A MAGE-TAB submission commonly comprises two tab-delimited files: an .idf file, describing the investigation, and an .sdrf file, describing the relation between the experimental setup and the actual transcriptomics assays. Sometimes a third file is supplied, an .adf file describing the physical array design. The import function should read the MAGE-TAB IDF and SDRF files, and convert them into the dbNP study metadata tables on the one hand, and dbNP transcriptomics assay information on the other. Furthermore, it should also read the associated data files into the transcriptomics data storage module.
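Both files are plain tab-delimited text, so a first-pass reader is small. The IDF is line-oriented (each line is a key such as 'Investigation Title' followed by tab-separated values); the SDRF is a table with one row per sample/assay relation. A minimal sketch using only the Python standard library:

```python
import csv

def read_idf(path):
    """Parse a MAGE-TAB IDF file into {key: [values...]}."""
    idf = {}
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row and row[0].strip():
                idf[row[0].strip()] = [v for v in row[1:] if v.strip()]
    return idf

def read_sdrf(path):
    """Parse a MAGE-TAB SDRF file into a list of {column: value} rows."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))
```

Mapping the parsed rows onto the dbNP study metadata entities (subjects, samples, assays) is then a matter of interpreting the SDRF column types (Source Name, Sample Name, Hybridization Name, etc.).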
Definitions
|Ontology||"A formal representation of a set of concepts within a domain and the relationships between those concepts" (Wikipedia). Technically, a set of terms (indexed by accession numbers) that are described in a standardized format, such as OWL (http://www.w3.org/TR/owl-features).|
|Template||Describes a specific biological study type, and which metadata can be stored for that type of study. Technically, for each of the study metadata entities, a (possibly empty) set of information fields that should or can be specified in addition to the general fields for that entity|