DbNPFeatures


This webpage is no longer maintained and is kept here only for archival purposes. Please refer to https://trac.nbic.nl/gscf and http://dbnp.org for up-to-date information on this project.



dbNP Features

The end goal of dbNP is to capture, and be able to query, all types of study and experimental data that are relevant for biological studies. This implies three distinct features: capturing study design information, capturing the different types of experimental data, and providing an intuitive tool that is able to combine and retrieve data in order to answer biological questions. The following serves as a versioned document describing the functional specifications for the dbNP software deliverable (of which an implementation has been started in the open source project Generic_Study_Capture_Framework). A rough overview of the resulting features from a developer perspective can be found at DbNPFeaturesImplementation.

[Figure: DbNPArchitecture.png]

Document versioning

Version  Revision  Changed by    Change log
0.1      2379      User:Keesvb   First draft of the dbNP functional specifications
0.1.1    2385      User:Keesvb   Deleted the concept of subsamples and introduced sampling events linking subjects to samples
0.2      2472      User:Jildau   Reviewed document and changed assay definition
1.0      2631      User:Keesvb   Restructured document, factoring out modules, added protocol parameters, updated transcriptomics and clean chemistry requirements, added mockups and schemas
1.1      2764      User:Jildau   Reviewed document
1.2      2776      User:Keesvb   Added 'biomarker layer' specifications
1.2.1    2936      User:Keesvb   Clarified clinical data layer

Capturing study metadata

There are many different paradigms for describing study metadata (study subjects, events, study design information etc.). For some studies, the study design is only described in Ethical Committee forms and in the Methods section of a published paper. Other studies have descriptions in a standardized format such as MAGE-TAB or ISA-TAB (see DbNPInspiration). However, a good retrieval system is often lacking. dbNP has to provide a study design module that can handle all these data by using a study metadata model that is flexible and user-adjustable from the very start. The module should promote the usage of standardized terms and relations, in such a way that comparison of data is facilitated.

We use templating to accomplish this: studies that have the same focus share the same metadata template, which includes ontology references. The end user interface reflects this: when describing e.g. a mammalian nutrigenomics study, the user is presented with the right data model to describe such a study. Template administration (e.g. adding fields) can also be done by a user with administrator rights. In fact, this is a cornerstone of the dbNP philosophy, since it is not possible to describe beforehand the specifics of all types of study metadata the end user may want to store.

The study description module covers all information about the study up to the sample level. From there on, information about the (omics) assays that were performed on the samples and the resulting data is described in specific analytical submodules. There is one intermediate step in between: the different assays performed on the samples are registered in the study description module (to make it clear in which other modules data can be found). The sampleID forms the link between the study description application and the other analytical modules.

Storage of study metadata entities

For each study, dbNP should store information about the following (metadata) entities:

  • study (study itself, including the used template)
  • subjects
  • groups
  • events
  • samples
  • protocols
  • assays

The requirements for each of these entities are described in the following paragraphs. See DbNPFeaturesTemplates for a tabular overview of entities, fields and templates.
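
As an illustration of how these entities relate to each other, the following is a minimal Python sketch; all class and field names are assumptions made for illustration, not the actual GSCF domain model:

    # Minimal sketch of the study metadata entities and their relations;
    # class and field names are illustrative, not the actual GSCF model.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Subject:
        name: str

    @dataclass
    class Group:
        name: str
        members: List[Subject]

    @dataclass
    class Event:
        name: str
        protocol: str

    class SamplingEvent(Event):
        pass  # special case of an event: links subjects to samples

    @dataclass
    class Sample:
        name: str
        subject: Subject
        sampling_event: SamplingEvent

    @dataclass
    class Assay:
        assay_type: str
        samples: List[Sample]

    @dataclass
    class Study:
        name: str
        template: str                                   # the study research area
        subjects: List[Subject] = field(default_factory=list)
        groups: List[Group] = field(default_factory=list)
        events: List[Event] = field(default_factory=list)
        samples: List[Sample] = field(default_factory=list)
        protocols: List[str] = field(default_factory=list)
        assays: List[Assay] = field(default_factory=list)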

Study research area

dbNP should employ templates to store metadata about different kinds of biological studies. A template can be defined as a set of information fields for each of the entities in the previous paragraph, except for groups. 'Study research area' is the term used in the user interface for template. Templates are used in dbNP in the following ways:

  • When entering a new study, the user is presented first with the choice for a particular research area, and after that with a wizard that helps the user entering study metadata according to this template (see User Interface to create and edit study information)
  • The user should be able to create templates, possibly deriving from existing templates (see User Interface to create and edit templates). The creator of a template automatically becomes its administrator and can make other users editors or administrators. Other users can only use the template, or use it to define a new template.

Some templates should be pre-defined in dbNP. Those are the templates for:

  • mammalian studies
  • studies about cell cultures
  • studies with micro-organisms
  • studies with plants

Administrators for the templates will be appointed. See DbNPFeaturesTemplates for the exact content of the templates.

A template field has the following information:

  • its parent entity (study, subject etc.)
  • field name
  • field type (one of string, number, list and ontology reference)
  • field unit
  • field description
  • comments field
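
A template field could thus be represented roughly as follows (a sketch; the names are assumptions, not the actual implementation):

    # Illustrative sketch of a template field definition; names are assumed.
    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class FieldType(Enum):
        STRING = "string"
        NUMBER = "number"
        LIST = "list"           # item from a predefined list
        ONTOLOGY = "ontology"   # reference to an ontology term

    @dataclass
    class TemplateField:
        entity: str             # parent entity: 'study', 'subject', ...
        name: str
        field_type: FieldType
        unit: Optional[str]     # only meaningful for numeric fields
        description: str
        comments: str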

Study information

General information that should be stored about each study is (all fields are mandatory, unless specified otherwise):

  • Study name
  • Study code (text field describing the internal study code; either study name or study code should be filled out)
  • Study owner (one user)
  • Study editors (list of users, can be empty)
  • Study readers (list of users; can also be set to all users/public; can be empty)
  • Study research area (which is the current template for this study)
  • Study research question (text field: add note that people can refer to a file for further details)
  • Study description (text field: add note that people can refer to a file for further details)
  • Study start date (date)
  • Study ethical committee code (text field, not mandatory: this field should be moved to the mouse and human template)
  • Number of study subjects

Study subjects/culture

General information about the subjects (biological organisms, such as mice, plants, cell cultures etc.) that should be stored for each study:

Study groups

It should be possible to define groups on the study subjects. There can be an arbitrary number of groups, and a group is a set of one or more study subjects. For each group, the following should be defined:

  • Name (human readable identifier: string)
  • The member subjects

Study events

Study events are time-bound applications of a certain protocol to your subjects, such as treatment with a medicine or a glucose challenge. The following information should be stored about study events:

  • Event name (unique string identifier within the study)
  • Event time (date)
  • Event duration (time)
  • Event classification/ type (optional ontology reference)
  • Event protocol(s) (optional references to event protocols)
  • (The subject group or) individual subject/culture on which the event occurred
  • Event description (explaining the reasoning behind the event)

The taking of a sample is a special case of an event, a sampling event, which links the subject(s) on which it was performed to the resulting samples. Often, sampling of one subject will result in several samples, e.g. when the sample is further separated into subsamples (such as the separation of RBCs and plasma from blood, or the removal of several organs).

Study samples

A sample is a piece of biological material that is extracted from a subject/culture (as described in the corresponding sampling event) and on which assays can be performed. The following should be stored about a study sample:

  • Sample name (unique string identifier within the study)
  • Subject/ culture from which the sample was taken
  • Sampling event describing the taking of the sample
  • Biological material type (optional ontology reference) (This should be stored in the templates, as there is no such thing in cell cultures)
  • Amount of sample

Study protocols

Study protocols are the protocols that were followed while the study was performed. This can be anything from a certain type of tissue extraction to a protocol for animal care. Only protocols that apply to the study design should be stored with the study; protocols describing the analytical method and sample extraction should be stored in the analytical modules. The following information should be stored about protocols:

Often, certain specific information needs to be stored about a protocol application. For example, when a sample is taken, the target amount of resulting biomaterial should be stored as a parameter of the sampling event protocol. The following information should be stored about protocol parameters:

Study assays

A study assay is an assay that is performed on certain study samples. The following information should be stored about the assays:

  • Sample(s) on which it was performed
  • Assay type (ontology reference)
  • Assay platform (ontology reference)

The actual data can be retrieved via the query tool and are linked via the sampleID.

User Interface of dbNP

[Figure: DbNPStart page.png]

dbNP should have a web frontend that enables users to log in and to access, modify and query the study information in dbNP to which they have access rights. On the start screen of the interface, the user should be provided with the following possibilities:

  • login
  • create an account
  • a full text query for studies in the database that are marked as public
  • informational statistics describing the total number of studies, users and groups in the database

[Figure: DbNPLogged in page.png]

After login, the user should additionally be presented with the following:

  • a list of (the top 10 of) studies that the user recently accessed, with a link to edit and view study information
  • a link to an interface for querying all accessible studies
  • a link to an interface for study browsing
  • a link to the create new study wizard
  • if the user has the Administrator role, a link to an interface for user management
  • if the user has the Template Administrator role, a link to an interface for template management


User management

Users

Users should always have a username and password. They can be assigned roles by users with the Administrator role.

User roles

Roles can be described as permissions: they define which rights a user has, and the software program should take care that only users who have a specific role can perform the tasks which are targeted by this role. Roles can be assigned to both users and groups. Whenever a user is a member of a group, the user automatically also gets the roles that are associated with this group.

Users can have the following roles and tasks:

  • User (the default role): login, query database for public studies, create and change own studies
  • Administrator (system wide): change users, groups and roles
  • Template administrator: change templates

Finally, some roles are defined implicitly via the Study information.

  • A study owner can edit and delete his/her own study
  • Study editors can also edit the studies of which they are editor
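
A minimal sketch of how these explicit and implicit roles could combine into an edit-permission check (hypothetical names; the actual authorization code in GSCF may differ):

    # Sketch of an edit-permission check combining explicit roles and the
    # implicit study-level roles; all names are illustrative.
    def can_edit_study(user, study) -> bool:
        if "Administrator" in user.roles:      # system-wide role
            return True
        if user == study.owner:                # implicit: owner may edit/delete
            return True
        return user in study.editors           # implicit: listed editors may edit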

User Interface to view study information

There are two different user interfaces to view study information: one to browse studies, which for each study provides a link to the second view, the study overview. Furthermore, there are dedicated views for study groups and sampling events.

Browse studies

[Figure: DbNPBrowse studies.png]

The study browser screen should show all studies in a tabular format, with the following fields:

  • study owner (username)
  • study title
  • study description
  • study assays
  • study events

It should have sorting capabilities, and a smooth paging mechanism (show a limited number of results per page). Furthermore, at the top there should be a simple query box which activates the full text query described below. The study titles should be clickable, and link to the study overview of the selected studies.

Study overview

[Figure: DbNPCompare studies.png]

The study overview should give an overview of all information in one study, with edit links for each part. The overview is mainly structured by the main study metadata entities (using e.g. an accordion widget). The different parts of the overview are:

  • Study information
  • Study subjects
  • Study protocols
  • Study timeline, showing the relation of study subjects, groups, and events
  • Study assays, showing study samples, subsamples and assays

The study timeline should display a timeline spanning all events in the study. The timeline consists of all groups in the study, with clickable labels on the left of the diagram, or, if there are no groups defined, all subjects in the study. All sampling events should be shown below the timeline (clickable, with a link to the sampling event view of that sampling event). All other events should be displayed above the timeline, showing the span of the event. When the mouse hovers over an event, a 'tooltip' should appear showing the protocol (with a web link to the protocol URI) that was used in that event.

When comparing multiple studies from the browse studies window, the same study overview is used in a table layout, but without the timeline component.

Group view

A group view should list all subjects in the group, as a table with subject names in rows and properties in columns. Also, all events should be viewed record-wise below that table. The events should be reported with their names and protocols. Sampling events should be reported with their identifier, give statistics on the associated samples and assays, and provide a link to the sampling event view described above for that sampling event.

(Sampling) event view

[Figure: DbNPSampling event view.png]

In the sampling event view, the group or subject on which the sampling event occurred is mentioned, and general event information (such as protocol links) is shown. Also, all relations between subjects and samples that are described by this event are shown in a structured table view, with the subjects in the first column, the samples for each subject in the second column, and the assays performed on those samples in the third column. These columns should be divided further into subcolumns to show the additional fields besides the identifier strings (such as the sample type in the case of a sample).

User Interface to create and edit study information

To facilitate easy entry of new studies, a wizard should be implemented to create new and edit existing studies. This wizard is triggered by the link 'create a new study' on the user start screen, but also when edit links on the whole study (in the study browser, and in the recently used studies list on the start screen) or on parts of the study (in the study overview) are accessed. The wizard should guide the user stepwise through the process of creating/editing a study. The steps are described in the following sections.

Study input

In the first screen, the study research area (template) and the study information should be entered, as specified in the data section. The study information fields should be updated (dynamically) according to the chosen template.

Subjects input

In the second screen, the subjects should be defined, again using the fields specified in the data section plus the additional fields that derive from the template. The number of subjects defined in the first (study input) screen should already be predefined in the second screen (temporarily named subject1, subject2 etc.). TODO: in a next version, to facilitate cohorts, add an import function for subjects.

Groups input

In the third screen, the possibility should be given to define groups on the subjects. It should be easy to select multiple subjects and define a group on them. The interface may assume that every subject is a member of just one group.

Events input

In the fourth screen the events (such as treatments and challenges) are defined, using the fields specified in the data section extended with the fields from the template.

[Figure: DbNPTimeline view.png]

In this screen, a timeline is also provided, featuring the different groups. The user has the opportunity to position the events above the timeline. Below the timeline sampling moments (sampling events) can be added, just as in the study overview timeline. A note should be provided hinting that samples for the sampling events can be defined in the next step.

Samples input

In the fifth screen, it should be possible to link the target subjects to the resulting samples for all sampling events defined in the previous step. It should be easy to generate target samples with a structured name (such as the subject name followed by the biological material or any default postfix string) for all target subjects in one go, and to set their properties (such as sample type) to the same value with one action.

Protocols input

In the sixth screen, protocols can be defined, again according to the data schema. The screen will show all defined events and samples (to remind the user to add all used protocols) to which protocol information can be linked.

Assays input

Data input is done in a separate system, the modules. Therefore, in the seventh screen the user can define the type of measurements performed on the samples (assays). From this screen the user can then link to a clinical chemistry (eurreca), transcriptomics (Wageningen) or metabolomics (NMC) import screen.

User Interface of the analytical modules to add and edit experimental (clean) data

There is no user interface in the dbNP metadata/query section to add clean data (such as transcriptomics or metabolomics data); this is done in the various submodules (where the data is also stored). However, it is possible to query these data via the query module of dbNP.

User Interface to create and edit templates

Template views and administration

Templates consist of extra information fields that can be added to entities such as subjects, protocols and samples (see also study research area above). The user interface to create and edit templates should give the user the possibility to create new templates (from scratch, or by copying an existing template) and to edit his or her existing templates. Because changing a template can affect all studies with that template, this can only be done by the administrator of the template; otherwise, the user has to create his or her own subtemplate.

The list of fields is shared among all templates. This gives the query module the possibility of comparing data of the same type, even if there are many private (sub)templates that are very similar. Changing existing fields should be handled with care too: this is only possible when a mapping exists from the current values in the database on these fields to the new situation. This means that changing the type of a field is not possible when there are already studies which use this field. Also, changing a list is limited: values can only be deleted when they are not in use in the database.

Browse templates

[Figure: DbNPTemplate management.png]

The browse templates screen should give an overview of all templates in the database. The administrator of the template will see edit links on all template views.

Compare templates

[Figure: DbNPCompare templates.png]

The compare templates screen should list the differences between templates, to give the user an idea of the similarity of two templates; the differences can also be used as suggestions for additional fields for a certain template.

Edit template

[Figure: DbNPEdit template.png]

The edit template screen should give the user the opportunity to edit a template.

User Interface to query both metadata and omics data

Query overview

The query module should give the user the possibility of combining study information from the study capture module with analytical data (measurements and metadata) from the different analytical modules. To be able to respond to the needs of the biologists regarding querying, we are maintaining a list of queries that should be implemented at dbNPQueries.

Full text query on metadata

In several places in the web application, a simple query textbox is shown:

  • in the start screen of the web interface
  • in the study browser

When the user types in a search text and presses enter, a full text query on all available metadata text fields of all studies should be carried out, and the results (all studies with matching metadata fields) should be shown in the study browser interface.

Select studies view

[Figure: DbNPQuery studies.png]

In the first step of defining an advanced query, studies should be selected based on one or more selection criteria (e.g. treatment and/or species).

Select samples view

[Figure: DbNPQuery samples.png]

In the second step of defining an advanced query, the samples in the found studies are shown; they can be selected for querying and also grouped into multiple subgroups (to enable comparison along the different axes resulting from the study design).

Select biomarkers view

[Figure: Query biomarkers.png]

In the third step of defining an advanced query, the resulting biomarkers that need to be computed or retrieved can be selected.

Query results view

In the query results view, the resulting biomarker values are shown in a tabular format (condensed when the result set is large: at will by studies, by subjects or by assays). Also, the results should of course be downloadable as CSV for further processing in e.g. R. TODO: make mockup (challenge: both measurement data and study data should be presented).

Omics submodules

Modules overview

[Figure: DbNPBiomarkerLayer.png]

Modularization

As described in the section about study metadata, the assays in the metadata link the different samples to the omics data that results from these assays. Also, metadata regarding these assays (e.g. DNA labeling protocol in transcriptomics) is stored in the corresponding submodule. This enables the submodule to have its own metadata structure.

Biomarker view

For query purposes, the modules should expose 'biomarkers' to the query module, which allows the query module to combine information from the study capture module with specific omics information from the omics module.

Analytical submodule requirements

The different submodules should expose information about the data they supply, and also handle data requests.

The information (or metadata) about the data the modules supply can be given on two levels: on the assay level, and also on the sample/biomarker/molecule level. An assay is defined as a series of the same measurements that are carried out on a number of samples (often also at the same time, or in a short timespan).

An example: the transcriptomics module exposes assays (a batch of microarrays done on a number of samples at the same time), of which some (quantitative) biomarkers can be the absolute gene expressions (e.g. Affymetrix gene expression levels for all genes that have probes on the microarray), but also e.g. (as a differential biomarker) the differential expression of a certain gene with respect to a defined grouping of the samples (defined in the query module which asks for the biomarker), or the same for all genes in a certain pathway.

For the clinical chemistry module, the assay could be a number of routine blood measurements ('Blood measurements'), and the (quantitative) biomarkers the measured blood metabolite levels ('HDL-C level', 'Glucose concentration').

The general assay information/assay metadata fields can vary between modules, but the following information about those fields should be exposed:

  • Assay metadata field names
  • Assay metadata field type (string/number/item from predefined list/ontology reference)
  • Assay metadata field unit (for numbers)
  • Assay metadata field descriptions

Examples of assay metadata:

  • normalization method
  • etc.

Also, it should be possible to ask the module which biomarkers it can supply for a certain assay. The biomarker information that should be exposed is:

  • Biomarker name
  • Biomarker type (quantitative, qualitative, paired or differential: if paired also a description what is paired to what)
  • Measurement unit (if number; optional)
  • Optional species ontology reference
  • Optional tissue ontology reference
  • Optional gene ontology reference
  • Optional metabolite ontology reference
  • Also here, the module can define its own extra metadata properties on the biomarkers. This should be general information about the test/assay/measurement (e.g. the detectable limit), as opposed to information about a specific application of the measure (e.g. the standard deviation of a measured value). In the current biomarker layer specification, the latter cannot be stored directly, but could of course be exposed as a separate biomarker (e.g. sample quality).

As a result, there can be several (meta)data requests to a submodule. The following is a list of example requests and responses.

Metadata description requests:

  • Request: give me all the metadata fields available for assays
  • Response: a list of assay metadata field names, types (string/number/item from predefined list/ontology reference), units (for numbers) and descriptions
  • Request: give me all the biomarker descriptions that are available for assay X
  • Response: a list of biomarker names, types (quantitative/qualitative/paired/differential), units (for numbers), descriptions and ontology references of the biomarkers that are available for assay X
  • Request: give me all the extra metadata fields that are available for biomarkers in assay X
  • Response: a list of extra biomarker metadata field names, types (string/number/item from predefined list/ontology reference), units (for numbers) and descriptions that are available for biomarkers in assay X
  • Request: give me the values of all (or some specific) extra metadata fields for the biomarkers B1 and B2
  • Response: a list of the biomarker names and the values of the extra metadata fields for these biomarkers

Actual metadata requests:

  • Request: give me all (or certain) assay metadata fields for assay X
  • Response: a list of the assay metadata field names, and their values for assay X (as name-value pairs)

Actual data requests:

The submodule data requests differ slightly for the different types of biomarkers. They are described in the following paragraphs.

Quantitative biomarkers

These are the easiest to understand: they represent a quantitative value such as LDL cholesterol in mg/dL. The data request here is a simple request for the values of a selection of samples.

  • Request: for assay X, give me the biomarker value for these specific samples Y
  • Response: list of samples Y and their biomarker values (= numbers)

Qualitative biomarkers

These represent a qualitative value such as a weight category. The data request here is a simple request for the values of a selection of samples.

  • Request: for assay X, give me the biomarker value for these specific samples Y
  • Response: list of samples Y and their biomarker values (= categories)

Paired biomarkers

These represent a response of the same quantitative value within pairs of samples, such as the differential expression of the same gene in the same subject before and after an intervention. The data request here is a simple request for the differential values of a selection of sample pairs.

  • Request: for assay X, give me the biomarker value for these specific sample pairs Y-Y'
  • Response: list of sample pairs Y-Y' and their biomarker values (= numbers)

Differential biomarkers

These represent a response of the same quantitative value between groups of samples, such as the differential expression of the same gene in treatment versus control samples. The data request here is a request for the differential value between two groups of samples. This could for example be represented by the P-value of a t-test (which should be defined in the exposed biomarker information, in the biomarker unit field).

  • Request: for assay X, give me the biomarker value for sample group Y versus sample group Z
  • Response: the differential biomarker value (= number)
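
Taken together, the biomarker layer can be summarized as an abstract module interface along the following lines (a Python sketch with assumed method names; the concrete module API is not specified in this document):

    # Sketch of the (meta)data requests a submodule should be able to answer;
    # method and parameter names are illustrative only.
    from abc import ABC, abstractmethod

    class AnalyticalModule(ABC):

        # --- metadata description requests ---
        @abstractmethod
        def assay_metadata_fields(self):
            """All metadata fields available for assays (name, type, unit, description)."""

        @abstractmethod
        def biomarkers(self, assay_id):
            """Descriptions of all biomarkers available for one assay."""

        # --- actual metadata requests ---
        @abstractmethod
        def assay_metadata(self, assay_id, fields=None):
            """Name-value pairs for (some of) the metadata fields of one assay."""

        # --- actual data requests ---
        @abstractmethod
        def values(self, assay_id, sample_ids):
            """Quantitative/qualitative biomarkers: one value per requested sample."""

        @abstractmethod
        def paired_values(self, assay_id, sample_pairs):
            """Paired biomarkers: one value per (sample, sample') pair."""

        @abstractmethod
        def differential_value(self, assay_id, group_y, group_z):
            """Differential biomarkers: one value (e.g. a t-test P-value) for group Y versus group Z."""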

Storing omics data

The main task of the modules (and also an important aspect of dbNP) is to store 'clean data' in a structured, queryable way. It is a fairly straightforward task to save arbitrary data files (or even data matrices) and link them to specific assays described in the study design information. However, biologists often express the wish to ask very specific questions of the saved data. We can only deal with these types of questions when we know exactly what type of data we are storing, and how to do statistics on it. At this moment, we are focusing on three different modules: one for transcriptomics, one for metabolomics and one for clinical chemistry data. To be able to compare these types of data across studies, we need to convert them into 'clean' data that is cross-comparable among different studies. Clean data can be defined as data that is standardized (such that it can be compared with the same type of data from other studies without a need for further statistical correction or normalization) and properly annotated (such that it refers to standardized biological properties, e.g. genes instead of probe descriptors in the case of transcriptomics).

Transcriptomics module

For transcriptomics data, we have two main requirements: we have to store the clean transcriptomics data, and it should be possible to ask specific transcriptomics queries via the user interface (via biomarkers exposed in the biomarker layer). Data from transcriptomics assays should be linked to the study design via the sample and assay IDs. Furthermore, specific transcriptomics information ('technology metadata') and the data itself should be stored.

The transcriptomics module is realized by Wageningen University & Research Centre as described in DbNPCleanTranscriptomicsDatabase.

Transcriptomics assay information

The following information should be stored about uploaded transcriptomics assays:

All protocols should be stored as an MGED ontology term, as used in MAGE-TAB.

Transcriptomics raw data storage

Any uploaded raw transcriptomics data (e.g. CEL files) should be stored in a dedicated folder on the file server, named after the experiment identifier and sampleID, which can optionally be served by an external FTP or HTTP server later on. For the moment, the stored raw data is only used for data (re)processing.

Transcriptomics data processing

The project should have a dedicated GenePattern installation. Furthermore, it should be possible to issue jobs to convert uploaded raw transcriptomics data into clean transcriptomics data, and get the results back into the clean transcriptomics database for querying.

GenePattern installation for data conversion

Each dbNP installation should have an accompanying GenePattern installation, which it can access programmatically by using the GenePattern SOAP web services interface. dbNP and the GenePattern installation should share two different folders on the file server. The first is the folder for raw data, described above. The second is an intermediate folder for storing results of GenePattern jobs, which is used to load the results back into dbNP.

Transcriptomics data cleaning

dbNP should provide users with the option of processing uploaded raw transcriptomics data files into clean data via the ExpressionFileCreator module in GenePattern. The output of this module, GCT files, should be stored in a database table specifically designed and indexed to store clean transcriptomics data. It should be possible to track the progress of these jobs (carried out in the background by GenePattern, as well as the background job that loads the GCT data) in the user interface.
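
For reference, GCT is a simple tab-delimited format (a version line '#1.2', a dimensions line, then a table with Name and Description columns followed by one column per sample). A minimal parser sketch, assuming well-formed GCT 1.2 input:

    # Minimal sketch of reading a GCT 1.2 file (the ExpressionFileCreator
    # output) into per-sample expression values before database loading.
    def read_gct(path):
        with open(path) as f:
            assert f.readline().strip() == "#1.2"        # GCT version line
            n_rows, n_cols = map(int, f.readline().split())
            header = f.readline().rstrip("\n").split("\t")
            samples = header[2:]                          # after Name, Description
            data = {}
            for line in f:
                cells = line.rstrip("\n").split("\t")
                name, values = cells[0], list(map(float, cells[2:]))
                data[name] = dict(zip(samples, values))
            assert len(data) == n_rows and len(samples) == n_cols
            return data                                   # probe/gene -> {sample: value}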

Transcriptomics data adjustment for querying

For some queries, it is necessary to adjust the data. The simplest example is the calculation of fold changes or differential expression of genes between groups of samples. Also, the transcriptomics module should be able to adjust the data for meta-analysis of gene expression over multiple experiments. When transcriptomics experiments are done in different labs, any unsupervised clustering method will cluster the samples by lab, instead of by treatment and control. To correct for this so-called 'batch effect', a statistical correction should be carried out before the pooled analysis is done. The transcriptomics module should infer whether this correction is necessary from the assay metadata that is stored in the module.
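
As a toy illustration of such a batch correction (real pipelines typically use a dedicated method such as ComBat; this sketch merely mean-centers each gene within each batch):

    # Toy illustration of a batch-effect correction: centre the expression
    # values of every gene per batch (lab). Real pipelines would use a
    # dedicated method such as ComBat instead.
    import numpy as np

    def center_per_batch(expr, batches):
        """expr: genes x samples matrix; batches: one batch label per sample."""
        expr = np.asarray(expr, dtype=float)
        corrected = expr.copy()
        for b in set(batches):
            cols = [i for i, lab in enumerate(batches) if lab == b]
            corrected[:, cols] -= expr[:, cols].mean(axis=1, keepdims=True)
        return corrected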

Transcriptomics module User Interface

The transcriptomics module should have a web interface that offers at least the following functionalities:

  • Upload raw data
  • Upload annotation data
  • Normalization and annotation of gene expression data via GenePattern

Metabolomics module

See Metabolomics Datawarehouse Functional Requirements Inventory for requirements on metabolomics data storage. We work together with the programming team of the NMC DSP to accomplish metabolomics data storage.

Clinical data module

Overview

In mammalian studies, a number of body fluids are often measured routinely by different laboratories to check for common disease conditions. The clinical chemistry module is aimed at storing the results of these tests in a uniform way. Also, although this is not covered by the name, anthropometric measurements can be stored in this module as well, when they cannot properly be stored as subject information in template fields.

Clinical chemistry assay information

Often, in one study the same range of measurements is done on all separate samples in the study. From the study capture perspective, this is seen as one specific clinical chemistry assay (of course, with the possibility of defining a number of assays to reuse a group of measurements, for example a 'standard lipid assay' containing a number of lipoprotein measurements, and an additional 'specific study 123 assay' with specific measurements).

The following is defined for an assay (by definition of the biomarker layer):

  • Assay name
  • Assay measurements
  • Exposed biomarkers to query module

Clinical chemistry clean data

There are no specific assay metadata fields for the clinical chemistry module. However, a lot of information about the different measurements is given via the biomarker metadata fields. The biomarkers correspond to the measurements (e.g. LDL cholesterol, insulin concentration) that were carried out in the assay.

The biomarker metadata fields that the clinical chemistry module should expose are:

Of course, the standard biomarker description fields:

  • Measurement name --> metabolite name
  • Measurement unit
  • Metabolite ontology reference: ID (HMDB)
  • Enzyme ontology reference: EC number
  • Organism part (organ) ontology reference
  • Compound ontology reference (ChEBI)
  • Drug ontology reference (DrugBank)

And as extra fields measurement/test info:

  • reference values (String)
  • detectable limit (Float)
  • correction method (e.g. standard curve): String (descriptive)
  • drug (also available as drug, Category: Yes/No)
  • intake (also present in food, Category: Yes/No)

And also, linked from the NuGO Wiki:

  • associated disease(s) (Ontology)
  • present in serum: Category (Yes/No)

All of this information refers to the test/measurement in general, not to the specific values (which can be obtained by asking for the actual biomarker values for specific assays via the biomarker layer). Measurement data should be saved per assay as data tables containing a measurement value for each sample for each measurement in the assay. The exposed biomarkers should map to the measurements, both as quantitative and as differential (t-test) biomarkers.

The assay metadata that the module should expose is:

  • supplier/ literature reference (String)
  • approval status (EFSA) (Category: Yes/No)
  • applied method: String (URL)
  • SOP: String (URL)

User Interface to import/export existing study information and omics data

Tabular import

dbNP should have a generic import tool that is able to read any tabular format (comma- or tab-delimited CSV) or Excel files and import these into one particular study (which should already exist). The importer should parse the input file, and then give the user the possibility to map each of the columns (or rows) in the input file to entities/properties in the study metadata or clinical data.

Furthermore, after the import is ready, the user should be able to save the import settings into the database, so that it will be easy to import the same type of file again without repeating all the import description steps.

This is done in a number of steps, most of which (steps 2 and further) take place in one big screen: essentially a view of the imported data table, with options to define where the data in the columns should go.

General information

In the first step, the user should make clear whether the main data entries are in the rows or in the columns. This is simply done by assuming that the different entities (studies, samples etc.) are in the rows, and the different properties (e.g. different clinical data measurements) are in the columns. There should be a Transpose button, which enables the user to swap the import table if this is not the case.

Furthermore, the user should select what the smallest entity is (Subject, Event or Sample), and which column holds the identifier of that smallest entity. For example, if a file has information on the subjects (which is probably very redundant, as the same information is repeated over and over again), on events and on samples, the smallest entity would be Sample. The user has to make clear in which column the import wizard can find the sample name (not the ID, since identifiers are always generated automatically by the database once entities are saved).

Finally, the user should indicate if there is also clinical data in the sheet, and if so, which Assay(s) was/were done.

Entity/property mapping

Next, the user should choose the data type for each column (string, float etc.) and be able to map each column to a particular entity and property. For example, a string column describing the subject name should map to entity Subject, property name.

  • Entities can be: Subject, Event, Sample, or one of the defined clinical data Assays (if any).
  • Properties can be: the properties of the chosen entity (see domain classes), or the measurements in the chosen clinical data assay.

Not all columns need to be mapped; the user can choose not to import certain columns. In fact, that is the default setting for each column to begin with.

Validation

Finally, the data should be validated in two ways: the content should be checked for its type (e.g. numbers should be parsable as numbers, dates as dates), and it should be checked whether the entities are consistent. Whenever e.g. multiple samples are defined for a certain subject and there is also subject information, this subject information is repeated in every sample row. It should be checked that this data is consistent (because otherwise the import function cannot know which value to choose).

The final import step involves creating all the entities described (starting from the biggest one, probably Subject), assigning all the properties, and creating the actual clinical data assay instance and adding the measurement values to it (if any). A sketch of the column-mapping idea behind these steps is given below.
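
The following is a sketch of the column-mapping idea (hypothetical structures; the real importer operates on Excel/CSV input and the GSCF domain classes):

    # Sketch of the column-mapping step of the tabular importer.
    # ColumnMapping records where each input column should go; unmapped
    # columns (entity None) are skipped, which is the default.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ColumnMapping:
        column: str                   # header of the input column
        entity: Optional[str] = None  # 'Subject', 'Event', 'Sample' or an Assay
        prop: Optional[str] = None    # property of that entity, or a measurement

    def apply_mappings(rows, mappings):
        """Turn raw table rows (dicts) into per-entity property dicts."""
        for row in rows:
            record = {}
            for m in mappings:
                if m.entity is None:          # column not imported
                    continue
                record.setdefault(m.entity, {})[m.prop] = row[m.column]
            yield record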

MAGE-TAB import

In order to be able to import transcriptomics studies from the GEO and ArrayExpress databases (and to facilitate users that have already uploaded their data there), dbNP needs to implement a MAGE-TAB import function. A MAGE-TAB submission commonly consists of two tab-delimited files: an .idf file, describing the investigation, and an .sdrf file, describing the relation between the experimental setup and the actual transcriptomics assays. Sometimes a third file is supplied, an .adf file describing the technical array setup (such as loop design or reference design) in MAGE-ML. The import function should read the MAGE-TAB IDF and SDRF files, and convert them into the dbNP study metadata tables on the one hand, and dbNP transcriptomics assay information on the other. Furthermore, it should also read the associated data files into the transcriptomics data storage module.

ISA-TAB import and export

In order to be able to relate to other activities, the study description module should be able to import and export ISA-TAB files.


Module Communication and REST Resources

Internally, modules communicate using REST services. The communication is managed using Grails' built-in REST features and a Communication Manager API.

The RESTful resources currently available are as follows.


REST resources available on GSCF

  • rest/getStudies
    • Description: Get list of all externalStudyIDs.
    • Parameter query: none.
    • Return value: List of all externalStudyIDs.


  • rest/getSubjects
    • Description: Give a list of all subjects for a study.
    • Parameter query: externalStudyID - this is the code of a Study domain object in GSCF.
    • Return value: List of subjects' names.


  • rest/getAssays
    • Description: Give list of all assays of a study.
    • Parameter query: externalStudyID - the code of a Study domain object in GSCF.
    • Return value: list of externalAssayIDs.


  • rest/getSamples
    • Description: This returns everything that is necessary to collect all samples related to a specific assay.
    • Parameter query: externalStudyID - the code of a Study domain object in GSCF.
    • Return value: A list of maps that contain the fields necessary to uniquely identify a Sample in GSCF. Each map contains the following keys: name, material, subject, event, startTime
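
For illustration, a submodule could consume these resources roughly as follows (a Python sketch using the requests library; the base URL and the assumption of JSON responses are illustrative):

    # Sketch of calling the GSCF REST resources; base URL and JSON
    # response handling are assumptions for illustration.
    import requests

    GSCF = "http://localhost:8080/gscf"

    def get_studies():
        """List of all externalStudyIDs."""
        return requests.get(f"{GSCF}/rest/getStudies").json()

    def get_assays(external_study_id):
        """List of externalAssayIDs for one study."""
        return requests.get(f"{GSCF}/rest/getAssays",
                            params={"externalStudyID": external_study_id}).json()

    def get_samples(external_study_id):
        """Maps with the keys name, material, subject, event, startTime."""
        return requests.get(f"{GSCF}/rest/getSamples",
                            params={"externalStudyID": external_study_id}).json()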


REST resources available on SAM

  • rest/getQueryResult
    • Description: Get the results of a full-text query applied to Assays and AssayTypes in SAM.
    • Parameter query: Contains a query string.
    • Return value: results map. It contains two keys, 'studyIds' and 'assays'. The 'studyIds' key maps to a list of Study domain objects of GSCF; 'assays' maps to a list of pairs. Each pair consists of an Assay domain object of GSCF and additional assay information from SAM, provided as a map.
    • Example of a returned map: [studyIds:[PPSH], assays:[[isIntake:false, isDrug:false, correctionMethod:test Correction Method 1, detectableLimit:1, isNew:false, class:data.SimpleAssay, externalAssayID:1, id:1, measurements:null, unit:Insulin, inSerum:false, name:test Simple Assay 1, referenceValues:test Reference Values 1]]]


  • rest/getQueryResultWithOperator
    • Description: Get the results of a full-text query on measurement types in SAM that also looks for specific measurement values.
    • Parameter query: Contains a query string to match in SAM's measurement types.
    • Parameter operator: '=','<', or '>'. The operator applied to the value during the search.
    • Parameter value: A double value to be searched for as a measurement value in SAM.
    • Return value: A list of maps. Each Map contains enriched information about one measurement value in SAM (see example below).
    • Example of a call: http://localhost:8182/sam/rest/getQueryResultWithOperator/nil?query=Insulin&value='101'&operator='>'
    • Example of a resulting list of maps: [["type":"Glucose", "unit":"Insulin", "value":202, "assay":Lipid profiling, "sample":A11_B], ["type":"Glucose", "unit":"Insulin", "value":102, "assay":Lipid profiling, "sample":A1_B], ["type":"Insulin", "unit":"g", "value":201, "assay":Lipid profiling, "sample":A11_B], ["type":"Insulin", "unit":"g", "value":101, "assay":Lipid profiling, "sample":A1_B]]
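
Mirroring the example call above, such a request could be issued as follows (a sketch; the base URL and the quoting of the value and operator parameters are taken from the example):

    # Sketch of querying SAM for measurement values above a threshold,
    # mirroring the example call above; the base URL is taken from it.
    import requests

    SAM = "http://localhost:8182/sam"

    def query_measurements(query, operator, value):
        params = {"query": query,
                  "operator": f"'{operator}'",   # quoted as in the example call
                  "value": f"'{value}'"}
        return requests.get(f"{SAM}/rest/getQueryResultWithOperator/nil",
                            params=params).json()

    # e.g. all measurements matching 'Insulin' with value > 101:
    # query_measurements("Insulin", ">", 101)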

Definitions

Ontology: "A formal representation of a set of concepts within a domain and the relationships between those concepts" (Wikipedia). Technically, a set of terms (indexed by accession numbers) that are described in a standardized format, such as OWL (http://www.w3.org/TR/owl-features).

Template: Describes a specific biological study type, and which metadata can be stored for that type of study. Technically, for each of the study metadata entities, a (possibly empty) set of information fields that should or can be specified in addition to the general fields for that entity.