From BioAssist

This webpage is no longer maintained and is kept only for archival purposes. Please refer to https://trac.nbic.nl/gscf and http://dbnp.org for up-to-date information on this project.

DbNP Genetics Module

The DbNP genetics module is meant both to capture genetic information on subjects in the database and to facilitate any genetics/genomics research that can be done with these captured genotypes. See also the discussion on the talk page: Talk:DbNPGeneticsModule. See also the following conceptual maps at UCDavis.

Overview / Functional requirements

The first function of the module is to facilitate the storage of genetic data. This covers the individual genotypes of subjects in the database (such as SNP alleles, or a description of knocked-out or over-expressed genes), as well as the more general genetic information needed for genetics research (such as SNP chromosome locations and gene copy number). This function is covered in the 'Data storage' section.

The second function of the module is to expose dbNP 'clean data features' that are relevant for querying and data mining. To find out which features the module should expose, an inventory is made of the biological questions that the module should be able to answer for the different stored data types. This is done in the 'Biological queries / data features' section.

The third function of the module bridges the first and second functions. Data processing pipelines are needed to turn the stored data into data that can usefully be combined with general study metadata (information about subjects, interventions, groups, events, assays) or clinical data (such as lipid levels, BMI, etc.) to answer biological questions. The data processing requirements for the different types of genomics data are covered in the 'Data processing pipelines' section.

Data storage

As mentioned in the overview, we need to store individual genotypes as well as additional genomic information. In fact, information can be stored at three levels:

  • individual level: such as actual SNP alleles as determined by either a SNP array or directed genotyping, or differences in sequence of the genome compared to a standard reference genome
  • population level: such as allele frequencies within e.g. a HapMap or study population (open question: should these also be computed for the populations in our own database?)
  • species level: such as the complete NCBI human genome build 37.1 gene positions, or the chromosome locations of SNPs in dbSNP

Data at the individual level are covered in the section 'Genotypic information'. Population and species level data are stored in the genome database of the species involved, see 'Genome database'.

Genotypic information

On a per subject basis, the following information should be stored:

  • subject ID
  • subject population or strain
  • subject study (name of the study to which the subject belongs)
  • variant class (SNP, indel, CNV, transgenic knockout, transgenic overexpression, RNAi knockdown, chromosome translocation, chromosome inversion)
  • method of detecting variant (single SNP genotyping, GWAS, CGH, sequencing, PCR, RFLP)
  • variant ID
  • variant alleles
  • phenotype
  • condition of environment (e.g., high-fat intervention, exercise challenge)
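
As an illustration only, the per-subject fields listed above could be modelled as a typed record. The class and field names below are assumptions made for this sketch, not part of the dbNP design; the enumeration values are taken directly from the list.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class VariantClass(Enum):
    # Values copied from the 'variant class' bullet above.
    SNP = "SNP"
    INDEL = "indel"
    CNV = "CNV"
    TRANSGENIC_KNOCKOUT = "transgenic knockout"
    TRANSGENIC_OVEREXPRESSION = "transgenic overexpression"
    RNAI_KNOCKDOWN = "RNAi knockdown"
    CHROMOSOME_TRANSLOCATION = "chromosome translocation"
    CHROMOSOME_INVERSION = "chromosome inversion"


class DetectionMethod(Enum):
    # Values copied from the 'method of detecting variant' bullet above.
    SINGLE_SNP_GENOTYPING = "single SNP genotyping"
    GWAS = "GWAS"
    CGH = "CGH"
    SEQUENCING = "sequencing"
    PCR = "PCR"
    RFLP = "RFLP"


@dataclass(frozen=True)
class GenotypeRecord:
    """One variant observed in one subject (hypothetical field names)."""
    subject_id: str
    population_or_strain: str
    study: str
    variant_class: VariantClass
    detection_method: DetectionMethod
    variant_id: str
    variant_alleles: str
    phenotype: Optional[str] = None
    environment: Optional[str] = None


# Made-up example values; 'rs111' is a placeholder, not a real dbSNP entry.
record = GenotypeRecord(
    subject_id="S001",
    population_or_strain="example population",
    study="example study",
    variant_class=VariantClass.SNP,
    detection_method=DetectionMethod.SINGLE_SNP_GENOTYPING,
    variant_id="rs111",
    variant_alleles="C/T",
    environment="high-fat intervention",
)
```

Using enumerations rather than free text for the variant class and detection method keeps the stored values restricted to the listed options.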

SNP alleles

The SNP genotypes should be stored in a large table with one row per subject and SNP, and the following columns:

  • subject ID
  • SNP rsID
  • SNP allele sequence
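
A minimal sketch of such a table, using SQLite purely for illustration. The table name, column names, the allele notation ('C/T') and the primary key (one genotype per subject per SNP) are assumptions of this example; the page only specifies the three fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE snp_genotype (
        subject_id TEXT NOT NULL,
        rs_id      TEXT NOT NULL,   -- SNP rsID
        alleles    TEXT NOT NULL,   -- observed allele sequence, e.g. 'C/T'
        PRIMARY KEY (subject_id, rs_id)  -- assumed: one call per subject and SNP
    )
    """
)

# Made-up genotypes with placeholder rsIDs.
conn.executemany(
    "INSERT INTO snp_genotype VALUES (?, ?, ?)",
    [
        ("S001", "rs111", "C/T"),
        ("S001", "rs222", "C/C"),
        ("S002", "rs111", "T/T"),
    ],
)


def alleles_for(subject_id):
    """Return {rs_id: alleles} for one subject."""
    cur = conn.execute(
        "SELECT rs_id, alleles FROM snp_genotype WHERE subject_id = ?",
        (subject_id,),
    )
    return dict(cur.fetchall())
```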

Genome database

The genome database is meant to facilitate and support biological queries. It has a general structure, described below, but there is an actual instance of this database for each species involved. At this point, it is likely that three species are key: human, mouse and rat.

It contains the following information:

Reference genome

For each species in the database, a reference genome build should be chosen and stored, so that genomic positions can be expressed relative to it. The chosen reference genome builds are:

Species            Provider  Build
Homo sapiens       NCBI      36.3, later 37.1 (provisional)
Mus musculus       NCBI      37.1
Rattus norvegicus  NCBI      RGSC v3.4

SNP database

The SNP database should be a table with the SNPs in the rows and the following columns:

  • organism
  • SNP rsID
  • RefSNP alleles
  • Ancestral allele
  • Chromosome number
  • Chromosome location on reference genome
  • Associated gene(s)
  • HGVS names
  • mRNA accession
  • mRNA position
  • mRNA codon change
  • protein accession
  • protein position
  • protein residue change

Estimated size: approximately 20 million SNPs (to be confirmed).
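
One possible layout for the SNP table described above, again sketched in SQLite. Because a SNP can be associated with more than one gene, this sketch moves 'Associated gene(s)' into a separate link table; that split, and all table and column names, are assumptions of the example. The HGVS, mRNA and protein columns are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE snp (
        organism         TEXT NOT NULL,
        rs_id            TEXT NOT NULL,
        refsnp_alleles   TEXT,
        ancestral_allele TEXT,
        chromosome       TEXT,
        position         INTEGER,   -- location on the reference genome
        PRIMARY KEY (organism, rs_id)
    );
    -- One row per SNP-gene association (hypothetical link table).
    CREATE TABLE snp_gene (
        organism       TEXT NOT NULL,
        rs_id          TEXT NOT NULL,
        entrez_gene_id INTEGER NOT NULL,
        PRIMARY KEY (organism, rs_id, entrez_gene_id),
        FOREIGN KEY (organism, rs_id) REFERENCES snp (organism, rs_id)
    );
    CREATE INDEX snp_gene_by_gene ON snp_gene (entrez_gene_id);
    """
)

# Placeholder rsIDs, positions and gene IDs; none refer to real entries.
conn.executemany(
    "INSERT INTO snp VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("Homo sapiens", "rs111", "C/T", "C", "1", 1000),
        ("Homo sapiens", "rs222", "A/G", "A", "1", 2000),
        ("Homo sapiens", "rs333", "G/T", "G", "2", 3000),
    ],
)
conn.executemany(
    "INSERT INTO snp_gene VALUES (?, ?, ?)",
    [
        ("Homo sapiens", "rs111", 1001),
        ("Homo sapiens", "rs222", 1001),
        ("Homo sapiens", "rs222", 1002),
        ("Homo sapiens", "rs333", 1002),
    ],
)


def snps_for_gene(entrez_gene_id, organism="Homo sapiens"):
    """Return the rsIDs associated with a gene, sorted by rsID."""
    cur = conn.execute(
        "SELECT rs_id FROM snp_gene "
        "WHERE entrez_gene_id = ? AND organism = ? ORDER BY rs_id",
        (entrez_gene_id, organism),
    )
    return [row[0] for row in cur]
```

With a table of roughly 20 million SNPs, an index on the gene column is what would keep gene-based lookups from scanning the whole association table.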

Biological queries / data features

SNP data by gene

  • Feature name: SNPsByGene
  • Question: for subjects [X], give me the allele data of all SNPs that are associated with gene [Z]
  • Input: gene Entrez ID and subject list
  • Output: a table with subjects in rows and the found SNPs in the columns, and the alleles in the cells
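
The feature above could be sketched as a small function over in-memory data. The function name, argument names and data shapes are hypothetical; a real implementation would query the genotype table and the genome database instead. This sketch returns a column for every SNP associated with the gene and leaves a cell empty (None) when a subject has no genotype for it, which is one possible reading of 'found SNPs'.

```python
def snps_by_gene(gene_id, subjects, snp_genes, genotypes):
    """Build a subjects-by-SNPs allele table for one gene.

    snp_genes: dict mapping rs_id -> set of Entrez gene IDs
    genotypes: iterable of (subject_id, rs_id, alleles) rows
    Returns (column rsIDs, {subject_id: {rs_id: alleles or None}}).
    """
    columns = sorted(rs for rs, genes in snp_genes.items() if gene_id in genes)
    wanted = set(columns)
    table = {subject: {rs: None for rs in columns} for subject in subjects}
    for subject_id, rs_id, alleles in genotypes:
        if subject_id in table and rs_id in wanted:
            table[subject_id][rs_id] = alleles
    return columns, table


# Made-up data: placeholder rsIDs and gene IDs.
snp_genes = {"rs111": {1001}, "rs222": {1001, 1002}, "rs333": {1002}}
genotypes = [
    ("S001", "rs111", "C/T"),
    ("S001", "rs333", "A/A"),
    ("S002", "rs222", "G/G"),
    ("S003", "rs111", "T/T"),
]
columns, table = snps_by_gene(1001, ["S001", "S002"], snp_genes, genotypes)
for subject, cells in table.items():
    print(subject, [cells[rs] for rs in columns])
```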

Data processing pipelines

SNP array readout (?)

Convert the SNP array machine output to the SNP data table as specified under 'SNP alleles' in the Data storage section.
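
The page does not specify the array output format, so the following is only a sketch under stated assumptions: a tab-separated export with the hypothetical columns sample_id, rs_id, allele1 and allele2, in which '-' marks a failed call. Real array software uses its own formats, and the parsing step would have to be adapted to them.

```python
import csv
import io


def array_rows_to_genotypes(handle):
    """Yield (subject_id, rs_id, alleles) rows from a tab-separated export.

    Assumed (hypothetical) columns: sample_id, rs_id, allele1, allele2.
    Rows where either allele is '-' are treated as no-calls and skipped.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        allele1, allele2 = row["allele1"], row["allele2"]
        if "-" in (allele1, allele2):
            continue
        yield (row["sample_id"], row["rs_id"], f"{allele1}/{allele2}")


# Made-up export content with placeholder rsIDs.
example_export = (
    "sample_id\trs_id\tallele1\tallele2\n"
    "S001\trs111\tC\tT\n"
    "S001\trs222\t-\t-\n"
    "S002\trs111\tT\tT\n"
)
genotype_rows = list(array_rows_to_genotypes(io.StringIO(example_export)))
```

The resulting tuples have the same three fields as the SNP data table (subject ID, SNP rsID, allele sequence), so they can be loaded into it directly.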