DbNPGeneticsModule
This webpage is no longer maintained and is kept only for archive purposes. Please refer to https://trac.nbic.nl/gscf and http://dbnp.org for up-to-date information on this project.
DbNP Genetics Module
The DbNP genetics module is meant both to capture genetic information on subjects in the database and to facilitate any genetics/genomics research that can be done with these captured genotypes. See also the discussion on the talk page: Talk:DbNPGeneticsModule. See also the following conceptual maps at UCDavis.
Overview / Functional requirements
The first function of the module is to facilitate the storage of genetic data. This covers the individual genotypes of subjects in the database (such as SNP alleles, or a description of knocked-out or over-expressed genes), but also the more general genetic information needed for genetics research (such as SNP chromosome locations and gene copy number). This function is covered in the section 'Data storage'.
The second function of the module is to expose dbNP 'clean data features' that are relevant for querying and data mining. To find out which features the module should expose, an inventory is made of the biological questions that the module should be able to answer for the different stored data types. This is done in the section 'Biological queries / data features'.
The third function of the module bridges the first and second functions. Data processing pipelines are needed to turn the stored data into data that can usefully be combined with general study metadata (information about subjects, interventions, groups, events, assays) or clinical data (such as lipid levels or BMI) to answer biological questions. The data processing requirements for the different types of genomics data are covered in the section 'Data processing pipelines'.
Data storage
As mentioned in the overview, we need to store individual genotypes as well as additional genomic information. In fact, information can be stored at three levels:
- individual level: such as actual SNP alleles as determined by either a SNP array or directed genotyping, or differences in sequence of the genome compared to a standard reference genome
- population level: such as allele frequencies within e.g. a HapMap or study population (open question: should these also be computed for populations in our own database?)
- species level: such as the complete NCBI human genome build 37.1 gene positions, or the chromosome locations of SNPs in dbSNP
Data at the individual level are covered in the section 'Genotypic information'. Population and species level data are stored in the genome database of the species involved, see 'Genome database'.
Genotypic information
On a per subject basis, the following information should be stored:
- subject ID
- subject population or strain
- subject study (name of the study to which the subject belongs)
- variant class (SNP, indel, CNV, transgenic knockout, transgenic overexpression, RNAi knockdown, chromosome translocation, chromosome inversion)
- method of detecting variant (single SNP genotyping, GWAS, CGH, sequencing, PCR, RFLP)
- variant ID
- variant alleles
- phenotype
- condition of environment (e.g., high-fat intervention, exercise challenge)
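The per-subject fields listed above could be captured in a small record type. The following is a minimal sketch, not a prescribed data model: the class and field names (`GenotypeRecord`, `VariantClass`, `DetectionMethod`) and all example values are assumptions made for illustration, and only the enumerated values come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class VariantClass(Enum):
    """Variant classes named in the field list (names are illustrative)."""
    SNP = "SNP"
    INDEL = "indel"
    CNV = "CNV"
    TRANSGENIC_KNOCKOUT = "transgenic knockout"
    TRANSGENIC_OVEREXPRESSION = "transgenic overexpression"
    RNAI_KNOCKDOWN = "RNAi knockdown"
    CHROMOSOME_TRANSLOCATION = "chromosome translocation"
    CHROMOSOME_INVERSION = "chromosome inversion"


class DetectionMethod(Enum):
    """Detection methods named in the field list (names are illustrative)."""
    SINGLE_SNP_GENOTYPING = "single SNP genotyping"
    GWAS = "GWAS"
    CGH = "CGH"
    SEQUENCING = "sequencing"
    PCR = "PCR"
    RFLP = "RFLP"


@dataclass(frozen=True)
class GenotypeRecord:
    """One variant observed in one subject; a hypothetical record layout."""
    subject_id: str
    population_or_strain: str
    study: str
    variant_class: VariantClass
    detection_method: DetectionMethod
    variant_id: str
    variant_alleles: str
    phenotype: str
    environment_condition: str


# All values below are made-up placeholders, not real study data.
record = GenotypeRecord(
    subject_id="S001",
    population_or_strain="example population",
    study="example study",
    variant_class=VariantClass.SNP,
    detection_method=DetectionMethod.SINGLE_SNP_GENOTYPING,
    variant_id="rs0000001",
    variant_alleles="AG",
    phenotype="example phenotype",
    environment_condition="high-fat intervention",
)
```

Using enumerations for variant class and detection method keeps the allowed values in one place. Whether these should instead be free text or ontology terms is left open here.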
SNP alleles
The SNP genotypes should be stored in a large table with one row per subject and SNP, and the following columns:
- subject ID
- SNP rsID
- SNP allele sequence
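As an illustration, the three columns above map naturally onto a single relational table. The sketch below uses SQLite from the Python standard library purely as a stand-in; the table name `snp_genotype`, the column names, the composite primary key and the example rows are all assumptions, not part of the specification.

```python
import sqlite3


def create_genotype_table(conn):
    """Create a table with one row per (subject, SNP) pair."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS snp_genotype (
            subject_id TEXT NOT NULL,
            rs_id      TEXT NOT NULL,
            alleles    TEXT NOT NULL,
            PRIMARY KEY (subject_id, rs_id)
        )
        """
    )


def store_genotypes(conn, rows):
    """Insert (subject_id, rs_id, alleles) tuples, replacing earlier calls."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO snp_genotype (subject_id, rs_id, alleles) "
            "VALUES (?, ?, ?)",
            rows,
        )


# Made-up placeholder identifiers and alleles.
conn = sqlite3.connect(":memory:")
create_genotype_table(conn)
store_genotypes(
    conn,
    [
        ("S001", "rs0000001", "AG"),
        ("S001", "rs0000002", "CC"),
        ("S002", "rs0000001", "GG"),
    ],
)
```

The composite key assumes that a subject has at most one stored genotype per SNP. If repeated measurements must be kept, the key would need an extra column such as an assay identifier.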
Genome database
The genome database is meant to facilitate and support biological queries. It has a general structure, described below, but there is an actual instance of this database for each species involved. At this point, it is likely that three species are key: human, mouse and rat.
It contains the following information:
Reference genome
For each species in the database, a reference genome build should be chosen and stored, so that all genomic positions refer to the same build. The chosen builds are:
Species | Provider | Build |
---|---|---|
Homo sapiens | NCBI | 36.3, later 37.1 (provisional) |
Mus musculus | NCBI | 37.1 |
Rattus norvegicus | NCBI | RGSC v3.4 |
SNP database
The SNP database should be a table with the SNPs in the rows and the following columns:
- organism
- SNP rsID
- RefSNP alleles
- Ancestral allele
- Chromosome number
- Chromosome location on reference genome
- Associated gene(s)
- HGVS names
- mRNA accession
- mRNA position
- mRNA codon change
- protein accession
- protein position
- protein residue change
Estimated size: roughly 20 million rows (an estimate, still to be confirmed).
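One possible relational layout for the SNP database is sketched below, again with SQLite as a stand-in. All table and column names are assumptions. Splitting "Associated gene(s)" into a separate link table `snp_gene` is a design choice made for this sketch, because one SNP can be associated with several genes; the column list above does not prescribe it.

```python
import sqlite3

# Hypothetical schema; column names are derived from the list above.
SCHEMA = """
CREATE TABLE snp (
    organism               TEXT NOT NULL,
    rs_id                  TEXT NOT NULL,
    refsnp_alleles         TEXT,
    ancestral_allele       TEXT,
    chromosome             TEXT,     -- text, to allow values such as X or Y
    position               INTEGER,  -- location on the reference genome build
    hgvs_names             TEXT,
    mrna_accession         TEXT,
    mrna_position          INTEGER,
    mrna_codon_change      TEXT,
    protein_accession      TEXT,
    protein_position       INTEGER,
    protein_residue_change TEXT,
    PRIMARY KEY (organism, rs_id)
);

-- One row per SNP/gene association (assumed design, see lead-in).
CREATE TABLE snp_gene (
    organism       TEXT NOT NULL,
    rs_id          TEXT NOT NULL,
    entrez_gene_id INTEGER NOT NULL,
    PRIMARY KEY (organism, rs_id, entrez_gene_id),
    FOREIGN KEY (organism, rs_id) REFERENCES snp (organism, rs_id)
);

-- Supports lookups by genomic location.
CREATE INDEX snp_by_location ON snp (organism, chromosome, position);
"""


def create_snp_database(conn):
    """Create the SNP tables and index in an empty database."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_snp_database(conn)

# Column names of the snp table, read back from the database.
snp_columns = [row[1] for row in conn.execute("PRAGMA table_info(snp)")]
```

With a table of this size, indexes on the lookup paths (by rsID, by gene, by location) are likely to matter more than the exact column types.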
Biological queries / data features
SNP data by gene
- Feature name: SNPsByGene
- Question: for subjects [X], give me the allele data for all SNPs that are associated with gene [Z]
- Input: gene Entrez ID and subject list
- Output: a table with subjects in rows and the found SNPs in the columns, and the alleles in the cells
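The SNPsByGene feature described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the module's implementation: the function name `snps_by_gene`, the in-memory input structures, the gene IDs and all example data are hypothetical. Genotypes are given as (subject ID, rsID, alleles) tuples and gene associations as a mapping from rsID to a set of Entrez gene IDs.

```python
def snps_by_gene(entrez_gene_id, subject_ids, genotypes, snp_genes):
    """Return (header, rows): subjects in rows, found SNPs in columns.

    genotypes: iterable of (subject_id, rs_id, alleles) tuples
    snp_genes: dict mapping rs_id -> set of associated Entrez gene IDs
    Cells are None where a subject has no stored genotype for a SNP.
    """
    wanted_subjects = set(subject_ids)
    gene_snps = {rs for rs, genes in snp_genes.items() if entrez_gene_id in genes}

    table = {subject: {} for subject in subject_ids}
    found = set()
    for subject, rs_id, alleles in genotypes:
        if subject in wanted_subjects and rs_id in gene_snps:
            table[subject][rs_id] = alleles
            found.add(rs_id)

    columns = sorted(found)  # only SNPs actually found for these subjects
    rows = [
        [subject] + [table[subject].get(rs_id) for rs_id in columns]
        for subject in subject_ids
    ]
    return ["subject_id"] + columns, rows


# Made-up placeholder data; 1001 and 1002 are not real Entrez gene IDs.
genotypes = [
    ("S001", "rs0000001", "AG"),
    ("S001", "rs0000002", "CC"),
    ("S002", "rs0000001", "GG"),
    ("S002", "rs0000003", "TT"),
    ("S003", "rs0000001", "AA"),
]
snp_genes = {
    "rs0000001": {1001},
    "rs0000002": {1001, 1002},
    "rs0000003": {1002},
}

header, rows = snps_by_gene(1001, ["S001", "S002"], genotypes, snp_genes)
```

In this example the header is `subject_id, rs0000001, rs0000002`. Subject S002 has an empty cell for rs0000002, and subject S003 is left out because it was not in the requested subject list.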
Data processing pipelines
SNP array readout ?
Convert the SNP array machine output to the SNP genotype table specified in the section 'SNP alleles' under 'Data storage'.
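The machine output format is not specified here, so the following sketch has to assume one. It supposes a tab-delimited export with the hypothetical columns `sample_id`, `rs_id`, `allele1` and `allele2`, and a hypothetical no-call marker `-`. Real array software will use different layouts, and the parser would need to be adapted per platform. The output is a list of (subject ID, rsID, alleles) rows in the shape of the SNP genotype table.

```python
import csv
import io

NO_CALL = "-"  # assumed marker for a failed genotype call


def read_array_export(handle):
    """Convert an assumed tab-delimited array export into genotype rows.

    Returns a list of (subject_id, rs_id, alleles) tuples. Rows with a
    no-call allele are skipped. The two alleles are sorted so that "GA"
    and "AG" are stored identically (a normalisation chosen for this sketch).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for line in reader:
        allele1, allele2 = line["allele1"], line["allele2"]
        if NO_CALL in (allele1, allele2):
            continue
        alleles = "".join(sorted((allele1, allele2)))
        rows.append((line["sample_id"], line["rs_id"], alleles))
    return rows


# Made-up example export; identifiers and calls are placeholders.
example_export = io.StringIO(
    "sample_id\trs_id\tallele1\tallele2\n"
    "S001\trs0000001\tG\tA\n"
    "S001\trs0000002\tC\tC\n"
    "S002\trs0000001\t-\t-\n"
)
genotype_rows = read_array_export(example_export)
```

Whether no-calls should be dropped or stored explicitly is a decision the pipeline requirements still need to make. Dropping them is only the simplest choice for this sketch.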