NGS Visualization Expertise

From BioAssist

This page describes the visualization needs in NGS and our task force's efforts to address them.

This page is currently under construction

NBIC activities on Genomic data visualization

Visualization needs for NGS

The amount of data generated in Next Generation Sequencing (NGS) experiments poses very specific problems for data visualization. In a typical NGS experiment, many millions of reads are generated. These reads are aligned to a reference sequence by alignment software. This yields a dataset with heterogeneous properties per read, such as the bases that make up the read, the quality per base, the reference identifier and the position at which the read starts. As this dataset contains many millions of reads, direct visualization of the data is impractical. To overcome this limitation, data transformations are necessary. For NGS data, several transformations are commonly used, such as summary statistics and coverage determination.

Summary statistics

To visualize summary statistics for a typical NGS experiment, a researcher first needs to calculate them. Several programs can be used for this, such as Bamtools [1] or Picard [2]. Both tools require binary SAM (BAM) files as input and are run from the command line, so they are less user-friendly than tools with a graphical user interface. In addition, the calculated statistics are printed to the terminal, so the user has to collect them and enter them into a spreadsheet. Statistical tools, such as [3] or [4], can then be used to visualize the data. These steps are difficult for a novice user to perform. Implementing these tools in Galaxy may substantially simplify these procedures [5].
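To make the idea of summary statistics concrete, the following is a minimal Python sketch that computes a few of them directly from SAM-formatted text lines and writes them out as CSV for a spreadsheet. It is not how Bamtools or Picard work internally. The example reads, the function names (summarize, to_csv) and the choice of statistics are assumptions made for illustration. Base qualities are assumed to be Phred+33 encoded.

```python
import csv
import io

# Three hypothetical SAM alignment lines (tab-separated). Real input would
# come from a SAM file or from BAM converted to text.
SAM_LINES = [
    "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t16\tchr1\t102\t30\t4M\t*\t0\t0\tCGTA\t5555",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGGG\t######",
]


def summarize(sam_lines):
    """Return simple per-dataset statistics for an iterable of SAM lines."""
    total = mapped = total_bases = 0
    qual_sum = qual_bases = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        seq, qual = fields[9], fields[10]
        total += 1
        if not flag & 0x4:  # flag bit 0x4 marks an unmapped read
            mapped += 1
        total_bases += len(seq)
        if qual != "*":  # "*" means qualities were not stored
            qual_sum += sum(ord(c) - 33 for c in qual)  # Phred+33 (assumed)
            qual_bases += len(qual)
    return {
        "reads": total,
        "mapped_reads": mapped,
        "mean_read_length": total_bases / total if total else 0.0,
        "mean_base_quality": qual_sum / qual_bases if qual_bases else 0.0,
    }


def to_csv(stats):
    """Serialize the statistics as two-column CSV text (name, value)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["statistic", "value"])
    for name, value in stats.items():
        writer.writerow([name, value])
    return buffer.getvalue()


stats = summarize(SAM_LINES)
print(to_csv(stats))
```

The CSV text can be saved to a file and opened in a spreadsheet or a statistical tool, which removes the manual copy step described above.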

Genome browsers

Many genome browsers have been developed over the last few years, but not all of them can visualize next-generation sequencing data. Many require specific input formats that have to be generated from your data, and most require the coverage to be calculated first. This value states the (relative) number of times each base in the reference sequence was sequenced. It can be obtained with the samtools software (samtools pileup) [6].
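The coverage calculation itself can be sketched in a few lines of Python. This is an illustration of the concept, not the algorithm samtools uses. It assumes ungapped alignments given as hypothetical (start, length) pairs with 1-based start positions, and it ignores CIGAR operations such as insertions, deletions and clipping.

```python
def coverage(ref_length, alignments):
    """Per-base read depth for a reference of ref_length bases.

    alignments is an iterable of (start, length) pairs, where start is the
    1-based position of the first aligned base (ungapped reads assumed).
    """
    # Difference array: +1 where a read begins, -1 just past where it ends.
    diff = [0] * (ref_length + 1)
    for start, length in alignments:
        begin = max(start - 1, 0)                  # convert to 0-based
        end = min(start - 1 + length, ref_length)  # clip to the reference
        if begin >= end:
            continue
        diff[begin] += 1
        diff[end] -= 1

    # A running sum over the difference array gives the depth per base.
    depth = []
    running = 0
    for delta in diff[:ref_length]:
        running += delta
        depth.append(running)
    return depth


# Three hypothetical reads on a 10-base reference; the last one overhangs
# the end of the reference and is clipped.
reads = [(1, 5), (3, 4), (8, 5)]
print(coverage(10, reads))
```

The difference-array approach touches each read once and each reference base once, so the cost does not grow with read length. This matters when many millions of reads are processed.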



UCSC and ENSEMBL genome browsers have worked

plot tracks on the genome

the concept of coverage

Standardized data formats

  • GFF
  • BED
  • WIG
  • BedGraph
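As an illustration of one of these formats, the sketch below converts a per-base coverage list into BedGraph lines. BedGraph uses four columns (chromosome, start, end, value) with 0-based, half-open coordinates. The function name to_bedgraph and the example values are assumptions made for illustration. A real track usually also carries a track definition line, which is omitted here.

```python
def to_bedgraph(chrom, depth):
    """Collapse runs of equal depth into BedGraph lines.

    depth[i] is the coverage of the base at 0-based position i.
    Each returned line is "chrom<TAB>start<TAB>end<TAB>value".
    """
    lines = []
    run_start = 0
    for i in range(1, len(depth) + 1):
        # Close the current run at the end of the data or when the value changes.
        if i == len(depth) or depth[i] != depth[run_start]:
            lines.append(f"{chrom}\t{run_start}\t{i}\t{depth[run_start]}")
            run_start = i
    return lines


# Hypothetical coverage values for the first ten bases of chr1.
for line in to_bedgraph("chr1", [1, 1, 2, 2, 2, 1, 0, 1, 1, 1]):
    print(line)
```

Merging runs of equal values keeps the file small for regions with constant coverage, which is the main reason to prefer this layout over one line per base.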

Several examples

  • UCSC
  • Ensembl
  • Commercial software

Nielsen et al., 2010

A fast and easy-to-use genome browser implemented by our member Frans Paul Ruzius: http://genetics.genomicscenter.nl/tagbrowser/



Sequence Viewers

The "pixel problem"


Genome Browsers

Sequence Viewers

plot "reads" on a genomic backbone

Examples:

  • samtools view
  • ...
  • Commercial software

Nielsen et al., 2010

Custom visualizations

  • R as a visualization platform

Visualization modules and NBIC users