NGS File Format

From BioAssist



The Sequence Read Archive (SRA) stores raw sequencing data from the next generation of sequencing platforms including Roche 454 GS System®, Illumina Genome Analyzer®, Applied Biosystems SOLiD® System, Helicos Heliscope®, Complete Genomics®, and others.

Genome Format

Read/Sequence Format

  • SRF (Sequence Read Format) was designed to be a single format capable of storing data generated by any DNA sequencing technology. However, it appears to have seen little use or mention since 2008.
  • FASTA is one of the earliest standard formats and is supported by BLAST. Files sometimes carry the extension FNA or FAA (FASTA nucleic acid or FASTA amino acid).
  • FASTQ is a text format that stores sequence data together with per-base quality scores. It is used as input by many programs. Quality scores are sometimes stored separately instead, in a QUAL file that accompanies a FASTA file.
  • SFF (Standard Flowgram Format) is used as output by the 454 sequencers.
  • SCARF is a standard Illumina output format for sequence data with quality scores.
  • AB1 is the chromatogram file format used by instruments from Applied Biosystems.
  • EMBL is the flat file format used by EMBL to represent nucleotide and peptide sequence records from its databases.
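
As an illustration of how FASTQ pairs each base with a quality value, the following is a minimal sketch of a reader, not a reference implementation. It assumes four lines per record (no line wrapping) and Phred+33 quality encoding; older Illumina files used an offset of 64, which can be passed as `offset`. The function name `read_fastq` and the example read are made up for this page.

```python
from io import StringIO


def read_fastq(handle, offset=33):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream.

    Assumption: records are not line-wrapped and qualities are Phred+offset.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or an empty line) stops the reader
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near %r" % header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ for %r" % header)
        # Each quality character encodes one integer score for one base.
        yield header[1:], seq, [ord(c) - offset for c in qual]


# Hypothetical single-read example.
example = "@read1\nACGT\n+\nII5!\n"
records = list(read_fastq(StringIO(example)))
print(records)
```

For real data, an established parser such as the one in Biopython is usually a safer choice than a hand-written reader.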

SNP Format

There is no single established SNP format; the list below gives the most important information to record:

  • Base frequencies
  • Reference base
  • Position (reference)
  • Position (consensus of all mapped reads)
  • Type (depending on the algorithm or tool used)
  • Overlapping annotation
  • Amino acid change (this list is a combination of the Illumina and CLC bio outputs)
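
The fields listed above could be held in a simple record type. The sketch below is only one possible layout: the class name `SnpRecord`, its field names, the helper `major_alternate`, and all example values are hypothetical and do not correspond to any particular tool's output.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SnpRecord:
    """One SNP call, holding the information listed above (hypothetical layout)."""
    reference_position: int            # position on the reference
    consensus_position: int            # position on the consensus of mapped reads
    reference_base: str
    base_frequencies: Dict[str, float]  # observed fraction per base
    snp_type: str                      # depends on the algorithm or tool used
    overlapping_annotation: Optional[str] = None
    amino_acid_change: Optional[str] = None

    def major_alternate(self) -> Optional[str]:
        """Return the most frequent non-reference base, or None if there is none."""
        alternates = {
            base: freq
            for base, freq in self.base_frequencies.items()
            if base != self.reference_base and freq > 0
        }
        if not alternates:
            return None
        return max(alternates, key=alternates.get)


# Made-up values, for illustration only.
snp = SnpRecord(
    reference_position=1042,
    consensus_position=1040,
    reference_base="A",
    base_frequencies={"A": 0.45, "G": 0.55},
    snp_type="heterozygous",
    overlapping_annotation="geneX exon 2",
    amino_acid_change="Lys12Glu",
)
print(snp.major_alternate())
```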

Gene/Protein Binding Format

Browser/Visualization Format

A comprehensive list of visualization data formats is available at [1]. Some commonly used formats are listed below:

  • SAM format is a compact and indexable representation of alignment results. It is the output format of many popular alignment tools, e.g., Bowtie, BWA, SOAP2, Illumina GA pipeline, MAQ, BLAST, etc.
  • BAM is the compressed binary version of the SAM format. It provides an efficient way of displaying very large alignment results in the UCSC Genome Browser.
  • WIG format is for display of dense, continuous data such as GC percent, probability scores, and transcriptome data.
  • BED is the main format to define the data lines that are displayed in an annotation track.
  • GFF is a format for describing genes and other features associated with DNA, RNA and Protein sequences.
  • GTF is a refinement to GFF that tightens the specification.
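
To make two of these formats more concrete, here is a minimal sketch under stated assumptions. The first function splits one SAM alignment line into its eleven mandatory tab-separated fields and checks two FLAG bits (0x4 for unmapped, 0x10 for reverse strand). The second converts a BED interval, which is 0-based with an exclusive end, into the 1-based inclusive coordinates used by GFF. The function names, the dictionary layout and the example line are made up for this page; header lines starting with `@` are not handled.

```python
SAM_FIELDS = (
    "qname", "flag", "rname", "pos", "mapq", "cigar",
    "rnext", "pnext", "tlen", "seq", "qual",
)
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of its mandatory fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated fields")
    record = {
        name: int(value) if name in INT_FIELDS else value
        for name, value in zip(SAM_FIELDS, columns)
    }
    record["tags"] = columns[11:]  # optional TAG:TYPE:VALUE fields, left unparsed
    record["is_unmapped"] = bool(record["flag"] & 0x4)
    record["is_reverse"] = bool(record["flag"] & 0x10)
    return record


def bed_to_gff_interval(chrom_start, chrom_end):
    """Convert BED (0-based, end-exclusive) to GFF (1-based, end-inclusive)."""
    return chrom_start + 1, chrom_end


# Hypothetical alignment: a 4-base read on the reverse strand at chr1:100.
aln = parse_sam_line("read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII")
print(aln["rname"], aln["pos"], aln["is_reverse"])
print(bed_to_gff_interval(99, 103))
```

For BAM files, or for SAM files of realistic size, a library built on the reference implementation (for example pysam) is likely the more practical route.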