NGS De Novo Assembly Expertise

From BioAssist

Please feel free to contact Jan van Haarst or Leon Mei with questions or comments related to NBIC de novo assembly tasks.

Overview of NBIC de-novo assembly activities

Wageningen University and Research Centre

Jan van Haarst is the main NBIC NGS researcher working on de novo assembly. His team works primarily on plant de novo assembly.

  • Tomato : Mostly 454 Ti data, plus Sanger reads and a couple of SOLiD runs (in total about 90 Gbase)
  • Potato : Mostly Illumina reads, ranging from 75-125 bp, plus Sanger reads and a few 454 runs.

Amsterdam Medical Center

Barbera van Schaik has experience with de novo assembly using Roche 454 data. Marcel Willemsen is working with SOLiD data. They have performed some de novo assembly tasks on bacteria using CABOG.

Leiden Genome Technology Center

The group of Matthew Hestand offers de novo assembly services. The main tool used is Velvet. This group has a Helicos machine (3rd generation sequencer?) that offers de novo sequencing capability at a much lower cost.

Erasmus Medical Center

The Bioinformatics group at EMC (Andrew Stubbs and Stephan Nouwens), together with the group of Wilfred van Ijcken, performs a fair number of de novo assembly tasks for internal EMC users. The main tools used are Velvet and CLC. ABySS is also being investigated and tested.

On-going de novo assembly projects

Group | Organism | Genome Property | Assembly Requirement | Machine & Data Format | Used Software | Quality Control
AMC | Bacteria (human, mouse in future) | 2 Mb | | Roche FLX/Ti, future SOLiD; sff, fasta+qual | Newbler, CABOG | Standard sequencing QA scores (#contigs, size, N50), contamination check
CVI | Pathogenic bacteria | Genomes currently of 2 major species, both around 2 Mb in size. | Determine antigenic variation, targets for high-resolution typing, virulence factors and vaccine candidates. Currently we are not at SNP level yet and can do with larger insertions/deletions. For typing, SNPs may become important; we aim at >10-15 fold coverage. | Roche FLX/Ti, Illumina; sff | DNASTAR, tgicl, Spaghetti, SOAPdenovo, clcbio |
EMC | Human, mouse | 3 Gbp | Discover genomic rearrangements (structural variation) | 76 bp paired-end reads from an Illumina Genome Analyzer 2x; SAM | Velvet, Clcbio | We use BACs as a control and, as a check on vector assembly, we BLAST to mouse
EMC | Ellobius | ~3 Gbp | Discover genetic sex mechanism. 3 fold oversampling | 36 or 76 bp paired-end reads using BACs and genomic DNA; SAM | Abyss? | BLAST as a check on not re-arranged fragments
RIVM | Mycobacterium tuberculosis | 4.4 Mb | We are mainly looking for polymorphisms between the sequenced genomes and relative to a reference genome. With paired-end sequencing we usually obtained 10-20 scaffolds per genome, without paired-end about 150 contigs. | Roche/Illumina; sff, fasta+qual | Newbler, GSMapper, ROAST |
NBIC | Single isolates: mainly bacteria with a focus on gram positive types. | 4 Mb, haploid, but sometimes many plasmids. | Currently between 50-400 scaffolds. Focus on gene content per isolate. | 454, Solexa | Celera, Arachne and Newbler for 454; Abyss, ssake, vcake, Velvet and VAAL for Solexa (Abyss seems to perform best) | Guidelines and fast tools; currently it is difficult to do a good assessment.
NBIC | Metagenomes | Many different species | Focus on contigs containing genes; larger pieces may be useful. | 454, Solexa | Celera, Arachne and Newbler for 454; Abyss, ssake, vcake, Velvet and VAAL for Solexa (Abyss seems to perform best) | Guidelines and fast tools; currently it is difficult to do a good assessment.
LUMC | | | | | |
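
The quality control column above mentions the number of contigs, the assembly size and the N50. As a rough illustration, these could be computed from a list of contig lengths as sketched below. The function name `assembly_stats` and the example lengths are hypothetical, and N50 is taken here as the length of the contig at which the running sum of the contigs, sorted from largest to smallest, first reaches half of the total assembly size.

```python
def assembly_stats(contig_lengths):
    """Return (number of contigs, total size, N50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        # N50: length of the contig at which the running sum
        # first reaches half of the total assembly size.
        if running * 2 >= total:
            n50 = length
            break
    return len(lengths), total, n50


# Hypothetical contig lengths in bp, for illustration only.
example = [80, 70, 50, 40, 30, 20]
print(assembly_stats(example))
```

An assembler's own report may apply a minimum contig length before computing these numbers, so values from different tools are only comparable when the same cut-off is used.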

De novo Assembly Software & NBIC users

For each tool: name, description, input and output, NBIC users with experience, and evaluation status.

Celera Assembler (CABOG)
  • Description: CA is a "whole genome shotgun sequence assembler". Version 5 of CA (CABOG) is robust to homopolymer run length uncertainty, high read coverage, and heterogeneous read lengths.
  • Gives fewer and larger contigs with Roche FLX reads than Newbler. It occasionally has problems with Roche Titanium reads.
  • For long reads
  • Input: FRG, SFF
  • Output: ASM, QC Metrics, POSMAP, FASTA
  • NBIC users with experience: Jan, Barbera, Victor
  • Evaluation status: Phase 1

Newbler
  • Description: 454's assembler that comes with the machine. It is NOT open source and is specifically adapted to handle the main source of error in 454 sequencing: ambiguity in the length of homopolymer runs.
  • For long reads
  • Input: SFF, FASTA
  • Output:
  • NBIC users with experience: Jan, Victor, Christian, Barbera, Jurgen P
  • Evaluation status: Phase 1

Velvet
  • Description: Needs about 20-25X coverage and paired reads. Developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI).
  • For short reads
  • Supports paired input (by interleaving in a single file)
  • Input: FASTA, FASTQ, Eland, Gerald
  • Output: FASTA, AMOS (contigs.fa, stats.txt, velvet_asm.afg, LastGraph)
  • NBIC users with experience: Jan, Matt, Jurgen, Victor
  • Evaluation status: Phase 1

SOAPdenovo
  • For short reads
  • Input: FASTA, FASTQ
  • Output: *.contig, *.scafSeq
  • NBIC users with experience: Jan
  • Evaluation status: Phase 1

ABySS
  • Description: "The single-processor version is useful for assembling genomes up to 40-50 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes."
  • For short reads
  • Input: FASTA, FASTQ, qseq
  • Output: *.fa
  • NBIC users with experience: Jan, Jurgen, Victor
  • Evaluation status: Phase 1

MIRA3
  • Description: MIRA (Mimicking Intelligent Read Assembly) is able to perform true hybrid de novo assemblies using reads gathered through 454 sequencing technology (GS20 or GS FLX).
  • Compatible with 454, Solexa and Sanger data. Linux OS required.
  • NBIC users with experience: Jan, Victor
  • Evaluation status: Phase 1

Clcbio
  • Evaluation status: Phase 1

SHORTY
  • Description: "..Our assembler SHORTY is targetted for de novo assembly of microreads with mate pair information and sequencing errors. SHORTY has some novel approach and features in addressing the short read assembly problem.."
  • NBIC users with experience: Jan

ALLPATHS
  • Description: De novo assembly of whole-genome shotgun microreads.

SHARCGS
  • Description: SHARCGS is a suitable tool for fully exploiting novel sequencing technologies by assembling sequence contigs de novo with high confidence and by outperforming existing assembly algorithms in terms of speed and accuracy. Authors are Dohm JC, Lottaz C, Borodina T and Himmelbauer H. from the Max-Planck-Institute for Molecular Genetics.

EDENA
  • Description: De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Made by Hernandez D et al.
  • NBIC users with experience: Jan

VCAKE
  • Description: De novo assembly of short reads with robust error correction. An improvement on early versions of SSAKE.

ALLPATHS-LG
  • Description: Introduced in January 2011 by the Broad Institute. It works on both small and large (mammalian-size) genomes. To use it, you should first generate ~100 base Illumina reads from two libraries: one from ~180 bp fragments, and one from ~3000 bp fragments, both at about 45x coverage. Sequence from longer fragments will enable longer-range continuity.
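
Coverage figures such as the "20-25X" quoted for Velvet or the "about 45x" per library quoted for ALLPATHS-LG can be turned into an approximate number of reads with the usual relation coverage = number of reads × read length / genome size. The sketch below only illustrates that arithmetic; the function name `reads_needed` and the 4 Mb genome size in the example are assumptions chosen for illustration, not recommendations from any of the tools.

```python
import math


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Number of reads required to reach the requested average coverage."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


# Illustrative numbers: a 4 Mb genome, ~100 base reads, 45x for one library.
print(reads_needed(4_000_000, 45, 100))
```

For paired reads, the number of read pairs is half the number of reads returned here.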

Benchmarking of De-novo Assembly Software

More information can be found at Software Evaluation.

De novo assembly software evaluation pipeline

Representation format of genome assembly

NCBI defines a format for submitting the results of a WGS project. It consists of the following parts:

  • Numbering

Each project is assigned a 4-letter project accession prefix (e.g. "abcd") and a 2-digit version number (e.g. "01"). Contigs are numbered with a 6-digit number, so each contig has a unique identifier. For example, the first contig is numbered "abcd01000001".
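
The numbering scheme can be illustrated with a few lines of code. This is a minimal sketch based only on the description above; the function name `contig_accession` and its validation rules are hypothetical.

```python
def contig_accession(prefix, version, index):
    """Build a contig identifier: 4-letter prefix + 2-digit version + 6-digit number."""
    if len(prefix) != 4 or not prefix.isalpha():
        raise ValueError("prefix must be exactly 4 letters")
    if not 0 <= version <= 99:
        raise ValueError("version must fit in 2 digits")
    if not 1 <= index <= 999999:
        raise ValueError("contig number must fit in 6 digits")
    # Zero-pad the version to 2 digits and the contig number to 6 digits.
    return f"{prefix}{version:02d}{index:06d}"


print(contig_accession("abcd", 1, 1))  # abcd01000001
```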

  • Contigs

A list of contigs in FASTA format (.fsa).

  • Super contig/Scaffold

Optional. In AGP format.
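
To give an idea of what such a scaffold description looks like, the sketch below writes AGP lines for one scaffold built from ordered contigs. It assumes the nine-column, tab-separated layout of AGP version 2.0, a fixed gap length, and the gap values "scaffold", "yes" and "paired-ends"; all of these, as well as the function name `scaffold_to_agp`, are assumptions that should be checked against the current NCBI AGP specification.

```python
def scaffold_to_agp(scaffold_id, contigs, gap_length=100):
    """Return AGP lines for one scaffold.

    contigs: ordered list of (contig_id, length_bp, orientation) tuples,
    where orientation is "+" or "-".
    """
    lines = []
    position = 1  # AGP coordinates are 1-based and inclusive
    part = 1
    for i, (contig_id, length, orientation) in enumerate(contigs):
        if i > 0:
            # Gap line between consecutive contigs (component type N).
            end = position + gap_length - 1
            lines.append("\t".join(map(str, [
                scaffold_id, position, end, part, "N",
                gap_length, "scaffold", "yes", "paired-ends"])))
            position = end + 1
            part += 1
        # Contig line (component type W, a WGS contig), using the full contig.
        end = position + length - 1
        lines.append("\t".join(map(str, [
            scaffold_id, position, end, part, "W",
            contig_id, 1, length, orientation])))
        position = end + 1
        part += 1
    return lines


# Hypothetical scaffold of two contigs, for illustration only.
for line in scaffold_to_agp("scaffold1",
                            [("abcd01000001", 500, "+"),
                             ("abcd01000002", 300, "-")]):
    print(line)
```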

  • Annotation

A 5-column table in a .tbl file for each .fsa file.
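
As a rough sketch of the 5-column layout: a table starts with a ">Feature" line naming the sequence, feature lines carry start, stop and feature key in columns 1-3, and qualifier lines carry a qualifier key and value in columns 4-5. The function name `feature_table` and the example gene are hypothetical, and the layout shown is an assumption to be verified against the NCBI feature table documentation.

```python
def feature_table(seq_id, features):
    """Return the text of a 5-column feature table for one sequence.

    features: list of (start, stop, feature_key, qualifiers) tuples, where
    qualifiers is a list of (qualifier_key, value) pairs.
    """
    lines = [f">Feature {seq_id}"]
    for start, stop, key, qualifiers in features:
        # Columns 1-3: start, stop, feature key.
        lines.append(f"{start}\t{stop}\t{key}")
        for qual_key, value in qualifiers:
            # Columns 4-5: qualifier key and value; columns 1-3 stay empty.
            lines.append(f"\t\t\t{qual_key}\t{value}")
    return "\n".join(lines) + "\n"


# Hypothetical gene on the first contig, for illustration only.
print(feature_table("abcd01000001", [(1, 1050, "gene", [("gene", "abcA")])]), end="")
```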

  • Quality

Optional. If quality scores are submitted, they must be in files named *.qvl that are in the same directory and have the same nucleotide SeqIDs as the corresponding *.fsa files.
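
The naming and SeqID requirement can be checked before submission. The sketch below assumes that the corresponding *.qvl file shares the base name of the *.fsa file and that both use FASTA-style ">" header lines whose first word is the SeqID; the function names are hypothetical.

```python
from pathlib import Path


def fasta_ids(path):
    """Return the SeqIDs (first word after '>') found in a FASTA-style file."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                parts = line[1:].split()
                if parts:
                    ids.append(parts[0])
    return ids


def check_quality_files(directory):
    """Map each .fsa file name to a problem description; empty dict if none found."""
    problems = {}
    for fsa in sorted(Path(directory).glob("*.fsa")):
        # Assumption: the quality file has the same base name as the .fsa file.
        qvl = fsa.with_suffix(".qvl")
        if not qvl.exists():
            problems[fsa.name] = "missing " + qvl.name
        elif fasta_ids(fsa) != fasta_ids(qvl):
            problems[fsa.name] = "SeqIDs differ from " + qvl.name
    return problems
```

Because quality scores are optional, a "missing" entry is only a problem when quality scores are meant to be submitted for that file.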

Additional

  1. For bacterial assemblies, please refer to http://www.ncbi.nlm.nih.gov/genbank/genomesubmit.html for a specific format.
  2. For metagenome assembly, please refer to http://www.ncbi.nlm.nih.gov/genbank/metagenome.html.