NGS De Novo Assembly Expertise
- 1 Overview of NBIC de-novo assembly activities
- 2 De novo Assembly Software & NBIC users
- 3 Benchmarking of De-novo Assembly Software
- 4 Representation format of genome assembly
Overview of NBIC de-novo assembly activities
Wageningen University and Research Centre
Jan van Haarst is the main NBIC NGS researcher working on de-novo assembly. His team works primarily on plant de-novo assembly.
- Tomato : Mostly 454 Ti data, plus Sanger reads and a couple of SOLiD runs (in total about 90 Gbase)
- Potato : Mostly Illumina reads, ranging from 75-125 bp, plus Sanger reads and a few 454 runs.
Amsterdam Medical Center
Barbera van Schaik has experience with de-novo assembly using Roche 454 data. Marcel Willemsen is working with SOLiD data. They have performed some de-novo assembly tasks on bacteria using CABOG.
Leiden Genome Technology Center
The group of Matthew Hestand offers de-novo assembly services. The main tool used is Velvet. This group has a Helicos machine (3rd generation sequencer?) that offers de-novo sequencing capability at a much lower cost.
Erasmus Medical Center
The Bioinformatics group (Andrew Stubbs and Stephan Nouwens) at EMC, together with the group of Wilfred van Ijcken, performs a considerable number of de-novo assembly tasks for internal EMC users. The main tools used are Velvet and CLC. ABySS is also being investigated and tested.
On-going de novo assembly projects
|Group||Organism||Genome Property||Assembly Requirement||Machine & Data format||Used Software||Quality Control|
|AMC||Bacteria (human, mouse in future)||2MB|| ||Roche FLX/Ti, future SOLiD; sff, fasta+qual||Newbler, CABOG||Standard sequencing QA scores (#contigs, size, N50), contamination check|
|CVI||Pathogenic bacteria||Currently genomes of 2 major species, both around 2MB in size.||Determine antigenic variation, targets for high-resolution typing, virulence factors and vaccine candidates. We are not yet at the SNP level and can make do with larger insertions/deletions. For typing, SNPs may become important; we aim at more than 10-15 fold coverage.||Roche FLX/Ti, Illumina; sff||DNASTAR, tgicl, Spaghetti, SOAPdenovo, CLC bio|
|EMC||Human, mouse||3Gbp.||Discover genomic rearrangements, i.e. structural variation||76 bp paired-end reads from an Illumina Genome Analyzer IIx; SAM||Velvet, CLC bio||We use BACs as a control and check for vector assembly; we BLAST against mouse.|
|EMC||Ellobius||~3Gbp.||Discover genetic sex mechanism. 3 fold oversampling||36 or 76 bp paired-end reads using BACs and genomic DNA; SAM||ABySS?||BLAST as a check on non-rearranged fragments|
|RIVM||Mycobacterium tuberculosis||4.4MB||We are mainly looking for polymorphisms between the sequenced genomes and relative to a reference genome. With paired-end sequencing we usually obtain 10-20 scaffolds per genome; without paired-end data, about 150 contigs.||Roche/Illumina; sff, fasta+qual||Newbler, GSMapper, ROAST|
|NBIC||Single isolates: mainly bacteria, with a focus on gram-positive types.||4 Mb, haploid, but sometimes many plasmids.||Currently between 50 and 400 scaffolds. Focus on gene content per isolate.||454, Solexa||Celera, Arachne and Newbler for 454; ABySS, SSAKE, VCAKE, Velvet and VAAL for Solexa (ABySS seems to perform best)||Guidelines and fast tools; currently it is difficult to do a good assessment.|
|NBIC||Metagenomes: many different species|| ||Focus on contigs containing genes; larger pieces may be useful.||454, Solexa||Celera, Arachne and Newbler for 454; ABySS, SSAKE, VCAKE, Velvet and VAAL for Solexa (ABySS seems to perform best)||Guidelines and fast tools; currently it is difficult to do a good assessment.|
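Several groups in the table report the standard QA scores (number of contigs, total size, N50). As a minimal sketch of how these can be derived from a list of contig lengths, the Python snippet below uses the common definition of N50: the length of the contig at which the running total, taken over contigs sorted from longest to shortest, first reaches half of the total assembly size. The function names `n50` and `assembly_summary` are illustrative and do not come from any of the assemblers listed.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bases)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that brings the running total to at least
        # half of the assembly size defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


def assembly_summary(contig_lengths):
    """Collect the three basic QA scores in one dictionary."""
    return {
        "contigs": len(contig_lengths),
        "total_size": sum(contig_lengths),
        "n50": n50(contig_lengths),
    }


if __name__ == "__main__":
    # Toy example: total size 290, half is 145; 80 + 70 = 150 >= 145.
    print(assembly_summary([80, 70, 50, 40, 30, 20]))
```

In practice the lengths would be read from the assembler's contig FASTA file; the toy list here only serves to make the arithmetic easy to follow.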
De novo Assembly Software & NBIC users
|Name||Description||Input&Output||NBIC users experience||Evaluation Status|
|Celera Assembler (CABOG)||CA is a "whole genome shotgun sequence assembler". Version 5 of CA (CABOG) is robust to homopolymer run length uncertainty, high read coverage, and heterogeneous read lengths.|| ||Jan, Barbera, Victor||Phase 1|
|Newbler||454's assembler that comes with the machine. It is not open source, and it is specifically adapted to handle the main source of error in 454 sequencing: ambiguity in the length of homopolymer runs.|| ||Jan, Victor, Christian, Barbera, Jurgen P||Phase 1|
|Velvet||Needs about 20-25X coverage and paired reads. Developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI).|| ||Jan, Matt, Jurgen, Victor||Phase 1|
|ABySS||"The single-processor version is useful for assembling genomes up to 40-50 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes."|| ||Jan, Jurgen, Victor||Phase 1|
|MIRA3||MIRA (Mimicking Intelligent Read Assembly) is able to perform true hybrid de-novo assemblies using reads gathered through 454 sequencing technology (GS20 or GS FLX). Compatible with 454, Solexa and Sanger data. Linux OS required.|| ||Jan, Victor||Phase 1|
|SHORTY||"..Our assembler SHORTY is targetted for de novo assembly of microreads with mate pair information and sequencing errors. SHORTY has some novel approach and features in addressing the short read assembly problem.."|| ||Jan|
|ALLPATHS||De novo assembly of whole-genome shotgun microreads.|
|SHARCGS||SHARCGS is a suitable tool for fully exploiting novel sequencing technologies by assembling sequence contigs de novo with high confidence and by outperforming existing assembly algorithms in terms of speed and accuracy. Authors are Dohm JC, Lottaz C, Borodina T and Himmelbauer H. from the Max-Planck-Institute for Molecular Genetics.|
|EDENA||De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Made by Hernandez D et al.|| ||Jan|
|VCAKE||De novo assembly of short reads with robust error correction. An improvement on early versions of SSAKE.|
|ALLPATHS-LG||Introduced in January 2011 by the Broad Institute. It works on both small and large (mammalian-size) genomes.||To use it, you should first generate ~100 base Illumina reads from two libraries: one from ~180 bp fragments, and one from ~3000 bp fragments, both at about 45x coverage. Sequence from longer fragments will enable longer-range continuity.|
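Several entries in the table state coverage requirements (about 20-25X for Velvet, about 45x per library for ALLPATHS-LG). As a rough planning aid, the sketch below converts a target coverage into a number of reads, assuming the simple average-depth relation coverage = number of reads × read length / genome size. It ignores reads lost to quality filtering, and the function names are illustrative only.

```python
import math


def reads_needed(genome_size_bp, read_length_bp, target_coverage):
    """Smallest number of reads giving at least the target average coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


def fold_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average coverage obtained from n_reads reads of a fixed length."""
    return n_reads * read_length_bp / genome_size_bp


if __name__ == "__main__":
    # Assumed example: a 4 Mb bacterial genome, 100-base reads, 45x coverage.
    n = reads_needed(4_000_000, 100, 45)
    print(n, "reads per library")
    print(fold_coverage(n, 100, 4_000_000), "fold coverage")
```

For paired-end libraries each fragment yields two reads, so the number of fragments to sequence is half the number of reads returned here.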
Benchmarking of De-novo Assembly Software
More information can be found at Software Evaluation.
Representation format of genome assembly
NCBI defines a format for submitting the results of a whole-genome shotgun (WGS) project. A submission consists of the following:
Each project is assigned a 4-letter project accession prefix (e.g. "abcd") and a 2-digit version element (e.g. "01"). Contigs are numbered with a 6-digit number, so each contig has a unique identifier. For example, the first contig is numbered "abcd01000001".
A list of contigs in FASTA format (.fsa).
- Super contig/Scaffold
Optional. In AGP format.
A 5-column table in a .tbl file for each .fsa file.
Optional. If quality scores are submitted, they must be in files named *.qvl that are in the same directory and have the same nucleotide SeqIDs as the corresponding *.fsa files.
- For bacterial assemblies, please refer to http://www.ncbi.nlm.nih.gov/genbank/genomesubmit.html for a specific format.
- For metagenome assemblies, please refer to http://www.ncbi.nlm.nih.gov/genbank/metagenome.html.
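The contig identifier scheme described above (4-letter prefix, 2-digit version, 6-digit contig number) can be illustrated with a short Python sketch. The function name `contig_id` and the validation rules are assumptions inferred from that description; they are not part of any NCBI tool.

```python
def contig_id(prefix, version, number):
    """Build an identifier such as 'abcd01000001' from its three parts."""
    if len(prefix) != 4 or not prefix.isalpha():
        raise ValueError("prefix must consist of exactly 4 letters")
    if not 0 <= version <= 99:
        raise ValueError("version must fit in 2 digits")
    if not 1 <= number <= 999_999:
        raise ValueError("contig number must fit in 6 digits")
    # Zero-pad the version to 2 digits and the contig number to 6 digits.
    return f"{prefix}{version:02d}{number:06d}"


if __name__ == "__main__":
    # First three contigs of version 01 of the example project "abcd".
    for n in range(1, 4):
        print(contig_id("abcd", 1, n))
```

Generating identifiers this way keeps the numbering consistent between the .fsa files and any accompanying .qvl files, which must use the same SeqIDs.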