Raw results of NGS de novo assembly


This page contains the raw results of the de novo assembly software evaluation task. The task is carried out mainly by Jan van Haarst at Wageningen University and Research Centre.

Testing environment

  • Model: Dell PowerEdge R710 (from 2010)
  • OS: Ubuntu 10.04
  • CPU: 4 x Six-Core Intel Xeon CPU (X5650) @ 2.67GHz
  • Memory: 48 GB
  • Hard disk: Dell MD1000 over Gbit NFS

Metric Explanation

  • N50 length

Order all contigs from longest to shortest and add up their lengths until the running sum exceeds 50% of the total length of all contigs. The length of the last contig added is the N50 length. In general, a larger N50 length indicates a more contiguous, higher-quality assembly.

  • N50 index

Order all contigs from longest to shortest and add up their lengths until the running sum exceeds 50% of the total length of all contigs. The position (counting from 1) of the last contig added is the N50 index, that is, the number of contigs needed to cover half of the assembly. In general, a smaller N50 index indicates a more contiguous, higher-quality assembly.
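The two definitions above can be sketched in a few lines of Python. This is only an illustration, not the script used to produce the tables on this page: the function name `n50` and the example contig lengths are made up, and the strict "exceeds 50%" comparison follows the wording above (some tools use "reaches at least 50%" instead, which can give a different answer when the running sum lands exactly on the halfway point).

```python
def n50(contig_lengths):
    """Return (N50 length, N50 index) for an iterable of contig lengths.

    Hypothetical helper following the definition on this page:
    sort contigs from longest to shortest and accumulate lengths
    until the running sum exceeds half of the total length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    if total <= 0:
        raise ValueError("need at least one contig with positive length")

    running = 0
    for index, length in enumerate(lengths, start=1):
        running += length
        # Integer comparison avoids floating point: running > total / 2
        if 2 * running > total:
            return length, index

    # Not reachable when total > 0, kept as a safeguard.
    raise RuntimeError("N50 could not be determined")


if __name__ == "__main__":
    # Made-up contig lengths, total 290; half of the total is 145.
    # Running sums: 80, 150 -> the second contig (length 70) crosses 145.
    example = [20, 80, 30, 70, 40, 50]
    length, index = n50(example)
    print("N50 length:", length)
    print("N50 index:", index)
```

For the made-up lengths above, the running sum passes the halfway point at the second-longest contig, so the N50 length is 70 and the N50 index is 2. An assembly consisting of a single contig always has an N50 index of 1 and an N50 length equal to the length of that contig.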

Short Reads (Illumina)

Program name | Version | Input data | Paired run | Run settings | Runtime (in minutes) | Max memory | % sequence length | Largest contig | Number of contigs larger than 100 | N50 length | N50 index | Number of mismatches to reference[1] (SNP+Indels)[2] | Percentage of N's in consensus | Percentage of unused reads in assembly | Remarks
Velvet | 2010-06-28 : 1.0.03 | ENA SRX000429[3], unfiltered short reads E.coli | No | 31 -exp_cov auto -min_contig_lgth 100 -cov_cutoff auto -read_trkg yes -amos_file yes | 15 [3] | 3.6 GB | 98.04 | 90008 | 569 | 19122 | 66 | 15 | 0 | 9.73 | -
Velvet | 2010-06-28 : 1.0.03 | ENA SRX000429[3], unfiltered short reads E.coli | Yes | 31 -exp_cov auto -min_contig_lgth 100 -cov_cutoff auto -read_trkg yes -amos_file yes -ins_length 200 | 15 [3] | 3.6 GB | 98.21 | 268136 | 179 | 98915 | 16 | 43 | 0.1 | 9.58 | -
SOAPdenovo | 2009-12-21 : 1.04 | ENA SRX000429[3], unfiltered short reads E.coli | No | all -s soapdenovo.config -o soapdenovo -K 31 -p 4 -R | 2 | 3.9 GB | 98.06 | 77302 | 620 | 17052 | 75 | 13 | 0 | ? | -
SOAPdenovo | 2009-12-21 : 1.04 | ENA SRX000429[3], unfiltered short reads E.coli | Yes | SOAPdenovo all -s soapdenovo_coli_paired.config -o soapdenovo -K 31 -p 4 -R | 3 | 3.9 GB | 98.19 | 267992 | 167 | 111745 | 14 | 26 | 0.0 | ? | -
ABySS | 2010-05-26 : 1.2.0 | ENA SRX000429[3], unfiltered short reads E.coli | No | ABYSS --kmer=31 --verbose | 5 | 1 GB | 97.9 | 67081 | 595 | 18242 | 71 | 12 | 0.0 | ? | aptitude install libsparsehash-dev openmpi-bin openmpi-common libopenmpi-dev
ABySS | 2010-05-26 : 1.2.0 | ENA SRX000429[3], unfiltered short reads E.coli | Yes | abyss-pe -j1 k=31 n=10 v=-v name=coli lib=pairedend pairedend=SRR001665_1.fastq SRR001665_2.fastq | 8 | 1 GB | 99.8 | 210773 | 127 | 87403 | 19 | 51 | 0.0 | ? | aptitude install libsparsehash-dev openmpi-bin openmpi-common libopenmpi-dev
CLC | 2010-04-06 : 3.0.3 | ENA SRX000429[3], unfiltered short reads E.coli | No | clc_novo_assemble --min-length 100 --output clc_contigs.fasta --reads | 2 | 0.5 GB | 98.13 | 50240 | 776 | 11851 | 118 | 16 | 0 | ? | licence file must be in cwd
CLC | 2010-04-06 : 3.0.3 | ENA SRX000429[3], unfiltered short reads E.coli | Yes | clc_novo_assemble --min-length 100 -p fb ss 180 250 --output clc_contigs.fasta --reads | 3 | 1 GB | 97.96 | 107342 | 363 | 30635 | 47 | 31 | 0 | ? | licence file must be in cwd

Long Reads (454)

Program name | Version | Input data | Paired run | Run settings | Runtime (in minutes) | Max memory | % sequence length | Largest contig | Number of contigs larger than 100 | N50 length | N50 index | Number of mismatches to reference[1] (SNP+Indels)[2] | Percentage of N's in consensus | Percentage of unused reads in assembly | Remarks
newbler | 2010-03-15 : v2.3 (091027_1459) | ENA SRR000868-SRR000873, ENA SRR001355[4], unfiltered long reads E.coli | Yes | newbler -ace | 27 minutes total | 1.2 GB | 99.2 | 2500097 | 8 | 2500097 | 1 | 14 | 1.4 | 2.08 | -
CABOG | 2010-05-01 : v6.1 | ENA SRR000868-SRR000873, ENA SRR001355[4], unfiltered long reads E.coli | Yes | runCA -d $PROJECT -p $PROJECT createACE=1 cleanup=none *.frg | 210 minutes total | 1.4 GB | 99.09 | 130409 | 242 | 37157 | 38 | 42 | 0 | 21.26 | -
MIRA | 2010-07-05 : 3.2.0rc1 | ENA SRR000868-SRR000873, ENA SRR001355[4], unfiltered long reads E.coli | Yes | mira --project=coli --job=denovo,genome,accurate,454 | 3960 minutes total | 8.6 GB | 102 | 414666 | 363 | 141313 | 11 | 27 | 0 | ? | -

Contact

For any questions or comments about this benchmark, please contact us at ngstools-denovo@trac.nbic.nl

Notes

  1. The reference used is Escherichia coli str. K-12 substr. MG1655.
  2. These numbers are the number of GSNPs and GIndels that dnadiff from MUMmer 3.22 reports at standard settings.
  3. These sequence reads are also used in Simpson et al., "ABySS: A parallel assembler for short read sequence data".
  4. These sequence reads are also used in Miller et al., "Aggressive assembly of pyrosequencing reads with mates". They were converted to SFF using sff-dump -A SRR001028 -D ena/SRR001028/ , with the sff-dump that is in http://www.ncbi.nlm.nih.gov/Traces/sra/static/sra_toolkit-1.0.0-b7-glibc.6-x86_64.tar.gz