Benchmarking De Novo Assembly Software

From BioAssist

This project aims to benchmark existing de novo assemblers systematically. One of the next steps is to collaborate with another NBIC NGS effort, which benchmarks alignment software, to create an automated benchmarking framework.

Evaluation Pipeline

Test Sequences

We plan to create and use six sets of test sequences drawn from the combinations of 3 organism types and 4 read types. A set of test sequences does not cover an entire genome but one chromosome or part of one.

Read type                                      | Bacteria (E. coli, ~4.6 Mbp) | Plant (Arabidopsis thaliana, genome size ~157 Mbp) | Mammal (human, mouse, ~3 Gbp)
Short read (36 bp, Illumina, FASTA/Q)          | ENTREZ SRX000429 [1]         |                                                    |
Paired short read (36 bp, Illumina, FASTA/Q)   | ENTREZ SRX000429 [1]         |                                                    |
Long read (~200 bp, Roche 454, SFF+Qual)       |                              |                                                    |
Paired long read (~200 bp, Roche 454, SFF+Qual) |                             |                                                    |


This generic software module converts a test sequence from its file format into the input file format required by a particular assembly program. For each assembler, the container can run it under two parameter settings:

  1. default
  2. optimized for a particular de novo assembly task
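
As an illustration of the format conversion step, here is a minimal Python sketch that turns FASTQ reads into FASTA. It assumes simple four-line FASTQ records (no wrapped sequence lines). The function names parse_fastq and fastq_to_fasta are hypothetical and not part of the project's actual module.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records.

    Assumption: each record is exactly four lines (header, sequence,
    '+' separator, quality), which holds for typical short-read files.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError(f"truncated FASTQ record: {header!r}")
        seq = seq.rstrip("\r\n")
        qual = qual.rstrip("\r\n")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield header[1:], seq, qual


def fastq_to_fasta(lines: Iterable[str]) -> str:
    """Convert FASTQ text lines to FASTA text, dropping quality values."""
    return "".join(f">{rid}\n{seq}\n" for rid, seq, _ in parse_fastq(lines))


if __name__ == "__main__":
    example = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
    print(fastq_to_fasta(example), end="")
```

A real converter would also need to handle SFF input and paired reads, which this sketch leaves out.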

Assembly Analysis

In the analysis step, we look at the quality of the assembly and the system performance of the assembler. More details about the metrics used can be found at Assembly metrics explained.
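
To make the two kinds of measurement concrete, the following Python sketch computes a few commonly used contig statistics and times a call. N50 is used here as an assumed example of an assembly quality metric; the metrics actually used are the ones described at Assembly metrics explained. The helper names n50, assembly_summary and timed are hypothetical.

```python
import time
from typing import Callable, Dict, Sequence, Tuple, TypeVar

T = TypeVar("T")


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a set of contig lengths (0 if there are none).

    N50 is the length L of the contig at which, going from the longest
    contig downwards, the running total first reaches half of the
    total assembly length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def assembly_summary(lengths: Sequence[int]) -> Dict[str, int]:
    """Collect basic quality statistics for one assembly."""
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "largest_contig": max(lengths, default=0),
        "n50": n50(lengths),
    }


def timed(func: Callable[[], T]) -> Tuple[T, float]:
    """Run func and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    summary, seconds = timed(lambda: assembly_summary(contigs))
    print(summary, f"{seconds:.6f}s")
```

Wall-clock time is only one aspect of system performance; peak memory would need a separate measurement.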


For each evaluated assembler, the report will contain 6*2 sets of results: the combination of 6 sets of test sequences and 2 types of parameter settings.
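
The 6*2 layout of a report can be sketched as a simple cross product. The test set names below are placeholders, since the six sets are not named on this page, and result_matrix is a hypothetical helper.

```python
from itertools import product
from typing import List, Sequence, Tuple

# Placeholder names: the six actual test sets are not listed here.
TEST_SETS = [f"test_set_{i}" for i in range(1, 7)]
# The two parameter settings described for the assembly container.
SETTINGS = ["default", "optimized"]


def result_matrix(
    test_sets: Sequence[str], settings: Sequence[str]
) -> List[Tuple[str, str]]:
    """Return every (test set, parameter setting) pair for one assembler."""
    return list(product(test_sets, settings))


if __name__ == "__main__":
    runs = result_matrix(TEST_SETS, SETTINGS)
    print(len(runs))  # 6 test sets * 2 settings
    for test_set, setting in runs:
        print(test_set, setting)
```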

De novo assembly software evaluation pipeline

Planning and Milestones

Phase 1

In phase 1, we will examine the following software on E. coli data:

  • Long reads: Newbler, CABOG, CLC, MIRA3
  • Short reads: Velvet, SOAPdenovo, ABySS, CLC

Progressive results can be found at

We aim to finish this by 15 April so that some results can be presented at the NGS meeting on 16 April.

Phase 2 (medium term)

  • We will start creating a generic evaluation pipeline in Galaxy
  • More software will be tested with other test sequences

Long term

  • We plan to create several de novo assembly pipelines in Galaxy, each recommended for a particular kind of dataset.


NGS De Novo Discussions



  1. This read set is also used in Simpson et al., "ABySS: A parallel assembler for short read sequence data"