Benchmarking De Novo Assembly Software

From BioAssist

This project aims to benchmark existing de novo assemblers systematically. One of the next steps is to collaborate with another NBIC NGS effort, which benchmarks alignment software, to create an automated benchmarking framework.

Evaluation Pipeline

Test Sequences

We plan to create and use twelve sets of test sequences, one for each combination of 3 organism types and 4 read types. A set of test sequences is not an entire genome but one chromosome or part of one.

| Read type | Bacteria (E. coli, ~4.6 Mbp) | Plant (Arabidopsis thaliana, ~157 Mbp) | Mammal (human, mouse, ~3 Gbp) |
|---|---|---|---|
| Short read (36 bp, Illumina, FASTA/Q) | ENTREZ SRX000429 [1] | | |
| Paired short read (36 bp, Illumina, FASTA/Q) | ENTREZ SRX000429 [1] | | |
| Long read (~200 bp, Roche 454, SFF+Qual) | | | |
| Paired long read (~200 bp, Roche 454, SFF+Qual) | | | |
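
The matrix above can also be enumerated programmatically, which is convenient when one pipeline has to loop over every organism and read-type combination. The following is a minimal sketch; the labels and the names `ORGANISMS`, `READ_TYPES`, and `enumerate_test_sets` are illustrative assumptions, not part of the project.

```python
from itertools import product

# Labels derived from the table; the identifiers themselves are hypothetical.
ORGANISMS = ["bacteria", "plant", "mammal"]
READ_TYPES = ["short", "paired_short", "long", "paired_long"]


def enumerate_test_sets(organisms=ORGANISMS, read_types=READ_TYPES):
    """Return one name per (organism, read type) combination."""
    return [f"{org}-{rt}" for org, rt in product(organisms, read_types)]


if __name__ == "__main__":
    for name in enumerate_test_sets():
        print(name)
```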

Container

This generic software module converts a set of test sequences into the input file format required by a particular assembler. For each assembler, the container can run it with two parameter settings:

  1. default
  2. optimized for a particular de novo assembly task
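
As an illustration of the kind of conversion the container performs, the sketch below turns FASTQ records into FASTA using only the Python standard library. It assumes simple four-line FASTQ records (no wrapped sequences); the function name `fastq_to_fasta` is hypothetical, and this is not the project's actual converter.

```python
import io


def fastq_to_fasta(fastq, fasta):
    """Copy reads from a FASTQ text stream to a FASTA text stream.

    Assumes every record spans exactly four lines: '@' header, sequence,
    '+' separator, quality string. Returns the number of records written.
    """
    count = 0
    while True:
        header = fastq.readline()
        if header == "":          # end of input
            break
        header = header.strip()
        if not header:            # tolerate blank lines between records
            continue
        sequence = fastq.readline().strip()
        separator = fastq.readline().strip()
        fastq.readline()          # quality string, not needed for FASTA
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        fasta.write(f">{header[1:]}\n{sequence}\n")
        count += 1
    return count


if __name__ == "__main__":
    source = io.StringIO("@read1\nACGT\n+\nIIII\n")
    target = io.StringIO()
    fastq_to_fasta(source, target)
    print(target.getvalue(), end="")
```

Working on open text streams rather than file names keeps the function easy to reuse for files, pipes, or in-memory data.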

Assembly Analysis

In the analysis step, we look at the quality of the assembly and the system performance of the assembler. More details about the metrics used can be found at Assembly metrics explained.
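
One widely used assembly quality metric is N50: the length of the contig at which, going from longest to shortest, the running total first reaches half of the total assembled length. Whether N50 is among the metrics used in this project is an assumption here; the sketch below only illustrates how such a metric can be computed, and the function name `n50` is illustrative.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if it is empty)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # First contig at which the cumulative length covers half the assembly.
        if running >= half_total:
            return length
    return 0


if __name__ == "__main__":
    # Total length is 290; 80 + 70 = 150 is the first sum to reach 145.
    print(n50([80, 70, 50, 40, 30, 20]))  # 70
```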

Report

For each evaluated assembler, the report will contain 12 × 2 sets of results: the combination of 12 sets of test sequences and 2 types of parameter settings.
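
The layout of such a report can be sketched as a table with one row per pairing of test set and parameter setting. The example below is a minimal sketch that builds an empty CSV skeleton; the column names and the function `report_skeleton` are assumptions made for illustration, not the project's report format.

```python
import csv
import io
from itertools import product


def report_skeleton(test_sets, settings=("default", "optimized")):
    """Return CSV text with one empty result row per (test set, setting) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["test_set", "setting", "result"])
    for test_set, setting in product(test_sets, settings):
        writer.writerow([test_set, setting, ""])  # result is filled in later
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical test-set names.
    print(report_skeleton(["bacteria-short", "bacteria-long"]), end="")
```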

Figure: De novo assembly software evaluation pipeline

Planning and Milestones

Phase 1

In phase 1, we will examine the following software on E. coli data:

  • Long reads: Newbler, CABOG, CLC, MIRA3
  • Short reads: Velvet, SOAPdenovo, ABySS, CLC


Interim results are posted at https://wiki.nbic.nl/index.php/Raw_results_of_NGS_de_novo_assembly

We aim to finish this by 15 April so that some results can be presented at the NGS meeting on 16 April.

Phase 2 (middle term)

  • We will start creating a generic evaluation pipeline in Galaxy
  • More software will be tested with other test sequences

Long term

  • We plan to create several de novo assembly pipelines in Galaxy, each recommended for a particular kind of dataset.

Discussions

NGS De Novo Discussions

Deliverables

Notes

  1. This read set is also used in Simpson et al., "ABySS: A parallel assembler for short read sequence data"