Benchmarking De Novo Assembly Software
This project aims to benchmark the existing de novo assemblers systematically. One of the next steps is to collaborate with another NBIC NGS effort, which benchmarks alignment software, to create an automated benchmarking framework.
We plan to create and use six sets of test sequences, drawn from the combinations of 3 organism types and 4 read types. A set of test sequences is not an entire genome but a single chromosome or part of one.
|Read type||Bacteria (E. coli, ~4.6 Mbp)||Plant (Arabidopsis thaliana, ~157 Mbp)||Mammal (human, mouse, ~3 Gbp)|
|Short read (36 bp, Illumina, FASTA/Q)||ENTREZ SRX000429|| || |
|Paired short read (36 bp, Illumina, FASTA/Q)||ENTREZ SRX000429|| || |
|Long read (~200 bp, Roche 454, SFF+Qual)|| || || |
|Paired long read (~200 bp, Roche 454, SFF+Qual)|| || || |
This generic software module converts a test sequence into the input file format required by a particular assembler. For each assembler, the container can run it under two parameter settings:
- optimized for a particular de novo assembly task
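The format-conversion step described above can be illustrated with a minimal sketch that turns FASTQ reads into FASTA. It is not the project's container code: the function names `parse_fastq` and `fastq_to_fasta` are hypothetical, and the sketch assumes the common four-line FASTQ layout (sequence and quality each on a single line).

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records.

    Assumption: sequence and quality strings are not wrapped over
    several lines.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        rest = [next(it, None) for _ in range(3)]
        if None in rest:
            raise ValueError(f"truncated FASTQ record: {header!r}")
        seq, plus, qual = (s.rstrip("\r\n") for s in rest)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield header[1:], seq, qual


def fastq_to_fasta(lines: Iterable[str], width: int = 60) -> str:
    """Return FASTA text for the given FASTQ lines, dropping qualities."""
    out = []
    for read_id, seq, _qual in parse_fastq(lines):
        out.append(f">{read_id}")
        # Wrap long sequences at `width` characters per line.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)


if __name__ == "__main__":
    example = ["@read1\n", "ACGTACGT\n", "+\n", "IIIIIIII\n"]
    print(fastq_to_fasta(example), end="")
```

A real container would need one such converter per input format (for example SFF+Qual for 454 reads), which this sketch does not cover.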
In the analysis step, we look at the quality of the assembly and the system performance of the assembler. More details about the metrics used can be found at Assembly metrics explained.
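As an illustration of an assembly-quality metric, the sketch below computes N50, a widely used contiguity statistic, from a list of contig lengths. Whether N50 is among the metrics on the linked page is an assumption; the function names `n50` and `assembly_summary` are hypothetical.

```python
from typing import Dict, Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of the given contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    if total == 0:
        return 0  # empty assembly
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length
    return 0  # not reached for positive lengths


def assembly_summary(contig_lengths: Iterable[int]) -> Dict[str, int]:
    """Collect a few basic contiguity statistics in one dictionary."""
    lengths = list(contig_lengths)
    return {
        "num_contigs": len(lengths),
        "total_length": sum(lengths),
        "largest_contig": max(lengths, default=0),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    print(assembly_summary([2, 2, 2, 3, 3, 4, 8, 8]))
```

Contiguity statistics such as N50 say nothing about correctness, so they are usually read alongside accuracy measures against a reference sequence.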
For each evaluated assembler, the report will contain 6*2 = 12 sets of results: the combination of 6 sets of test sequences and 2 types of parameter settings.
Planning and Milestones
In phase 1, we will examine the following software for E. coli data:
- Long reads: Newbler, CABOG, CLC, MIRA3
- Short reads: Velvet, SOAPdenovo, ABySS, CLC
Interim results are posted as they become available at https://wiki.nbic.nl/index.php/Raw_results_of_NGS_de_novo_assembly
We aim to finish this by 15 April so that some results can be presented at the NGS meeting on 16 April.
Phase 2 (medium term)
- We will start creating a generic evaluation pipeline in Galaxy
- More software will be tested with other test sequences
- We plan to create several de novo assembly pipelines in Galaxy. Each of them is recommended for certain datasets.
- The SRX000429 read set is also used in Simpson et al., "ABySS: A parallel assembler for short read sequence data"