Benchmarking Alignment & Variant Calling Software

From BioAssist

The large volume of NGS data poses new challenges for the design of bioinformatics tools. Frequent updates and scarce documented experience with the different tools make selecting and running a tool a difficult task in itself. This project addresses these problems by benchmarking the software to obtain an objective comparison. One major distinction from similar benchmarking efforts is that we provide an automated benchmarking framework. Using this framework (or "wrapper", as we call it in NBIC Galaxy), it should be easy to repeat earlier benchmarks or to benchmark new alignment software.

Letter-space

The benchmarking wrapper can wrap any alignment program and run it against a standard test dataset. The test dataset is a set of sequence reads with known mapping positions and known SNP/indel information. The wrapper monitors system performance and analyzes alignment quality. The output is a standard evaluation report. The wrapper is developed as a Galaxy module. The ultimate goal is an automated benchmarking procedure for existing and future alignment programs.
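
The monitoring step can be illustrated with a minimal Python sketch. It is not taken from the actual wrapper source; it only shows one way a wrapper could run an aligner as a child process and record its CPU time and peak memory on a Unix system. The function name run_and_measure and the Return_Code and Wall_Time keys are hypothetical; CPU_Time and Peak_Mem follow the field names of the sample report on this page.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(argv):
    """Run a command and measure its resource usage (Unix only).

    Hypothetical helper for illustration, not the wrapper's real code.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.time()
    returncode = subprocess.call(argv)
    wall_seconds = time.time() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    # CPU time (user + system) consumed by child processes during the run.
    cpu_seconds = ((after.ru_utime - before.ru_utime)
                   + (after.ru_stime - before.ru_stime))

    # ru_maxrss is a high-water mark over all children waited for so far,
    # reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    peak_mb = after.ru_maxrss / divisor

    return {
        "Return_Code": returncode,
        "CPU_Time": cpu_seconds,    # seconds
        "Wall_Time": wall_seconds,  # seconds
        "Peak_Mem": peak_mb,        # MB
    }


if __name__ == "__main__":
    # A trivial Python child process stands in for a real aligner command.
    print(run_and_measure([sys.executable, "-c", "print('stand-in for an aligner')"]))
```

Because ru_maxrss is a maximum over all finished child processes, this sketch gives a meaningful peak only when each aligner is measured from a fresh wrapper process.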

Source Code

The source code can be downloaded at https://trac.nbic.nl/galaxytools/browser/trunk/benchmark_alignment. We will also install it on the NBIC Galaxy server soon.

Evaluation protocol

The following information can be collected to evaluate alignment software:

  • Numbers of mapped, unmapped, and uniquely mapped reads
  • Coverage (any kind of coverage information), e.g., mapping depth: the average number of aligned reads covering each base of the reference genome/genes.
  • System information (e.g., peak memory and CPU time)
  • Called variants, including true positives, true negatives, false positives, and false negatives.
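
As an illustration of how the variant counts and the mapping depth could be derived, here is a small Python sketch. It assumes that each variant is represented as a (chromosome, position, reference allele, alternative allele) tuple and that a call counts as correct only on an exact match with the known set; both are simplifying assumptions, and the function names are hypothetical. True negatives are not counted, because that would also require the list of all sites that were assessed.

```python
def classify_variants(called, truth):
    """Count true positives, false positives, and false negatives.

    Variants are compared by exact match of their tuples (an assumption).
    """
    called = set(called)
    truth = set(truth)
    return {
        "TP": len(called & truth),  # called and present in the known set
        "FP": len(called - truth),  # called but not in the known set
        "FN": len(truth - called),  # in the known set but not called
    }


def mean_depth(aligned_lengths, reference_length):
    """Average depth: total aligned bases divided by the reference length."""
    return sum(aligned_lengths) / float(reference_length)


# Made-up example data, for illustration only.
truth = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "GA")]
called = [("chr1", 100, "A", "G"), ("chr2", 40, "G", "GA"), ("chr2", 90, "T", "C")]

print(classify_variants(called, truth))   # TP=2, FP=1, FN=1
print(mean_depth([50, 50, 50, 50], 100))  # 2.0
```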

Sample evaluation report:

Accuracy=0.875 (float)
Average_Coverage=32.4 (float)
Nr_Correctly_Mapped_Reads=1234567 (integer)
Nr_Wrongly_Mapped_Reads=123 (integer)
Nr_Unmapped_Reads=12345 (integer)
TP_SNP=150 (integer)
FP_SNP=20 (integer)
FN_SNP=10 (integer)
MAX_InDel_Size=3 (integer) ?
TP_InDel=100 (integer)
FP_InDel=30 (integer)
FN_InDel=20 (integer)
Peak_Mem=2048 (integer) (MB)
CPU_Time=7200 (integer) (Seconds)
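
A report in this format is easy to process further. The Python sketch below parses the key=value lines of the sample report and derives precision and recall for the SNP calls. The parser and the derived metrics are illustrative assumptions rather than part of the wrapper; in particular, the page does not define how the Accuracy field is computed, so the sketch does not try to reproduce it.

```python
import re

# One report line: Name=value (float|integer), optionally followed by a unit.
REPORT_LINE = re.compile(r"^(\w+)=(\S+)\s+\((float|integer)\)")


def parse_report(text):
    """Parse report lines into a dict of typed values (hypothetical helper)."""
    values = {}
    for line in text.splitlines():
        match = REPORT_LINE.match(line.strip())
        if not match:
            continue  # skip blank or unrecognized lines
        key, raw, kind = match.groups()
        values[key] = float(raw) if kind == "float" else int(raw)
    return values


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return precision, recall


sample = """\
TP_SNP=150 (integer)
FP_SNP=20 (integer)
FN_SNP=10 (integer)
Peak_Mem=2048 (integer) (MB)
"""

report = parse_report(sample)
precision, recall = precision_recall(
    report["TP_SNP"], report["FP_SNP"], report["FN_SNP"])
print("SNP precision: %.3f, recall: %.3f" % (precision, recall))
```

With the sample values above, precision is 150/170 and recall is 150/160.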

Version 1

The aim of version 1 is to validate the design and produce a proof-of-concept demo in Galaxy. (Done)

Version 2

The aim of version 2 is to use the advanced simulated dataset and to include the benchmarking of SNP/indel callers. (Planned)

Meetings

Next Generation Sequencing: Alignment WP Discussion Notes


Color-space

Frans Ruzius benchmarked several color-space alignment tools, i.e., tools for SOLiD reads.

  • SOCS: Program too simple
  • BFAST: Needs a lot of resources, difficult
  • Mapreads: Not so accurate based on our test sequences.
  • Maq: Relatively fast, pseudo base alignment, less accurate
  • SHRiMP: Relatively slow, true color-space aligner, more accurate

More details can be found in the presentation and the report.

On BWT-based software

Tools like Bowtie, BWA, and SOAPaligner/soap2 can perform alignment much faster than older hash-table-based tools (e.g., Maq, ELAND, SHRiMP). We will carry out a performance study of these tools in the near future.

  • Bowtie: all read input formats (FASTA, FASTQ, raw, and tab-delimited) can be in color space when the colorspace flag (-C) is given. A color-space index can be built with the same colorspace flag (-C).