Next Generation Sequencing: Alignment WP Discussion Notes

From BioAssist

20 April

Discussions:

  1. We agreed to implement an alignment evaluation/benchmarking module in Galaxy. This module is a wrapper that invokes a particular alignment tool (e.g. Bowtie) and outputs an evaluation report.
  2. A standard Illumina benchmarking dataset will be defined.
  3. The first version of the wrapper will be delivered at the end of May.
  4. Freerk van Dijk (UMCG), Frans Paul Ruzius (Hubrecht) and Leon Mei (NBIC) will work jointly on this.
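
To illustrate the wrapper idea in point 1, here is a minimal Python sketch. It only shows the general shape (run one command, then write a small tab-separated report). The function name `run_and_report` and the report fields are assumptions made for this example, not the design agreed in the meeting, and a stand-in command is used so that no aligner has to be installed.

```python
import subprocess
import sys
import time


def run_and_report(tool_name, command):
    """Run one aligner command and return a small evaluation report as text."""
    start = time.perf_counter()
    result = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    lines = [
        f"tool\t{tool_name}",
        f"command\t{' '.join(command)}",
        f"exit_code\t{result.returncode}",
        f"wall_seconds\t{elapsed:.3f}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Stand-in command (hypothetical): replace with a real aligner call.
    demo_command = [sys.executable, "-c", "print('aligned')"]
    print(run_and_report("demo", demo_command), end="")
```

A real wrapper would add alignment quality metrics to the report. This sketch records only the exit code and wall-clock time.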

New action points:

  1. Freerk will check which dataset to use and which reference to use.
  2. Freerk will check the timeslot to run the wrapper on UMCG's Galaxy server.
  3. Leon will talk with Morris about the possibility of sharing results and publications, and about future collaboration.
  4. Freerk will talk with Morris about working on the Galaxy wrapper and whether it fits into his thesis report.
  5. Frans will create a wiki page on the evaluation protocol, based on the benchmarking he performed on the SOLiD aligners. Once the page is added, we can start improving it together.
  6. Leon will work on a draft time plan for the coming weeks and send across before our next meeting.
  7. Our next Skype meeting will start at 11am instead of 10am, on Tuesday, 27 April. Let's keep it as a regular weekly meeting.

27 April

Discussions:

  1. Morris agreed to a one-week time budget for Freerk. More is negotiable. We will first try to finish the wrapper within Freerk's budget.
  2. The NBIC Galaxy server is being set up, so part of the benchmarking can be done on the NBIC server.
  3. Bowtie, BWA, MAQ are selected as the testing tools for the evaluation wrapper.
  4. Frans has a real benchmarking dataset in colorspace: 125K bases, from Chrom

Unfinished action points:

  1. Freerk will check which dataset to use and which reference to use.
  2. Freerk will check the timeslot to run the wrapper on UMCG's Galaxy server.
  3. Leon will work on a draft time plan for the coming weeks and send across before our next meeting.

New action points:

  1. Freerk will estimate how many days are required to implement the wrapper, based on input from Frans and Leon.
  2. Freerk will update Morris on our discussion and see if a meeting with Morris is needed.
  3. Frans will create a simulated Illumina dataset and share it, together with the script, with Freerk.
  4. Leon will create a standard evaluation report format based on the evaluation metrics and send it across.
  5. Frans will create a parser to process SAM output into an evaluation report based on the format from Leon.
  6. Frans will check which methods in Perl can retrieve total CPU time, peak CPU usage, average memory usage and peak memory usage, and share this with Freerk.
  7. Frans will send the parameter setting information on Bowtie, BWA and MAQ to Freerk.
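
The profiler in point 6 is planned in Perl. As a rough illustration of the same idea, the sketch below uses Python's standard `resource` module (Unix only) to read the total CPU time and peak resident memory of a finished child process. The function name `profile_command` is an assumption for this example. Peak CPU usage and average memory usage are not available from `getrusage` and would need periodic sampling, which is not shown.

```python
import resource
import subprocess
import sys


def profile_command(command):
    """Run a command to completion and report CPU time and peak memory."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    completed = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user_seconds = after.ru_utime - before.ru_utime
    system_seconds = after.ru_stime - before.ru_stime
    return {
        "exit_code": completed.returncode,
        "cpu_seconds": user_seconds + system_seconds,
        # Largest peak seen over all waited-for children so far;
        # the unit is platform dependent (kilobytes on Linux).
        "peak_rss": after.ru_maxrss,
    }


if __name__ == "__main__":
    # Stand-in workload (hypothetical): replace with a real aligner call.
    print(profile_command([sys.executable, "-c", "sum(range(1000000))"]))
```

Because `RUSAGE_CHILDREN` accumulates over every child that has been waited for, `peak_rss` is only attributable to one tool if each tool is profiled in a fresh process.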

4 May

Discussions:

  1. We will use human genome build 36 as the reference genome. It has all known SNPs. Frans will provide the simulated testing dataset. Which real dataset to use is still unclear.
  2. Implementation of the evaluation wrapper is in progress. MAQ will be tested first; creating the index file for MAQ is ongoing.


Unfinished action points:

  1. Leon will work on a draft time plan for the coming weeks and send across before our next meeting.
  2. Freerk will update Morris on our discussion and see if a meeting with Morris is needed.
  3. Frans will create a simulated Illumina dataset and share it, together with the script, with Freerk.
  4. Frans will create a parser to process SAM output into an evaluation report based on the format from Leon.
  5. Frans will send the parameter setting information on Bowtie, BWA and MAQ to Freerk.

New action points:

  1. Frans will try to get a real human Illumina dataset (with known SNPs and indels) and send it to Freerk this Friday.
  2. Leon will create some sample code in Perl to retrieve total CPU time, peak CPU usage, average memory usage and peak memory usage and send it to Freerk this Friday.

11 May

Agenda:

  1. Action points
  2. Testing data set (simulated data set is used in a similar evaluation work http://www.ncbi.nlm.nih.gov/pubmed/19636379)
  3. Changes of time plan of Freerk
  4. Overall time plans (7 June demo?)

Discussions:

  1. Freerk created the reference genome indexes (UCSC human builds 36 and 37) for BWA and Bowtie. The MAQ index is ongoing.
  2. Frans created a simulated dataset: Illumina, 2M reads, 50 bases per read, all unique, and uniformly generated from the entire human genome (NCBI build 36).
  3. FASTQ was agreed as the format for the testing dataset.
  4. Because a real dataset is difficult to obtain, we will use the simulated dataset for now and possibly shorten the evaluation report by removing SNP/indel detection.
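
As an illustration of the kind of simulated dataset described in point 2, the sketch below samples fixed-length reads uniformly from a reference string and writes them as FASTQ. It is not Frans's script. The read-naming scheme (true 1-based position stored in the read name), the constant quality string, and the function names are assumptions for this example. It covers a single forward-strand sequence without sequencing errors and does not enforce uniqueness.

```python
import random


def simulate_reads(reference, read_length, count, seed=0):
    """Yield (name, sequence, quality) tuples sampled uniformly from reference."""
    rng = random.Random(seed)
    last_start = len(reference) - read_length
    if last_start < 0:
        raise ValueError("reference is shorter than the read length")
    for index in range(count):
        start = rng.randint(0, last_start)  # 0-based, both bounds inclusive
        sequence = reference[start:start + read_length]
        # Hypothetical naming scheme: keep the true 1-based position in the name.
        name = f"sim_{index}_pos_{start + 1}"
        yield name, sequence, "I" * read_length


def to_fastq(reads):
    """Format reads as four-line FASTQ records."""
    return "".join(f"@{name}\n{seq}\n+\n{qual}\n" for name, seq, qual in reads)


if __name__ == "__main__":
    toy_reference = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCCGATCGATTACG"
    print(to_fastq(simulate_reads(toy_reference, 10, 3)), end="")
```

Keeping the true position in the read name lets a later evaluation step check each alignment without a separate answer file.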

Unfinished action points:

  1. Leon will work on a draft time plan for the coming weeks and send across before our next meeting.
  2. Frans will create a simulated Illumina dataset and share it, together with the script, with Freerk. Will be done by Wednesday, 12 May.
  3. Frans will create a parser to process SAM output into an evaluation report based on the format from Leon. Will be done by Monday, 16 May.

New action points:

  1. Leon will add a method to retrieve average/peak CPU usage into the Perl profiler and send it to Freerk.
  2. Leon will check with other NBIC staff on getting a real testing dataset.
  3. Freerk will continue implementing the Galaxy wrapper. He will check and discuss with Frans on dataset, parameter settings, and the SAM parser.
  4. Frans will check and update the format of evaluation report to reflect the use of simulated dataset.
  5. On 25 May, we will evaluate whether a running demo on 7 June is feasible.
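
With a simulated dataset, the SAM parser can score each alignment against the known origin of the read. The sketch below illustrates that idea and is not Frans's parser. It assumes a hypothetical read-naming scheme in which the true 1-based position is the last underscore-separated field of the read name (e.g. `sim_0_pos_100`), and it ignores the reference name, strand and secondary alignments.

```python
def evaluate_sam(sam_lines):
    """Count correct, incorrect and unmapped reads in SAM text lines."""
    counts = {"total": 0, "unmapped": 0, "correct": 0, "incorrect": 0}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        name, flag, pos = fields[0], int(fields[1]), int(fields[3])
        counts["total"] += 1
        if flag & 0x4:  # SAM flag bit 0x4 marks an unmapped read
            counts["unmapped"] += 1
            continue
        # Assumed naming scheme: true position is the last "_" field.
        true_pos = int(name.rsplit("_", 1)[1])
        if pos == true_pos:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    return counts


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.0",
        "sim_0_pos_100\t0\tchr1\t100\t37\t50M\t*\t0\t0\t*\t*",
        "sim_1_pos_200\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    ]
    print(evaluate_sam(example))
```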

18 May

Discussions:

  1. Frans had a problem with a network hard disk and part of the dataset was lost. He needs to rerun the script to create a new dataset: FASTQ, 76 bases per read, with correct positions.
  2. The FASTQ format is agreed: http://maq.sourceforge.net/fastq.shtml
  3. We discussed the evaluation report format. "Average_Coverage" does not seem relevant for the simulated dataset, so we can probably leave it out.
  4. We will aim to have all the necessary input modules and datasets ready and start the integration next week.
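
Since FASTQ is the agreed input format, a small reader can catch malformed test data before an aligner is run. The sketch below assumes each record is exactly four lines (header, sequence, separator, quality); the FASTQ format also permits sequences wrapped over several lines, which this example does not handle. The function name `read_fastq` is an assumption for this example.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    records = [line.rstrip("\n") for line in lines if line.strip()]
    if len(records) % 4 != 0:
        raise ValueError("FASTQ input is not a multiple of four lines")
    for i in range(0, len(records), 4):
        header, sequence, separator, quality = records[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record at non-blank line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], sequence, quality


if __name__ == "__main__":
    example = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
    for record in read_fastq(example):
        print(record)
```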

Unfinished action points:

  1. Frans will create a simulated Illumina dataset (76 bases per read, with corrected position information) and share it, together with the script, with Freerk. Will be done by Thursday, 20 May.
  2. Frans will create a parser to process SAM output into an evaluation report based on the format from Leon. Will be done by Thursday, 20 May.
  3. Freerk will continue implementing the Galaxy wrapper. He will check and discuss with Frans on dataset, parameter settings, and the SAM parser.
  4. On 25 May, we will evaluate whether a running demo on 7 June is feasible.

New action points:

  1. We will discuss the wrapper interface on 21 May.
  2. Frans will check the evaluation report format and update it if necessary during the implementation of the SAM parser.
  3. Freerk will check and test the Perl profiler.
  4. Freerk will check the list of required modules and time plans. He will try to make sure every important input is addressed so that we could start the integration next week.

1 June

Discussions:

  1. Freerk will deliver his thesis in two weeks, so he can't spend too much time on the wrapper.
  2. Frans is busy with the NGS work at Hubrecht. He finished the SAM parser, but the simulated dataset is not ready yet. Leon will follow up with Frans on this next week and see if he can help.
  3. We discussed the wrapper interface. We agreed on the following for the phase 1 implementation:
    • to remove the option of "default parameters"
    • show cross-tool common parameters, e.g. Seed Length, Number of mismatches/edit distance in the seed, Total number of mismatches in read. Default values will be displayed in the input boxes, but the user can change them.
    • hide the tool specific parameters in the wrapper interface and just use the default values for them.
  4. Freerk confirmed that all the modules required for the wrapper implementation are ready. He only needs to test the CPU/memory profiler, e.g. whether all process IDs can be detected. Leon will assist with this.
  5. The wrapper demo on 7 June is canceled. We will look for new demo opportunities, possibly at a BioAssist meeting or the NGS user meeting in July.
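
The phase 1 interface in point 3 amounts to a mapping from a few common parameters to each tool's own option. The sketch below shows one possible shape for that mapping. The flag letters are given as recalled from the Bowtie 1 and `bwa aln` manuals and should be checked against the installed versions. Bowtie is left without a total-mismatch flag here because its `-v` option is understood to be a separate mode from `-n`. MAQ is omitted. The values in `DEFAULTS` are placeholders for this example, not the tools' real defaults.

```python
# Assumed flag names; verify against each tool's manual before use.
COMMON_TO_FLAG = {
    "bowtie": {"seed_length": "-l", "seed_mismatches": "-n", "total_mismatches": None},
    "bwa": {"seed_length": "-l", "seed_mismatches": "-k", "total_mismatches": "-n"},
}

# Placeholder values shown in the input boxes (hypothetical).
DEFAULTS = {"seed_length": 28, "seed_mismatches": 2, "total_mismatches": 3}


def build_options(tool, user_values=None):
    """Translate common parameters into a tool-specific option list."""
    values = dict(DEFAULTS)
    values.update(user_values or {})
    options = []
    for name, flag in COMMON_TO_FLAG[tool].items():
        if flag is None:
            continue  # no direct equivalent assumed for this tool
        options += [flag, str(values[name])]
    return options


if __name__ == "__main__":
    print(build_options("bwa", {"seed_length": 32}))
    print(build_options("bowtie"))
```

Tool-specific parameters stay out of this table, which matches the decision to hide them and use each tool's defaults.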

Action points:

  1. Frans will create a simulated Illumina dataset (76 bases per read, with corrected position information) and share it, together with the script, with Freerk.
  2. Frans will create a small help file for the SAM parser and send both to Freerk and Leon before this Friday.
  3. Freerk will continue implementing the Galaxy wrapper. He will check and discuss with Frans on dataset, parameter settings, and the SAM parser.
  4. Freerk will check and test the Perl profiler.

15 June

Discussions:

  1. Frans created a simulated Illumina dataset. However, because the reads are short (76 bp), some can be mapped to multiple positions, which would distort the measured alignment accuracy. These reads are being identified using BLAST and removed. This procedure can be finished today.
  2. Frans improved a customized SNP and indel caller for SAM files. This took most of his time over the last few weeks. The script is also required by Hubrecht. Leon asked why an existing SNP/indel caller was not used for this. There is also a customized SNP caller in Freerk's group. It became clear that we could benchmark SNP/indel calling with these two customized callers against other publicly available tools.
  3. Freerk finished his thesis and will have his graduation on June 29th.
  4. Leon is on holiday from June 17th to June 24th. We will have our next Skype call on June 25 at 11am.
  5. Leon explained the visualization brainstorming session on July 28. Frans will check if he can attend. Freerk will collect some input from his group and send to Leon.
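
The removal of multi-mapping reads in point 1 is being done with BLAST. As a much simpler illustration of the same filtering step, the sketch below keeps only reads whose sequence occurs exactly once in a reference string. It uses exact forward-strand matching only, so unlike BLAST it will not find near-identical or reverse-complement hits. The function names are assumptions for this example.

```python
def count_occurrences(reference, sequence):
    """Count exact, possibly overlapping occurrences of sequence in reference."""
    count = 0
    start = reference.find(sequence)
    while start != -1:
        count += 1
        start = reference.find(sequence, start + 1)
    return count


def keep_unique(reads, reference):
    """Return only the (name, sequence) reads that occur once in reference."""
    return [
        (name, seq) for name, seq in reads
        if count_occurrences(reference, seq) == 1
    ]


if __name__ == "__main__":
    toy_reference = "AAACCCGGGTTTAAACCC"
    toy_reads = [("r1", "AAACCC"), ("r2", "CCGGGT"), ("r3", "GGGTTT")]
    print(keep_unique(toy_reads, toy_reference))
```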

Action points:

  1. Frans will deliver the simulated dataset.
  2. Frans will evaluate the effort of making an integrated script for producing this simulated dataset.
  3. Frans will deliver the SAM parser.
  4. Freerk will continue implementing the Galaxy wrapper. He will check and discuss with Frans on dataset, parameter settings, and the SAM parser.
  5. Freerk will check and test the Perl profiler.

28 June

Updates from Freerk by emails:

  1. Freerk implemented a first version of the Galaxy wrapper. It works and generates data, but Galaxy detects an error in the output. Freerk is working on a solution.
  2. Freerk had a look at the Perl profiler. However, he has not tested it yet because the Perl Proc library is missing. He is consulting the local system administrator on this.

5 July

Action points:

  1. Frans will deliver the simulated dataset and the SAM parser today. He is working on the manual at the moment.
  2. Freerk will send the Proc installation error message to Leon.
  3. Freerk and Leon will look at the Galaxy wrapper output error this Friday.

3 August

Agenda:

  1. Update on the evaluation wrapper: Integration, Perl profiler, test run with simulated dataset and dataset from Kai.
  2. Demo on August 20
  3. Holidays
  4. Simulated dataset script in collaboration with Kai.

Discussion points:

  1. Leon made an update in the Perl profiler to retrieve the correct process ID.
  2. The integration of the latest Perl profiler and the correct MAQ settings still need to be done.
  3. Freerk will be on holiday from August 5 till August 25.
  4. Frans and Leon have no holidays planned in the coming weeks, so they will take over some integration tasks from Freerk and aim for a demo on August 20th.

Action points:

  1. Leon will work on the MAQ settings.
  2. Leon will visit Frans to work on the wrapper integration on August 10, at 9:30.