• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..03-May-2022-

MakefileH A D17-Apr-2015413 2214

README.mdH A D17-Apr-20155.1 KiB13096

bfc.bibH A D17-Apr-20156.3 KiB223194

bfc.texH A D17-Apr-201515.9 KiB302267

bioinfo.clsH A D17-Apr-201528.6 KiB929791

bioinfo2.clsH A D17-Apr-201528.6 KiB929791

natbib.bstH A D17-Apr-201525.6 KiB1,2891,154

natbib.styH A D17-Apr-201534.2 KiB804800

supp.texH A D17-Apr-201511 KiB210179

README.md

1## Human Data
2
3Raw reads were downloaded from [BaseSpace][basespace] under the sample
4"NA12878-L7" of project "HiSeq X Ten: TruSeq Nano (4 replicates of NA12878)".
5There are 444,764,118 pairs of reads, averaged 150bp or so in length. The raw
6data are kept in two gzip'd files, one file for each read end. As some programs
7work with one input file only or do not recognize gzip'd input files, we
8may decompress the raw data or create one interleaved FASTQ.
9
10## Hardware and Software
11
12Fiona, BBMap and Bloocoo provide precompiled binaries. We compiled the rest of
13tools on a CentOS5 virtual machine. The executables are [available
14here][biobin].  We ran all the tools on a CentOS6 machine with 20 cores of
15Intel E5-2660 CPUs at 2.2GHz and 128GB RAM, and measured timing and peak memory
16with GNU time.
17
18## Error Correction Command Lines
19
20
21```sh
22# BBMap-34.38 (adding ecclimit=14 made little difference)
23ecc.sh -Xmx32g in=read1.fastq.gz in2=read2.fastq.gz out=ec.fq tmpdir=tmp threads=16 ecc=t aec=t k=31
24
25# BFC-r175
26bash -c "bfc -s 3g -k55 -t 16 <(seqtk mergepe read1.fq.gz read2.fq.gz) <(seqtk mergepe read1.fq.gz read2.fq.gz) | gzip -1 > ec.fq.gz"
27
28# BFC-kmc
29echo -e "read1.fq.gz\nread2.fq.gz" > list.txt
30kmc -k55 -m24 @list.txt reads.k55 tmp
31seqtk mergepe read1.fq.gz read2.fq.gz | bfc-kmc -t16 reads.k55 | gzip -1 > ec.fq.gz
32
33# BLESS-v0p23 (much faster than v0p17 or earlier)
34bless -read1 read1.fq -read2 read2.fq -kmerlength 55 -prefix out -smpthread 16 -max_mem 24 -notrim
35
36# Bloocoo-1.0.4
37Bloocoo -nb-cores 16 -file read12.fq.gz -kmer-size 31
38
39# Fermi2-r175; ropebwt2-r187
40seqtk mergepe read1.fq.gz read2.fq.gz | ropebwt2 -drq20 -x31 > index.fmd
41seqtk mergepe read1.fq.gz read2.fq.gz | fermi2 correct -t 16 -k 29 index.fmd /dev/stdin | gzip -1 > ec.fq.gz
42
43# Fiona-0.2.0 (killed due to large memory footprint)
44fiona -g 3000000000 --sequencing-technology illumina --no-final-trim-ns -nt 16 read12.fq ec.fq
45
46# Lighter-20150123
47lighter -K 31 3000000000 -r read1.fq.gz -r read2.fq.gz -t 16
48
49# Musket-1.1
50musket -k 27 3000000000 -p 16 -inorder -o ec.fq read12.fq
51
52# QuorUM-1.0.0
53quorum -s 12000000000 -t 16 -p quorum-k31 -k 31 read1.fq read2.fq
54
55# SGA-0.9.13
56sga preprocess -p 1 read1.fq.gz read2.fq.gz | gzip -1 > out.pe.fq.gz
57sga index -a ropebwt -t 16 --no-reverse out.pe.fq.gz
58sga correct -t 16 -k 55 --learn out.pe.fq.gz
59
60# Trowel-0.1.4.1 (killed due to large memory footprint)
61echo "read1.fq read2.fq" > list.txt
62trowel -t 16 -f list.txt -k 31 -ntr
63```
64
65## Assembly-related Command Lines
66
67```sh
68# Velvet-1.2.10 assembly
69velveth k61 61 -shortPaired -fmtAuto -interleaved SRR065390.fq
70velvetg k61 -exp_cov auto -scaffolding yes -cov_cutoff auto -ins_length 250
71
72# ABySS-1.5.2 assembly
73abyss-pe name=k67 k=67 in=SRR065390.fq q=0 s=500 n=5
74
75# Fermikit-0.9 assembly (option -E to skip error correction)
76fermi2.pl unitig -Es100m -t16 SRR065390.fq > SRR065390.mak; make -f SRR065390.mak
77
78# Contig-to-reference mapping (reference version 235)
79bwa mem -x intractg -t 8 celegans.fa contigs.mag.gz > contigs.sam
80
81# Aligned-N50 and break points (htsbox binary included in fermikit-0.9)
82htsbox abreak -l200 contigs.sam
83
84# Variant calling from short reads with freebayes-0.9.20
85freebayes --experimental-gls -f celegans.fa reads.bam
86
87# Variant calling from contigs
88htsbox pileup -cuf celegans.fa contigs.bam > contigs.vcf
89
90# Variant filtering
91k8 hapdip.js deovlp contigs.vcf | k8 hapdip.js anno > contigs.anno
92k8 hapdip.js filter -DF36 -q3 contigs.anno > contigs.flt
93
94# Comparison to freebayes calls (all.bed is a BED including whole genome regions)
95k8 hapdip.js distEval -b all.bed freebayes.raw.vcf contigs.vcf
96```
97
98
99## Comments
100
1011. Early versions of BLESS were single-threaded and slow ([Molnar and Ilie,
102   2014][review]), and did not work with unequal read lengths. More recent
103   versions are much better.
104
1052. Lighter-1.0.4 is slow on gzip'd input, so I raised issues [#13][i13] and
106   [#14][i14]. The version checked out on 2015-01-23 and compiled with
107   the latest zlib is much more performant on gzip'd input.
108
1093. In addition to tools shown in the table, we have also tried
110   [Trowel-0.1.4.1][trowel], [Fiona-0.2.0][fiona] and
111   [AllPathsLG-51828][allpath]. They were taking over 110GB RAM and got
112   manually killed. [Coral][coral], [HiTEC][hitec] and [SHREC][shrec] are
113   unable to work with data at this scale according to a few publications.
114   [RACER][racer] requires more RAM than the total read bases (>120GB)
115   [according to][review] the developers. These tools are not tested.
116
117[basespace]: https://basespace.illumina.com/datacentral
118[biobin]: https://sourceforge.net/projects/biobin/
119[review]: http://bib.oxfordjournals.org/content/early/2014/09/01/bib.bbu029
120[trowel]: https://sourceforge.net/projects/trowel-ec/
121[fiona]: http://www.seqan.de/projects/fiona/
122[allpath]: http://www.broadinstitute.org/software/allpaths-lg/blog/
123[racer]: http://www.csd.uwo.ca/~ilie/RACER/
124[time]: https://ftp.gnu.org/gnu/time/
125[coral]: http://www.cs.helsinki.fi/u/lmsalmel/coral/
126[hitec]: http://www.csd.uwo.ca/~ilie/HiTEC/
127[shrec]: https://sourceforge.net/projects/shrec-ec/
128[i13]: https://github.com/mourisl/Lighter/issues/13
129[i14]: https://github.com/mourisl/Lighter/issues/14
130