1#dDocent Filtering Tutorial
2Designed by Jon Puritz
3
4# GOALS
51. To learn how to use VCFtools to filter a VCF file for missing data, genotype depth, locus quality score, minor allele frequency, and genotype call depth
62. To learn how to use vcflib to filter FreeBayes VCF files generated with RAD data
73. To filter a VCF file for HWE within populations
84. How to decompose a VCF into SNPs and INDELs and
95. How to use a haplotyping script to further filter SNPs for paralogs and genotyping errors.
10
11***
12
13# Tutorial
14Welcome to the SNP filtering exercise.  For the first part of the exercise, the filtering steps should work on almost any VCF file.
15For the second part of the exercise, we are going to assume you are working with a VCF file that was generated by
16FreeBayes.  Note that other SNP callers can be configured to include the same annotations.
17Let's find our way back to your original working directory and make a new filtering directory
18
19```bash
20mkdir filtering
21cd filtering
22```
23
24Now, let's download some data to look at.
25
26```bash
27curl -L -o data.zip https://www.dropbox.com/sh/bf9jxviaoq57s5v/AAD2Kv5SPpHlZ7LC7sBz4va8a?dl=1
28unzip data.zip
29ll
30```
31
32```bash
33total 165620
34-rwxr--r--. 1 jpuritz users   109633 Mar  6 14:56 BR_004-RG.bam
35-rwxr--r--. 1 jpuritz users   247496 Mar  6 14:57 BR_004-RG.bam.bai
36-rwxr--r--. 1 jpuritz users   120045 Mar  6 15:14 BR_006-RG.bam
37-rwxr--r--. 1 jpuritz users   247712 Mar  6 15:14 BR_006-RG.bam.bai
38-rw-r--r--. 1 jpuritz users 78979977 Mar  6 16:15 data.zip
39drwxr-xr-x. 2 jpuritz users       21 Mar  6 16:16 __MACOSX
40-rwxr--r--. 1 jpuritz users      399 Mar  6 15:08 popmap
41-rwxr--r--. 1 jpuritz users 68264393 Mar  6 13:40 raw.vcf.gz
42-rwxr--r--. 1 jpuritz users  6804314 Mar  6 14:49 reference.fasta
43-rwxr--r--. 1 jpuritz users  1085387 Mar  6 14:49 reference.fasta.amb
44-rwxr--r--. 1 jpuritz users  1068884 Mar  6 14:49 reference.fasta.ann
45-rwxr--r--. 1 jpuritz users  6379720 Mar  6 14:49 reference.fasta.bwt
46-rwxr--r--. 1 jpuritz users   353544 Mar  6 14:49 reference.fasta.clstr
47-rwxr--r--. 1 jpuritz users   976388 Mar  6 14:49 reference.fasta.fai
48-rwxr--r--. 1 jpuritz users  1594913 Mar  6 14:49 reference.fasta.pac
49-rwxr--r--. 1 jpuritz users  3189872 Mar  6 14:49 reference.fasta.sa
50-rwxr--r--. 1 jpuritz users   137209 Mar  6 14:30 stats.out
51```
52
53To start, we are going to use the program VCFtools (http://vcftools.sourceforge.net) to filter our vcf file.  This program has a binary executable
54and has several perl scripts as well that are useful for filtering.
55I find it much more useful to use version 0.1.11, since it has more useful filtering commands (I think).  Let's load that version
56This raw.vcf file is going to have a lot of erroneous variant calls and a lot of variants that are only present in one individual.
57To make this file more manageable, let's start by applying three step filter.  We are going to only keep variants that have been successfully genotyped in
5850% of individuals, a minimum quality score of 30, and a minor allele count of 3.
59
60```
61vcftools --gzvcf raw.vcf.gz --max-missing 0.5 --mac 3 --minQ 30 --recode --recode-INFO-all --out raw.g5mac3
62```
63In this code, we call vcftools, feed it a vcf file after the `--vcf` flag, `--max-missing 0.5` tells it to filter genotypes called below 50% (across all individuals)
64the `--mac 3` flag tells it to filter SNPs that have a minor allele count less than 3.
65This is relative to genotypes, so it has to be called in at least 1 homozygote and 1 heterozygote or 3 heterozygotes.
66The `--recode` flag tells the program to write a new vcf file with the filters, `--recode-INFO-all` keeps all the INFO flags from the old vcf file in the new one.
67Lastly, `--out` designates the name of the output
68The output will scroll through a lot of lines, but should end like:
69
70```
71After filtering, kept 40 out of 40 Individuals
72After filtering, kept 78434 out of a possible 147540 Sites
73Outputting VCF file... Done
74Run Time = 40.00 seconds
75```
76
77Those two simple filters got rid of 50% of the data and will make the next filtering steps run much faster.
78
79We now have a filtered VCF called raw.g5mac3.recode.vcf.  There is also a logfile generated called raw.g5mac3.log
80The next filter we will apply is a minimum depth for a genotype call and a minimum mean depth
81
82```bash
83vcftools --vcf raw.g5mac3.recode.vcf --minDP 3 --recode --recode-INFO-all --out raw.g5mac3dp3
84```
85This command will recode genotypes that have less than 3 reads.
86I'll give you a second to take a deep breath.
87Yes, we are keeping genotypes with as few as 3 reads.  We talked about this in the lecture portion of this course, but the short answer is that
88sophisticated multisample variant callers like FreeBayes and GATK can confidently call genotypes with few reads because variants are assessed across all
89samples simultaneously.  So, the genotype is based on three reads AND prior information from all reads from all individuals.  Relax.  We will do plenty of other filtering steps!
90
91Don't believe me do you?  I've made a script to help evaluate the potential errors.
92
93```bash
94curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/ErrorCount.sh
95chmod +x ErrorCount.sh
96./ErrorCount.sh raw.g5mac3dp3.recode.vcf
97```
98
99```bash
100This script counts the number of potential genotyping errors due to low read depth
101It report a low range, based on a 50% binomial probability of observing the second allele in a heterozygote and a high range based on a 25% probability.
102Potential genotyping errors from genotypes from only 1 read range from 0 to 0.0
103Potential genotyping errors from genotypes from only 2 reads range from 0 to 0.0
104Potential genotyping errors from genotypes from only 3 reads range from 15986 to 53714.22
105Potential genotyping errors from genotypes from only 4 reads range from 6230 to 31502.04
106Potential genotyping errors from genotypes from only 5 reads range from 2493 to 18914
10740 number of individuals and 78434 equals 3137360 total genotypes
108Total genotypes not counting missing data 2380094
109Total potential error rate is between 0.0103815227466 and 0.0437504821238
110SCORCHED EARTH SCENARIO
111WHAT IF ALL LOW DEPTH HOMOZYGOTE GENOTYPES ARE ERRORS?????
112The total SCORCHED EARTH error rate is 0.129149100834.
113```
114Right now, the maximum error rate for our VCF file because of genotypes less than 5 reads is less than 5%.  See, nothing to worry about.
115
116The next step is to get rid of individuals that did not sequence well.  We can do this by assessing individual levels of missing data.
117
118```bash
119vcftools --vcf raw.g5mac3dp3.recode.vcf --missing-indv
120```
121This will create an output called out.imiss.  Let's examine it.
122
123```bash
124cat out.imiss
125```
126```
127INDV	N_DATA	N_GENOTYPES_FILTERED	N_MISS	F_MISS
128BR_002	78434	0	13063	0.166548
129BR_004	78434	0	16084	0.205064
130BR_006	78434	0	25029	0.319109
131BR_009	78434	0	30481	0.38862
132BR_013	78434	0	69317	0.883762
133BR_015	78434	0	8861	0.112974
134BR_016	78434	0	29789	0.379797
135BR_021	78434	0	17422	0.222123
136BR_023	78434	0	43913	0.559872
137BR_024	78434	0	24220	0.308795
138BR_025	78434	0	21998	0.280465
139BR_028	78434	0	26786	0.34151
140BR_030	78434	0	74724	0.952699
141BR_031	78434	0	26488	0.337711
142BR_040	78434	0	19492	0.248515
143BR_041	78434	0	17107	0.218107
144BR_043	78434	0	16384	0.208889
145BR_046	78434	0	28770	0.366805
146BR_047	78434	0	13258	0.169034
147BR_048	78434	0	24505	0.312428
148WL_031	78434	0	22566	0.287707
149WL_032	78434	0	22604	0.288191
150WL_054	78434	0	32902	0.419486
151WL_056	78434	0	34106	0.434837
152WL_057	78434	0	37556	0.478823
153WL_058	78434	0	31448	0.400949
154WL_061	78434	0	35671	0.45479
155WL_064	78434	0	47816	0.609634
156WL_066	78434	0	10062	0.128286
157WL_067	78434	0	47940	0.611215
158WL_069	78434	0	38260	0.487799
159WL_070	78434	0	21188	0.270138
160WL_071	78434	0	16692	0.212816
161WL_072	78434	0	46347	0.590904
162WL_076	78434	0	78178	0.996736
163WL_077	78434	0	55193	0.703687
164WL_078	78434	0	54400	0.693577
165WL_079	78434	0	19457	0.248068
166WL_080	78434	0	30076	0.383456
167WL_081	78434	0	30334	0.386746
168```
169You can see that some individuals have as high as 99.6% missing data.  We definitely want to filter those out.  Let's take a look at a histogram
170
171```bash
172mawk '!/IN/' out.imiss | cut -f5 > totalmissing
173gnuplot << \EOF
174set terminal dumb size 120, 30
175set autoscale
176unset label
177set title "Histogram of % missing data per individual"
178set ylabel "Number of Occurrences"
179set xlabel "% of missing data"
180#set yr [0:100000]
181binwidth=0.01
182bin(x,width)=width*floor(x/width) + binwidth/2.0
183plot 'totalmissing' using (bin($1,binwidth)):(1.0) smooth freq with boxes
184pause -1
185EOF
186```
187
188```bash
189                                       Histogram of % missing data per individual
190Number of Occurrences
191    3 ++----------+---------***---------***-----------+------------+-----------+-----------+-----------+----------++
192      +           +         * *         * *           +       'totalmissing' using (bin($1,binwidth)):(1.0) ****** +
193      |                     * *         * *                                                                        |
194      |                     * *         * *                                                                        |
195      |                     * *         * *                                                                        |
196      |                     * *         * *                                                                        |
197  2.5 ++                    * *         * *                                                                       ++
198      |                     * *         * *                                                                        |
199      |                     * *         * *                                                                        |
200      |                     * *         * *                                                                        |
201      |                     * *         * *                                                                        |
202      |                     * *         * *                                                                        |
203    2 ++   ***************  * ****      * *                                                                       ++
204      |    *    *  * **  *  * ** *      * *                                                                        |
205      |    *    *  * **  *  * ** *      * *                                                                        |
206      |    *    *  * **  *  * ** *      * *                                                                        |
207      |    *    *  * **  *  * ** *      * *                                                                        |
208      |    *    *  * **  *  * ** *      * *                                                                        |
209  1.5 ++   *    *  * **  *  * ** *      * *                                                                       ++
210      |    *    *  * **  *  * ** *      * *                                                                        |
211      |    *    *  * **  *  * ** *      * *                                                                        |
212      |    *    *  * **  *  * ** *      * *                                                                        |
213      |    *    *  * **  *  * ** *      * *                                                                        |
214      +    *    * +* **  *  * ** *      * *           +            +           +           +           +           +
215    1 +*************************************************************************************************************
216     0.1         0.2         0.3         0.4         0.5          0.6         0.7         0.8         0.9          1
217                                                    % of missing data
218```
219Most of the individuals have less than 0.5 missing data.  That is probably a good cutoff to use for the moment.
220
221Now we need to create a list of individuals with more than 50% missing data.  Anyone have any ideas?
222
223
224We can use mawk to do it.
225
226```bash
227mawk '$5 > 0.5' out.imiss | cut -f1 > lowDP.indv
228```
229Take a moment to think about what this code is doing.
230
231Now that we have a list of individuals to remove, we can feed that directly into VCFtools for filtering.
232
233```bash
234vcftools --vcf raw.g5mac3dp3.recode.vcf --remove lowDP.indv --recode --recode-INFO-all --out raw.g5mac3dplm
235```
236As you can see from the output, this removed 9 individuals.
237
238I've included a script called filter_missing_ind.sh that will automate this process for you in the future.  Try it out.
239
240```bash
241curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/filter_missing_ind.sh
242chmod +x filter_missing_ind.sh
243./filter_missing_ind.sh raw.g5mac3dp3.recode.vcf DP3g95maf05
244```
245The command always follows the structure of filter_missing_ind.sh vcf_to_filter name_prefix_for_new_vcf
246The script prints out a histogram like the one above and also calculates the 85% for missing data.
247Enter "no"
248Now that we have removed poor coverage individuals, we can restrict the data to variants called in a high percentage of individuals and filter by mean depth of genotypes
249
250```bash
251vcftools --vcf raw.g5mac3dplm.recode.vcf --max-missing 0.95 --maf 0.05 --recode --recode-INFO-all --out DP3g95maf05 --min-meanDP 20
252```
253This leaves us with about 12,754 loci in our filtered vcf file.
254
255This applied a genotype call rate (95%) across all individuals.  With two localities, this is sufficient, but when you have multiple localities being sampled
256You are also going to want to filter by a population specific call rate.  VCFtools won't calculate this directly, but it is an easy workaround.
257First we need a file to define localities (populations).  Most programs want the file to have two tab separated columns.  First with the sample name, second with population assignment.
258I've already made one for this exercise.
259
260```
261cat popmap
262```
263```
264BR_002	BR
265BR_004	BR
266BR_006	BR
267BR_009	BR
268BR_013	BR
269BR_015	BR
270BR_016	BR
271BR_021	BR
272BR_023	BR
273BR_024	BR
274BR_025	BR
275BR_028	BR
276BR_030	BR
277BR_031	BR
278BR_040	BR
279BR_041	BR
280BR_043	BR
281BR_046	BR
282BR_047	BR
283BR_048	BR
284WL_031	WL
285WL_032	WL
286WL_054	WL
287WL_056	WL
288WL_057	WL
289WL_058	WL
290WL_061	WL
291WL_064	WL
292WL_066	WL
293WL_067	WL
294WL_069	WL
295WL_070	WL
296WL_071	WL
297WL_072	WL
298WL_076	WL
299WL_077	WL
300WL_078	WL
301WL_079	WL
302WL_080	WL
303WL_081	WL
304```
305Now we need to create two lists that have just the individual names for each population
306
307```bash
308mawk '$2 == "BR"' popmap > 1.keep && mawk '$2 == "WL"' popmap > 2.keep
309```
310The above line demonstrates the use of && to simultaneous execute two tasks.
311
312Next, we use VCFtools to estimate missing data for loci in each population
313
314```bash
315vcftools --vcf DP3g95maf05.recode.vcf --keep 1.keep --missing-site --out 1
316vcftools --vcf DP3g95maf05.recode.vcf --keep 2.keep --missing-site --out 2
317```
318This will generate files named 1.lmiss and 2.lmiss
319
320They follow this format
321
322```
323head -3 1.lmiss
324```
325```bash
326CHR     POS     N_DATA  N_GENOTYPE_FILTERED     N_MISS  F_MISS
327E1_L101 9       34      0                       0       0
328E1_L101 15      34      0                       0       0
329```
330I added extra tabs to make this easier to read, but what we are interested in is that last column with is the percentage of missing data for that locus.
331We can combine the two files and make a list of loci about the threshold of 10% missing data to remove.  Note this is double the overall rate of missing data.
332
333```bash
334cat 1.lmiss 2.lmiss | mawk '!/CHR/' | mawk '$6 > 0.1' | cut -f1,2 >> badloci
335```
336Who can walk us through that line of code?
337
338We then feed this file back into VCFtools to remove any of the loci
339
340```bash
341vcftools --vcf DP3g95maf05.recode.vcf --exclude-positions badloci --recode --recode-INFO-all --out DP3g95p5maf05
342```
343
344Again, we only had two populations so our overall filter caught all of these.  However, this will not be the case in multi-locality studies
345I also have made a script to automate this process as well.  It's called pop_missing_filter.sh
346Executing it with no parameters will give you the usage.
347
348```bash
349curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/pop_missing_filter.sh
350chmod +x pop_missing_filter.sh
351./pop_missing_filter.sh
352```
353
354```
355Usage is pop_missing_filter vcffile popmap percent_missing_per_pop number_of_pops_for_cutoff name_for_output
356```
357***
358**From this point forward, the filtering steps assume that the vcf file was generated by FreeBayes**
359Note that other SNP callers can be configured to include the similar annotations.
360
361FreeBayes outputs a lot of information about a locus in the VCF file, using this information and the properties of RADseq, we add some sophisticated filters to the data.
362Let's take a look at the header of our VCF file and take a quick look at all the information.
363
364```bash
365mawk '/#/' DP3g95maf05.recode.vcf
366```
367This will output several lines of INFO tags, I have highlighted a few below:
368
369```
370INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
371INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
372INFO=<ID=QR,Number=1,Type=Integer,Description="Reference allele quality sum in phred">
373INFO=<ID=QA,Number=A,Type=Integer,Description="Alternate allele quality sum in phred">
374INFO=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
375INFO=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
376INFO=<ID=SAF,Number=A,Type=Integer,Description="Number of alternate observations on the forward strand">
377INFO=<ID=SAR,Number=A,Type=Integer,Description="Number of alternate observations on the reverse strand">
378INFO=<ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">
379INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
380INFO=<ID=CIGAR,Number=A,Type=String,Description="The extended CIGAR representation of each alternate allele, with the exception that '=' is replaced by 'M' to ease VCF parsing.  Note that INDEL alleles do not have the first matched base (which is provided by default, per the spec) referred to by the CIGAR.">
381INFO=<ID=MQM,Number=A,Type=Float,Description="Mean mapping quality of observed alternate alleles">
382INFO=<ID=MQMR,Number=1,Type=Float,Description="Mean mapping quality of observed reference alleles">
383INFO=<ID=PAIRED,Number=A,Type=Float,Description="Proportion of observed alternate alleles which are supported by properly paired read fragments">
384INFO=<ID=PAIREDR,Number=1,Type=Float,Description="Proportion of observed reference alleles which are supported by properly paired read fragments">
385FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
386FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
387FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
388FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
389FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
390FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
391FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
392FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
393```
394The first filter we will apply will be on allele balance.  Allele balance is:
395a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous
396Because RADseq targets specific locations of the genome, we expect that the allele balance in our data (for real loci) should be close to 0.5
397We can use the vcffilter program from vcflib.  (https://github.com/ekg/vcflib)
398Typing it with no parameters will give you the usage.
399
400```bash
401vcffilter
402```
403```
404usage: vcffilter [options] <vcf file>
405
406options:
407    -f, --info-filter     specifies a filter to apply to the info fields of records,
408                          removes alleles which do not pass the filter
409    -g, --genotype-filter specifies a filter to apply to the genotype fields of records
410    -k, --keep-info       used in conjunction with '-g', keeps variant info, but removes genotype
411    -s, --filter-sites    filter entire records, not just alleles
412    -t, --tag-pass        tag vcf records as positively filtered with this tag, print all records
413    -F, --tag-fail        tag vcf records as negatively filtered with this tag, print all records
414    -A, --append-filter   append the existing filter tag, don't just replace it
415    -a, --allele-tag      apply -t on a per-allele basis.  adds or sets the corresponding INFO field tag
416    -v, --invert          inverts the filter, e.g. grep -v
417    -o, --or              use logical OR instead of AND to combine filters
418    -r, --region          specify a region on which to target the filtering, requires a BGZF
419                          compressed file which has been indexed with tabix.  any number of
420                          regions may be specified.
421```
422Let's use our first filter
423
424```bash
425vcffilter -s -f "AB > 0.25 & AB < 0.75 | AB < 0.01" DP3g95p5maf05.recode.vcf > DP3g95p5maf05.fil1.vcf
426```
427vcffilter works with simple conditional statements, so this filters out loci with an allele balance below 0.25 and above 0.75.  However, it does include those that are close to zero.
428The last condition is to catch loci that are fixed variants (all individuals are homozygous for one of the two variants).
429The `-s` tells the filter to apply to sites, not just alleles
430To see how many loci are now in the VCF file, you could feed it into VCFtools or you can just use a simple mawk statement
431
432```bash
433mawk '!/#/' DP3g95p5maf05.recode.vcf | wc -l
43412754
435```
436
437```bash
438mawk '!/#/' DP3g95p5maf05.fil1.vcf | wc -l
4399678
440```
441You'll notice that we've filtered a lot of loci.  In my experience though, I find that most of these tend to be errors of some kind.  However, this will be data dependent.  I encourage you to explore your own data sets.
442
443The next filter we will apply filters out sites that have reads from both strands.  For GWAS and even RNAseq, this can be a good thing.
444Unless you are using super small genomic fragment or really long reads (MiSeq).  A SNP should be covered only by forward or only reverse reads.
445
446```bash
447vcffilter -f "SAF / SAR > 100 & SRF / SRR > 100 | SAR / SAF > 100 & SRR / SRF > 100" -s DP3g95p5maf05.fil1.vcf > DP3g95p5maf05.fil2.vcf
448```
449The filter is based on proportions, so that a few extraneous reads won't remove an entire locus.  In plain english, it's keeping loci that have over 100 times more forward alternate reads than reverse alternate reads and 100 times more forward reference reads than reverse reference reads along with the reciprocal.
450
451```bash
452mawk '!/#/' DP3g95p5maf05.fil2.vcf | wc -l
4539491
454```
455That only removes a small proportion of loci, but these loci are likely to be paralogs, microbe contamination, or weird PCR chimeras.
456
457The program SAMtools is a great way to visualize alignments right from the terminal.
458
459```bash
460samtools tview BR_006-RG.bam reference.fasta -p E28188_L151
461```
462
463```bash
464         11        21        31        41        51        61        71        81        91        101       111       121	 131
465AATTCTCAGAGCTAGAGTGGGGACGGCAGTTGGTAGAGGGTACAGCAGTTCTAAAAACATGTAGAAATTTTCTCTTCAACTCGCTCCTACGGCCACAGCGTTCACTCCACATACACAAATTGTACACCAAAACATAGGAAAAG
466...........S...........Y.K......S.........G.......K.........S............................Y........Y....W.........................M...G.........
467..........................................G.......G......................................T.....
468..........................................G.......T............................................
469...........G...........T.T......C.........G.................C...............................
470..........................................G.......T....................................G.......
471..........................................G.......T............................................
472...........G...........T.T......C.........G.................C..................................
473..........................................G.......T............................................
474...........G.................C............G.......G.................******........A...............T..
475                                            ,,,,,,g,,,,,,,,,,,,,,,,,******,,,,,,,,a,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,
476                                                  ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,g,,,,,,,,,
477                                                  t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
478                                                  ,,,,,,,,,,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,
479                                                  g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
480                                                  t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,
481                                                  g,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
482                                                  ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,g,,,,,,,,,
483                                                  ,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,g,,,,,,,,,
484                                                  t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
485                                                  g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,a,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
486                                                  g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
487                                                  ,,,,,,,,,,c,g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,
488                                                  g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
489                                                  t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
490                                                  g,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
491                                                  g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
492```
493As you can see this is a mess.  There appear to be  over two haplotypes here.
494For more info on how to use samtools tview press the question mark while you are in the window.
495
496The next filter looks at the ratio of mapping qualities between reference and alternate alleles
497
498```bash
499vcffilter -f "MQM / MQMR > 0.9 & MQM / MQMR < 1.05" DP3g95p5maf05.fil2.vcf > DP3g95p5maf05.fil3.vcf
500```
501The rationale here is that, again, because RADseq loci and alleles all should start from the same genomic location there should not be large discrepancy
502between the mapping qualities of two alleles.
503
504```bash
505mawk '!/#/' DP3g95p5maf05.fil3.vcf | wc -l
5069229
507```
508This filters away less than 3% of the variants, but likely need to be filtered.  Let's take a look at one
509
510```bash
511samtools tview BR_004-RG.bam reference.fasta -p E20_L173
512```
513
514```bash
5151         11        21        31        41        51        61        71        81        91        101       111	121	  131
516NAATTCATCTGTTGCAGGCAGCTCACACTTGCAGCCTCGGCTCGCACCAGCAGAGCAGCCGTAGAATACTTAGTTTAATAGAATGGCTTGGCATTTNNNNNNNNNNCATGAGGTTGTTATTCTCAGAAGACTAATCACAGACA
517 .......Y.........YM....WS...Y....S...R....R....................................................C         ....................G................
518 .......T.........TC....TG...C....G...A....A..................................                            ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,
519 .......T.........TC.....G...C....G...A....A..................................                            ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,
520 .......T.........TC....TG...C....G...A....A..................................                            ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,
521 .......T.........TC.....G...C....G...A....A..................................                            ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,
522 .......T.........TC.....G...C....G...A....A..................................                            ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,
523 .......T.........TC....TG...C....G...A....A..................................
524 .......T.........TC.....G...C....G...A....A..................................
525 .......T.........TC.....G...C....G...A....A..................................
526 .......T.........TC....TG...C....G...A....A..................................
527 .......T.........TC....TG...C....G...A....A..................................
528 ..........C....................................................................................C
529 .......T.........TC.....G...C....G...A....A..................................
530 ..........C....................................................................................C
531 .......T.........TC.....G...C....G...A....A..................................
532 ......GT.........TC.....G...C....G...A....A..................................
533 .......T.........TC....TG...C....G...A....A..................................
534```
535There is a large amount of clipping going on here for the variant alleles likely why the mapping quality is low for them.  You can also see that there are three different alleles present here.  Press SHIFT+L to scroll further down the alignment.  You can see that some of the polymorphism is also link to a cut site variant.  All things that should be avoided.
536
537Yet another filter that can be applied is whether or not their is a discrepancy in the properly paired status of for reads supporting reference or alternate alleles.
538
539```bash
540vcffilter -f "PAIRED > 0.05 & PAIREDR > 0.05 & PAIREDR / PAIRED < 1.75 & PAIREDR / PAIRED > 0.25 | PAIRED < 0.05 & PAIREDR < 0.05" -s DP3g95p5maf05.fil3.vcf > DP3g95p5maf05.fil4.vcf
541```
542Since de novo assembly is not perfect, some loci will only have unpaired reads mapping to them.  This is not a problem.  The problem occurs when all the reads supporting the reference allele
543are paired but not supporting the alternate allele.  That is indicative of a problem.
544
545```bash
546mawk '!/#/' DP3g95p5maf05.fil4.vcf | wc -l
5479166
548```
549Our loci count keeps dwindling, but our signal to noise ration keeps increasing.  Let's look at an example of what we filtered.
550
551```bash
552samtools tview BR_006-RG.bam reference.fasta -p E4407_L138
553```
554This output doesn't paste well to the terminal, but you can see the clear discrepancy between mapping status and allele status.  This could be indicative of cut site polymorphism or paralogs.
555The next filter we will apply is to look at the ration of locus quality score to depth
556Heng Li found some interesting results about how quality score and locus depth are related to each other in real and spurious variant calls
557See his preprint here (http://arxiv.org/pdf/1404.0929.pdf)
558Also see this great blog post about it here (http://bcb.io/2014/05/12/wgs-trio-variant-evaluation/)  I **REALLY** recommend following that blog.  Brad Chapman's group is really good.
559In short, with whole genome samples, it was found that high coverage can lead to inflated locus quality scores.  Heng proposed that for read depths greater than the mean depth plus 2-3 times
560the square root of mean depth that the quality score will be twice as large as the depth in real variants and below that value for false variants.
561I actually found that this is a little too conservative for RADseq data, likely because the reads aren't randomly distributed across contigs.  I implement two filters based on this idea.
562the first is removing any locus that has a quality score below 1/4 of the depth.
563
564```bash
565vcffilter -f "QUAL / DP > 0.25" DP3g95p5maf05.fil4.vcf > DP3g95p5maf05.fil5.vcf
566```
567The second is more complicated.  The first step is to create a list of the depth of each locus
568
569```bash
570cut -f8 DP3g95p5maf05.fil5.vcf | grep -oe "DP=[0-9]*" | sed -s 's/DP=//g' > DP3g95p5maf05.fil5.DEPTH
571```
572The second step is to create a list of quality scores.
573
574```bash
575mawk '!/#/' DP3g95p5maf05.fil5.vcf | cut -f1,2,6 > DP3g95p5maf05.fil5.vcf.loci.qual
576```
577Next step is to calculate the mean depth
578
579```bash
580mawk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }' DP3g95p5maf05.fil5.DEPTH
5811952.82
582```
583Now the the mean plus 3X the square of the mean
584
585```bash
586python -c "print int(1952+3*(1952**0.5))"
5872084
588```
589Next we paste the depth and quality files together and find the loci above the cutoff that do not have quality scores 2 times the depth
590
591```bash
592paste DP3g95p5maf05.fil5.vcf.loci.qual DP3g95p5maf05.fil5.DEPTH | mawk -v x=2084 '$4 > x' | mawk '$3 < 2 * $4' > DP3g95p5maf05.fil5.lowQDloci
593```
594Now we can remove those sites and recalculate the depth across loci with VCFtools
595
596```bash
597vcftools --vcf DP3g95p5maf05.fil5.vcf --site-depth --exclude-positions DP3g95p5maf05.fil5.lowQDloci --out DP3g95p5maf05.fil5
598```
599Now let's take VCFtools output and cut it to only the depth scores
600
601```bash
602cut -f3 DP3g95p5maf05.fil5.ldepth > DP3g95p5maf05.fil5.site.depth
603```
604Now let's calculate the average depth by dividing the above file by the number of individuals 31
605
606```bash
607mawk '!/D/' DP3g95p5maf05.fil5.site.depth | mawk -v x=31 '{print $1/x}' > meandepthpersite
608```
609Let's plot the data as a histogram
610
611```bash
612gnuplot << \EOF
613set terminal dumb size 120, 30
614set autoscale
615set xrange [10:150]
616unset label
617set title "Histogram of mean depth per site"
618set ylabel "Number of Occurrences"
619set xlabel "Mean Depth"
620binwidth=1
621bin(x,width)=width*floor(x/width) + binwidth/2.0
622set xtics 5
623plot 'meandepthpersite' using (bin($1,binwidth)):(1.0) smooth freq with boxes
624pause -1
625EOF
626```
627
628```bash
629                                              Histogram of mean depth per site
630Number of Occurrences
631  250 ++--+---+---+---+--+---+---+---+---+---+---+---+---+---+--+---+---+---+---+---+---+---+---+--+---+---+---+--++
632      +   +   +   +   +  +   +   +   +   +   +   +   +   +'meandepthpersite' using (bin($1,binwidth)):(1.0)+****** +
633      |               **    **                                                                                     |
634      |               ***   **                                                                                     |
635      |               ********                                                                                     |
636  200 ++              ********* *                                                                                 ++
637      |           ** ********** *                                                                                  |
638      |           ************* **  *                                                                              |
639      |           ****************  *  *                                                                           |
640      |           *******************  **                                                                          |
641  150 ++          ******************* ***                                                                         ++
642      |           *************************                                                                        |
643      |           ***************************                                                                      |
644      |       *******************************                                                                      |
645  100 ++      **********************************                                                                  ++
646      |       **********************************                                                                   |
647      |       ************************************** *                                                             |
648      |       ************************************** * **                                                          |
649      |       *********************************************                                                        |
650   50 ++      ************************************************                                                    ++
651      |       ************************************************   **   **                                           |
652      |       ***********************************************************  **                                      |
653      |       ****************************************************************  **  **                 **          |
654      +   +   ************************************************************************************** ******+********
655    0 ++--+---******************************************************************************************************
656      10  15  20  25  30 35  40  45  50  55  60  65  70  75  80 85  90  95 100 105 110 115 120 125130 135 140 145 150
657                                                        Mean Depth
658```
659Loci that have high mean depth are indicative of either paralogs or multicopy loci.  Either way we want to remove them.  Here, I'd remove all loci above a mean depth of 102.5.  Now we can combine both filters to produce another VCF file
660
661```bash
662vcftools --vcf  DP3g95p5maf05.fil5.vcf --recode-INFO-all --out DP3g95p5maf05.FIL --max-meanDP 102.5 --exclude-positions DP3g95p5maf05.fil5.lowQDloci --recode
663```
664In the end, VCFtools kept 8417 out of a possible 9164 Sites.
665BTW, I've also written a script to automate the filterings steps described in steps 23-44.  It's called dDocent_filters.  It will go through the filtering steps and
666recode a log file for you for each of the steps, including the depth histogram.
667
668```bash
669curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/dDocent_filters
670chmod +x dDocent_filters
671./dDocent_filters
672```
673```
674This script will automatically filter a FreeBayes generated VCF file using criteria related to site depth,
675quality versus depth, strand representation, allelic balance at heterzygous individuals, and paired read representation.
676The script assumes that loci and individuals with low call rates (or depth) have already been removed.
677
678Contact Jon Puritz (jpuritz@gmail.com) for questions and see script comments for more details on particular filters
679
680Usage is sh FB_filters.sh VCF_file Output_prefix
681```
682
683The next filter to apply is HWE.  Heng Li also found that HWE is another excellent filter to remove erroneous variant calls.
684We don't want to apply it across the board, since population structure will create departures from HWE as well.  We need to apply this by population.
685I've included a perl script written by Chris Hollenbeck, one of the PhD student's in my current lab that will do this for us.
686
687```bash
688curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/filter_hwe_by_pop.pl
689chmod +x filter_hwe_by_pop.pl
690```
691
692```bash
693./filter_hwe_by_pop.pl
694 Usage:
695     filter_hwe_by_pop.pl -v <vcffile> -p <popmap> [options]
696
697     Options: -v <vcffile> input vcf file -p <popmap> tab-separated file of
698     samples and population designations -h [hwe] minimum Hardy-Weinberg
699     p-value cutoff for SNPs -c [cutoff] proportion of all populations that a
700     locus can be below HWE cutoff without being filtered -o [out] name of
701     outfile
702
703 Options:
704     -v, --vcffile
705             VCF input file
706
707     -p, --popmap
708             File with names of individuals and population designations, one
709             per line
710
711     -h, --hwe
712             Minimum cutoff for Hardy-Weinberg p-value (for test as
713             implemented in vcftools) [Default: 0.001]
714
715     -c, --cutoff
716             Proportion of all populations that a locus can be below HWE
717             cutoff without being filtered. For example, choosing 0.5 will
718             filter SNPs that are below the p-value threshold in 50% or more
719             of the populations. [Default: 0.25]
720
721     -o, --out
722             Name of outfile, by vcftools conventions (will be named
723             X.recode.vcf)
724```
725Let's filter our SNPs by population specific HWE
726First, we need to convert our variant calls to SNPs
727To do this we will use another command from vcflib called vcfallelicprimatives
728
729```bash
730vcfallelicprimitives DP3g95p5maf05.FIL.recode.vcf --keep-info --keep-geno > DP3g95p5maf05.prim.vcf
731```
732This will decompose complex variant calls into phased SNP and INDEL genotypes and keep the INFO flags for loci and genotypes.
733Next, we can feed this VCF file into VCFtools to remove indels.
734
735```bash
736vcftools --vcf DP3g95p5maf05.prim.vcf --remove-indels --recode --recode-INFO-all --out SNP.DP3g95p5maf05
737```
738We now have 8379 SNP calls in our new VCF.
739Now, let's apply the HWE filter
740
741```bash
742./filter_hwe_by_pop.pl -v SNP.DP3g95p5maf05.recode.vcf -p popmap -o SNP.DP3g95p5maf05.HWE -h 0.01
743```
744
745```bash
746Processing population: BR (20 inds)
747Processing population: WL (20 inds)
748Outputting results of HWE test for filtered loci to 'filtered.hwe'
749Kept 8176 of a possible 8379 loci (filtered 203 loci)
750```
751*Note, I would not normally use such a high `-h` value.  It's purely for this example.  Typically, errors would have a low p-value and would be present in many populations.*
752
753We have now created a thoroughly filtered VCF, and we should have confidence in these SNP calls.
754However, our lab is currently developing one more script, called rad_haplotyper.
755This tool takes a VCF file of SNPs and will parse through BAM files looking to link SNPs into haplotypes along paired reads.
756```bash
757curl -L -O https://raw.githubusercontent.com/jpuritz/WinterSchool.2016/master/rad_haplotyper.pl
758chmod +x rad_haplotyper.pl
759
760```
761Note, this script requires several Perl libraries.  See the README [here](https://github.com/jpuritz/NTU-Workshop/blob/master/scripts/README.md)
762It has a lot of options, let's take a look
763
764```bash
765perl rad_haplotyper.pl
766Usage:
767    perl rad_haplotyper.pl -v <vcffile> -r <reference> [options]
768
769    Options: -v <vcffile> input vcf file
770
771             -r     <reference>             reference genome
772
773             -s     [samples]               optionally specify an individual sample to be haplotyped
774
775             -u     [snp_cutoff]            remove loci with more than a specified number of SNPs
776
777             -h     [hap_cutoff]            remove loci with more than a specified number of haplotypes relative to SNPs
778
779             -m     [miss_cutoff]           cutoff for proportion of missing data for loci to be included in the output
780
781             -mp    [max_paralog_inds]              cutoff for excluding possible paralogs
782
783             -ml    [max_low_cov_inds]              cutoff for excluding loci with low coverage or genotyping errors
784
785             -d     [depth]                 sampling depth used by the algorithm to build haplotypes
786
787             -g     [genepop]               genepop file for population output
788
789             -p     [popmap]                population map for organizing Genepop file
790
791             -t     [tsvfile]               tsv file for linkage map output
792
793             -a     [imafile]               IMa file output
794
795             -p1    [parent1]               first parent in the mapping cross
796
797             -p2    [parent2]               second parent in the mapping cross
798
799             -x     [threads]               number of threads to use for the analysis
800
801             -n                             use indels
802
803             -e                             debug
804
805Options:
806    -v, --vcffile
807            VCF input file
808
809    -r, --reference
810            Reference genome (FASTA format)
811
812    -s, --samples
813            Individual samples to use in the analysis - can be used multiple
814            times for multiple individuals [Default: All]
815
816    -u, --cutoff
817            Excludes loci with more than the specified number of SNPs
818            [Default: No filter]
819
820    -h, --hap_count
821            Excludes loci with more than the specified number of haplotypes
822            relative to number of SNPs. Excluding forces other than mutation
823            (i.e. recombination) the maximum number of haplotypes should be
824            one more than the number of SNPs at the locus. The value
825            provided is the number of haplotypes allowed in excess of the
826            number of SNPs, which allows that mechanisms other than mutation
827            may have influenced the number of haplotypes in the population.
828            [Default: 100]
829
830    -x, --threads
831            Run in parallel across individuals with a specified number of
832            threads
833
834    -n, --indels
835            Includes indels that are the only polymorphism at the locus
836            (tag)
837
838    -d, --depth
839            Specify a depth of sampling reads for building haplotypes
840            [Default: 20]
841
842    -m, --miss_cutoff
843            Proportion of missing data cutoff for removing loci from the
844            final output. For example, to keep only loci with successful
845            haplotype builds in 95% of individuals, enter 0.95. [Default:
846            0.9]
847
848    -mp, --max_paralog_inds
849            Count cutoff for removing loci that are possible paralogs from
850            the final output. The value is the maximum allowable number of
851            individuals with more than the expected number of haplotypes
852            [Default: No filter]
853
854    -ml, --max_low_cov_inds
855            Count cutoff for removing loci with low coverage or genotyping
856            errors from the final output. The value is the maximum allowable
857            number of individuals with less than the expected number of
858            haplotypes [Default: No filter]
859
860    -g, --genepop
861            Writes a genepop file using haplotypes. Must provide the name of
862            the genepop file.
863
864    -a, --ima
865            Writes a IMa file using haplotypes. Must provide the name of the
866            IMa file.
867
868    -p, --popmap
869            Tab-separated file of individuals and their population
870            designation, one per line (required for Genepop output)
871
872    -t, --tsvfile
873            Writes a tsv file using haplotypes - for mapping crosses only.
874            Must provide the name of the tsv file.
875
876    -p1, --parent1
877            Parent 1 of the mapping cross (must be specified if writing a
878            tsv file)
879
880    -p2, --parent2
881            Parent 2 of the mapping cross (must be specified if writing a
882            tsv file)
883
884    -e, --debug
885            Output extra logs for debugging purposes
886```
887We don't have enough time to go into depth with all these options and this tool is still under development.
888It also take some substantial resources to run. I will simulate running this for you.
889
890```bash
891#rad_haplotyper.pl -v SNP.DP3g95p5maf05.HWE.recode.vcf -x 40 -mp 1 -u 20 -ml 4 -n -r reference.fasta
892```
893Note, this will not actually run.  It needs all the BAM files to proceed.
894
895Here's the simulated output
896
897```bash
898Removed 0 loci (0 SNPs) with more than 20 SNPs at a locus
899Building haplotypes for BR_024
900Building haplotypes for BR_028
901Building haplotypes for WL_054
902Building haplotypes for BR_016
903Building haplotypes for BR_009
904Building haplotypes for BR_006
905Building haplotypes for BR_041
906Building haplotypes for BR_040
907Building haplotypes for BR_046
908Building haplotypes for BR_031
909Building haplotypes for BR_025
910Building haplotypes for BR_002
911Building haplotypes for WL_058
912Building haplotypes for WL_057
913Building haplotypes for WL_061
914Building haplotypes for WL_069
915Building haplotypes for WL_070
916Building haplotypes for BR_048
917Building haplotypes for WL_031
918Building haplotypes for WL_056
919Building haplotypes for BR_047
920Building haplotypes for WL_079
921Building haplotypes for WL_080
922Building haplotypes for WL_032
923Building haplotypes for WL_071
924Building haplotypes for WL_081
925Building haplotypes for BR_004
926Building haplotypes for BR_021
927Building haplotypes for BR_015
928Building haplotypes for BR_043
929Building haplotypes for WL_066
930Filtered 26 loci below missing data cutoff
931Filtered 66 possible paralogs
932Filtered 17 loci with low coverage or genotyping errors
933Filtered 0 loci with an excess of haplotypes
934```
935The script found another 109 loci to remove from our file.  Besides this output to the terminal, the script outputs a file called stats.out
936
937```bash
938head stats.out
939Locus		Sites	Haplotypes	Inds_Haplotyped	Total_Inds	Prop_Haplotyped	Status	Poss_Paralog	Low_Cov/Geno_Err	Miss_Geno	Comment
940E10001_L101	1		2			30				31			0.968			PASSED	0				0					1
941E10003_L101	7		9			30				31			0.968			PASSED	1				0					0
942E10004_L101	-		-			-				-			-				FILTERED0				0					1			Complex
943E10008_L101	1		2			30				31			0.968			PASSED	0				0					1
944E10013_L142	3		6			30				31			0.968			PASSED	0				0					1
945E10014_L117	2		3			31				31			1.000			PASSED	0				0					0
946E10024_L101	1		2			31				31			1.000			PASSED	0				0					0
947E10029_L101	1		2			31				31			1.000			PASSED	0				0					0
948```
949We can use this file to create a list of loci to filter
950
951```bash
952grep FILTERED stats.out | mawk '!/Complex/' | cut -f1 > loci.to.remove
953```
954Now that we have the list we can parse through the VCF file and remove the bad RAD loci
955I've made a simple script to do this remove.bad.hap.loci.sh
956
957```bash
958curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/remove.bad.hap.loci.sh
959chmod +x remove.bad.hap.loci.sh
960./remove.bad.hap.loci.sh loci.to.remove SNP.DP3g95p5maf05.HWE.recode.vcf
961```
962This produces a FINAL FINAL FINAL filtered VCF file	SNP.DP3g95p5maf05.HWE.filtered.vcf
963
964```bash
965mawk '!/#/' SNP.DP3g95p5maf05.HWE.filtered.vcf | wc -l
966```
967We're left with 7,666 SNPs!
968How many possible errors?
969
970```bash
971./ErrorCount.sh SNP.DP3g95p5maf05.HWE.filtered.vcf
972```
973
974```bash
975This script counts the number of potential genotyping errors due to low read depth
976It report a low range, based on a 50% binomial probability of observing the second allele in a heterozygote and a high range based on a 25% probability.
977Potential genotyping errors from genotypes from only 1 read range from 0 to 0.0
978Potential genotyping errors from genotypes from only 2 reads range from 0 to 0.0
979Potential genotyping errors from genotypes from only 3 reads range from 302 to 1014.72
980Potential genotyping errors from genotypes from only 4 reads range from 162 to 822.232
981Potential genotyping errors from genotypes from only 5 reads range from 88 to 669
98231 number of individuals and 7666 equals 237646 total genotypes
983Total genotypes not counting missing data 237081
984Total potential error rate is between 0.00232831816974 and 0.0105700245908
985SCORCHED EARTH SCENARIO
986WHAT IF ALL LOW DEPTH HOMOZYGOTE GENOTYPES ARE ERRORS?????
987The total SCORCHED EARTH error rate is 0.0330857386294.
988
989```
990
991**Congrats!  You've finished the Filtering Tutorial**
992