1###dDocent Filtering Tutorial###
2###Designed by Jon Puritz###
3
4###GOALS###
5##	1.  To learn how to use VCFtools to filter a VCF file for missing data, genotype depth, locus quality score, minor allele frequency, and genotype call depth
6##	2.	To learn how to use vcflib to filter FreeBayes VCF files generated with RAD data
7##	3.	To filter a VCF file for HWE within populations
8##	4.	How to decompose a VCF into SNPs and INDELs and
9##	5.	How to use a haplotyping script to further filter SNPs for paralogs and genotyping errors.
10###########
11
12##	This file is executable.  Simply type FilterTut [Step_Number] to skip to different steps without opening the whole file.
13## 	Command lines start with a "$"
14
15#1	Welcome to the SNP filtering exercise.  For the first part of the exercise, the filtering steps should work on almost any VCF file.
16#1	For the second part of the exercise, we are going to assume you are working with a VCF file that was generated by
17#1	FreeBayes.  Note that other SNP callers can be configured to include the same annotations.
18#1	Let's find our way back to your original working directory and make a new filtering directory
19#c#1	$mkdir filtering
20#c#1	$cd filtering
21#2
22#2	Now, let's download some data to look at.
23#c#2	$curl -L -o data.zip https://www.dropbox.com/sh/bf9jxviaoq57s5v/AAD2Kv5SPpHlZ7LC7sBz4va8a?dl=1
24#c#2	$unzip data.zip
25#c#3	$ll
26#3	total 165620
27#3	-rwxr--r--. 1 jpuritz users   109633 Mar  6 14:56 BR_004-RG.bam
28#3	-rwxr--r--. 1 jpuritz users   247496 Mar  6 14:57 BR_004-RG.bam.bai
29#3	-rwxr--r--. 1 jpuritz users   120045 Mar  6 15:14 BR_006-RG.bam
30#3	-rwxr--r--. 1 jpuritz users   247712 Mar  6 15:14 BR_006-RG.bam.bai
31#3	-rw-r--r--. 1 jpuritz users 78979977 Mar  6 16:15 data.zip
32#3	drwxr-xr-x. 2 jpuritz users       21 Mar  6 16:16 __MACOSX
33#3	-rwxr--r--. 1 jpuritz users      399 Mar  6 15:08 popmap
34#3	-rwxr--r--. 1 jpuritz users 68264393 Mar  6 13:40 raw.vcf.gz
35#3	-rwxr--r--. 1 jpuritz users  6804314 Mar  6 14:49 reference.fasta
36#3	-rwxr--r--. 1 jpuritz users  1085387 Mar  6 14:49 reference.fasta.amb
37#3	-rwxr--r--. 1 jpuritz users  1068884 Mar  6 14:49 reference.fasta.ann
38#3	-rwxr--r--. 1 jpuritz users  6379720 Mar  6 14:49 reference.fasta.bwt
39#3	-rwxr--r--. 1 jpuritz users   353544 Mar  6 14:49 reference.fasta.clstr
40#3	-rwxr--r--. 1 jpuritz users   976388 Mar  6 14:49 reference.fasta.fai
41#3	-rwxr--r--. 1 jpuritz users  1594913 Mar  6 14:49 reference.fasta.pac
42#3	-rwxr--r--. 1 jpuritz users  3189872 Mar  6 14:49 reference.fasta.sa
43#3	-rwxr--r--. 1 jpuritz users   137209 Mar  6 14:30 stats.out
44#4
45#4	To start, we are going to use the program VCFtools (http://vcftools.sourceforge.net) to filter our vcf file.  This program has a binary executable
46#4	and has several perl scripts as well that are useful for filtering.
47#4	I find it much more useful to use version 0.1.11, since it has more useful filtering commands (I think).  Let's load that version
48#4	This raw.vcf file is going to have a lot of erroneous variant calls and a lot of variants that are only present in one individual.
49#4	To make this file more manageable, let's start by applying three step filter.  We are going to only keep variants that have been successfully genotyped in
50#4	50% of individuals, a minimum quality score of 30, and a minor allele count of 3.
51#c#4	$vcftools --gzvcf raw.vcf.gz --max-missing 0.5 --minQ 30 --mac 3 --recode --recode-INFO-all --out raw.g5mac3
52#4
53#4	In this code, we call vcftools, feed it a vcf file after the --vcf flag, --max-missing 0.5 tells it to filter genotypes called below 50% (across all individuals)
54#4	the --mac 3 flag tells it to filter SNPs that have a minor allele count less than 3.
55#4	This is relative to genotypes, so it has to be called in at least 1 homozygote and 1 heterozygote or 3 heterozygotes.
56#4	The --recode flag tells the program to write a new vcf file with the filters, --recode-INFO-all keeps all the INFO flags from the old vcf file in the new one.
57#4	Lastly --out designates the name of the output
58#4	The output will scroll through a lot of lines, but should end like:
59#4	After filtering, kept 40 out of 40 Individuals
60#4	After filtering, kept 78434 out of a possible 147540 Sites
61#4	Outputting VCF file... Done
62#4	Run Time = 40.00 seconds
63#4	Those two simple filters got rid of 50% of the data and will make the next filtering steps run much faster.
64#5
65#5	We now have a filtered VCF called raw.g5mac3.recode.vcf.  There is also a logfile generated called raw.g5mac3.log
66#5	The next filter we will apply is a minimum depth for a genotype call and a minimum mean depth
67#c#5	$vcftools --vcf raw.g5mac3.recode.vcf --minDP 3 --recode --recode-INFO-all --out raw.g5mac3dp3
68#5	This command will recode genotypes that have less than 3 reads.
69#5	I'll give you a second to take a deep breath.
70#5	Yes, we are keeping genotypes with as few as 3 reads.  We talked about this in the lecture portion of this course, but the short answer is that
71#5	sophisticated multisample variant callers like FreeBayes and GATK can confidently call genotypes with few reads because variants are assessed across all
72#5	samples simultaneously.  So, the genotype is based on three reads AND prior information from all reads from all individuals.  Relax.  We will do plenty
73#5	of other filtering steps!
74#6
75#6	Don't believe me do you?  I've made a script to help evaluate the potential errors.
76#c#6	$curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/ErrorCount.sh
77#c#6	$chmod +x ErrorCount.sh
78#c#6	$./ErrorCount.sh raw.g5mac3dp3.recode.vcf
79#6	This script counts the number of potential genotyping errors due to low read depth
80#6	It report a low range, based on a 50% binomial probability of observing the second allele in a heterozygote and a high range based on a 25% probability.
81#6	Potential genotyping errors from genotypes from only 1 read range from 0 to 0.0
82#6	Potential genotyping errors from genotypes from only 2 reads range from 0 to 0.0
83#6	Potential genotyping errors from genotypes from only 3 reads range from 15986 to 53714.22
84#6	Potential genotyping errors from genotypes from only 4 reads range from 6230 to 31502.04
85#6	Potential genotyping errors from genotypes from only 5 reads range from 2493 to 18914
86#6	40 number of individuals and 78434 equals 3137360 total genotypes
87#6	Total genotypes not counting missing data 2380094
88#6	Total potential error rate is between 0.0103815227466 and 0.0437504821238
89#6
90#6	SCORCHED EARTH SCENARIO
91#6	WHAT IF ALL LOW DEPTH HOMOZYGOTE GENOTYPES ARE ERRORS?????
92#6	The total SCORCHED EARTH error rate is 0.129149100834.
93#6
94#6	Right now, the maximum error rate for our VCF file because of genotypes less than 5 reads is less than 5%.  See, nothing to worry about.
95#7
96#7	The next step is to get rid of individuals that did not sequence well.  We can do this by assessing individual levels of missing data.
97#c#7	$vcftools --vcf raw.g5mac3dp3.recode.vcf --missing-indv
98#7	This will create an output called out.imiss.  Let's examine it.
99#c#8	$cat out.imiss
100#8
101#8	INDV	N_DATA	N_GENOTYPES_FILTERED	N_MISS	F_MISS
102#8	BR_002	78434	0	13063	0.166548
103#8	BR_004	78434	0	16084	0.205064
104#8	BR_006	78434	0	25029	0.319109
105#8	BR_009	78434	0	30481	0.38862
106#8	BR_013	78434	0	69317	0.883762
107#8	BR_015	78434	0	8861	0.112974
108#8	BR_016	78434	0	29789	0.379797
109#8	BR_021	78434	0	17422	0.222123
110#8	BR_023	78434	0	43913	0.559872
111#8	BR_024	78434	0	24220	0.308795
112#8	BR_025	78434	0	21998	0.280465
113#8	BR_028	78434	0	26786	0.34151
114#8	BR_030	78434	0	74724	0.952699
115#8	BR_031	78434	0	26488	0.337711
116#8	BR_040	78434	0	19492	0.248515
117#8	BR_041	78434	0	17107	0.218107
118#8	BR_043	78434	0	16384	0.208889
119#8	BR_046	78434	0	28770	0.366805
120#8	BR_047	78434	0	13258	0.169034
121#8	BR_048	78434	0	24505	0.312428
122#8	WL_031	78434	0	22566	0.287707
123#8	WL_032	78434	0	22604	0.288191
124#8	WL_054	78434	0	32902	0.419486
125#8	WL_056	78434	0	34106	0.434837
126#8	WL_057	78434	0	37556	0.478823
127#8	WL_058	78434	0	31448	0.400949
128#8	WL_061	78434	0	35671	0.45479
129#8	WL_064	78434	0	47816	0.609634
130#8	WL_066	78434	0	10062	0.128286
131#8	WL_067	78434	0	47940	0.611215
132#8	WL_069	78434	0	38260	0.487799
133#8	WL_070	78434	0	21188	0.270138
134#8	WL_071	78434	0	16692	0.212816
135#8	WL_072	78434	0	46347	0.590904
136#8	WL_076	78434	0	78178	0.996736
137#8	WL_077	78434	0	55193	0.703687
138#8	WL_078	78434	0	54400	0.693577
139#8	WL_079	78434	0	19457	0.248068
140#8	WL_080	78434	0	30076	0.383456
141#8	WL_081	78434	0	30334	0.386746
142#8
143#8 You can see that some individuals have as high as 99.6% missing data.  We definitely want to filter those out.  Let's take a look at a histogram
144#c#9	$mawk '!/IN/' out.imiss | cut -f5 > totalmissing
145#c#9	$gnuplot << \EOF
146#c#9	$set terminal dumb size 120, 30
147#c#9	$set autoscale
148#c#9	$unset label
149#c#9	$set title "Histogram of % missing data per individual"
150#c#9	$set ylabel "Number of Occurrences"
151#c#9	$set xlabel "% of missing data"
152#c#9	$#set yr [0:100000]
153#c#9	$binwidth=0.01
154#c#9	$bin(x,width)=width*floor(x/width) + binwidth/2.0
155#c#9	$plot 'totalmissing' using (bin($1,binwidth)):(1.0) smooth freq with boxes
156#c#9	$pause -1
157#c#9	$EOF
158#9
159#9                                         Histogram of % missing data per individual
160#9  Number of Occurrences
161#9      3 ++----------+---------***---------***-----------+------------+-----------+-----------+-----------+----------++
162#9        +           +         * *         * *           +       'totalmissing' using (bin($1,binwidth)):(1.0) ****** +
163#9        |                     * *         * *                                                                        |
164#9        |                     * *         * *                                                                        |
165#9        |                     * *         * *                                                                        |
166#9        |                     * *         * *                                                                        |
167#9    2.5 ++                    * *         * *                                                                       ++
168#9        |                     * *         * *                                                                        |
169#9        |                     * *         * *                                                                        |
170#9        |                     * *         * *                                                                        |
171#9        |                     * *         * *                                                                        |
172#9        |                     * *         * *                                                                        |
173#9      2 ++   ***************  * ****      * *                                                                       ++
174#9        |    *    *  * **  *  * ** *      * *                                                                        |
175#9        |    *    *  * **  *  * ** *      * *                                                                        |
176#9        |    *    *  * **  *  * ** *      * *                                                                        |
177#9        |    *    *  * **  *  * ** *      * *                                                                        |
178#9        |    *    *  * **  *  * ** *      * *                                                                        |
179#9    1.5 ++   *    *  * **  *  * ** *      * *                                                                       ++
180#9        |    *    *  * **  *  * ** *      * *                                                                        |
181#9        |    *    *  * **  *  * ** *      * *                                                                        |
182#9        |    *    *  * **  *  * ** *      * *                                                                        |
183#9        |    *    *  * **  *  * ** *      * *                                                                        |
184#9        +    *    * +* **  *  * ** *      * *           +            +           +           +           +           +
185#9      1 +*************************************************************************************************************
186#9       0.1         0.2         0.3         0.4         0.5          0.6         0.7         0.8         0.9          1
187#9                                                      % of missing data
188#9
189#9	Most of the individuals have less than 0.5 missing data.  That is probably a good cutoff to use for the moment.
190#10
191#10	Now we need to create a list of individuals with more than 50% missing data.  Anyone have any ideas?
192#10
193#10
194#10	We can use mawk to do it.
195#c#10	$mawk '$5 > 0.5' out.imiss | cut -f1 > lowDP.indv
196#10	Who can explain what this is doing?
197#11
198#11	Now that we have a list of individuals to remove, we can feed that directly into VCFtools for filtering.
199#c#11	$vcftools --vcf raw.g5mac3dp3.recode.vcf --remove lowDP.indv --recode --recode-INFO-all --out raw.g5mac3dplm
200#11	As you can see from the output, this removed 9 individuals.
201#12
202#12	I've included a script called filter_missing_ind.sh that will automate this process for you in the future.  Try it out.
203#c#12	$curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/filter_missing_ind.sh
204#c#12	$chmod +x filter_missing_ind.sh
205#c#12	$./filter_missing_ind.sh raw.g5mac3dp3.recode.vcf DP3g95maf05
206#12	The command always follows the structure of filter_missing_ind.sh vcf_to_filter name_prefix_for_new_vcf
207#12	The script prints out a histogram like the one above and also calculates the 85% for missing data.
208#13	Now that we have removed poor coverage individuals, we can restrict the data to variants called in a high percentage of individuals and filter by mean depth of genotypes
209#c#13	$vcftools --vcf raw.g5mac3dplm.recode.vcf --max-missing 0.95 --maf 0.05 --recode --recode-INFO-all --out DP3g95maf05 --min-meanDP 20
210#13	This leaves us with about 12,754 loci in our filtered vcf file.
211#14
212#14	This applied a genotype call rate (95%) across all individuals.  With two localities, this is sufficient, but when you have multiple localities being sampled
213#14	You are also going to want to filter by a population specific call rate.  VCFtools won't calculate this directly, but it is an easy workaround.
214#14	First we need a file to define localities (populations).  Most programs want the file to have two tab separated columns.  First with the sample name, second with population assignment.
215#14	I've already made one for this exercise.
216#c#14	$cat popmap
217#14
218#14	BR_002	BR
219#14	BR_004	BR
220#14	BR_006	BR
221#14	BR_009	BR
222#14	BR_013	BR
223#14	BR_015	BR
224#14	BR_016	BR
225#14	BR_021	BR
226#14	BR_023	BR
227#14	BR_024	BR
228#14	BR_025	BR
229#14	BR_028	BR
230#14	BR_030	BR
231#14	BR_031	BR
232#14	BR_040	BR
233#14	BR_041	BR
234#14	BR_043	BR
235#14	BR_046	BR
236#14	BR_047	BR
237#14	BR_048	BR
238#14	WL_031	WL
239#14	WL_032	WL
240#14	WL_054	WL
241#14	WL_056	WL
242#14	WL_057	WL
243#14	WL_058	WL
244#14	WL_061	WL
245#14	WL_064	WL
246#14	WL_066	WL
247#14	WL_067	WL
248#14	WL_069	WL
249#14	WL_070	WL
250#14	WL_071	WL
251#14	WL_072	WL
252#14	WL_076	WL
253#14	WL_077	WL
254#14	WL_078	WL
255#14	WL_079	WL
256#14	WL_080	WL
257#14	WL_081	WL
258#15 Now we need to create two lists that have just the individual names for each population
259#c#15	$mawk '$2 == "BR"' popmap > 1.keep && mawk '$2 == "WL"' popmap > 2.keep
260#15	The above line demonstrates the use of && to simultaneous execute two tasks.
261#16
262#16	Next, we use VCFtools to estimate missing data for loci in each population
263#c#16	$vcftools --vcf DP3g95maf05.recode.vcf --keep 1.keep --missing-site --out 1
264#c#16	$vcftools --vcf DP3g95maf05.recode.vcf --keep 2.keep --missing-site --out 2
265#16	This will generate files named 1.lmiss and 2.lmiss
266#17
267#17	They follow this format
268#c#17	$head -3 1.lmiss
269#17	CHR		POS	N_DATA	N_GENOTYPE_FILTERED	N_MISS	F_MISS
270#17	E1_L101	9	34		0					0		0
271#17	E1_L101	15	34		0					0		0
272#17
273#17	I added extra tabs to make this easier to read, but what we are interested in is that last column with is the percentage of missing data for that locus.
274#18	We can combine the two files and make a list of loci about the threshold of 10% missing data to remove.  Note this is double the overall rate of missing data.
275#c#18	$cat 1.lmiss 2.lmiss | mawk '!/CHROM/' | mawk '$6 > 0.1' | cut -f1,2 >> badloci
276#18	Who can walk us through that line of code?
277#19
278#19	We then feed this file back into VCFtools to remove any of the loci
279#c#19	$vcftools --vcf DP3g95maf05.recode.vcf --exclude-positions badloci --recode --recode-INFO-all --out DP3g95p5maf05
280#19	Again, we only had two populations so our overall filter caught all of these.  However, this will not be the case in multi-locality studies
281#19	I also have made a script to automate this process as well.  It's called pop_missing_filter.sh
282#20	Executing it with no parameters will give you the usage.
283#c#20	$curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/pop_missing_filter.sh
284#c#20	$chmod +x pop_missing_filter.sh
285#c#20	$./pop_missing_filter.sh
286#20	Usage is pop_missing_filter vcffile popmap percent_missing_per_pop number_of_pops_for_cutoff name_for_output
287#21
288#21	From this point forward, the filtering steps assume that the vcf file was generated by FreeBayes.
289#21	Note that other SNP callers can be configured to include the similar annotations.
290#21
291#21	FreeBayes outputs a lot of information about a locus in the VCF file, using this information and the properties of RADseq, we add some sophisticated filters to the data.
292#21	Let's take a look at the header of our VCF file and take a quick look at all the information.
293#c#21	$mawk '/#/' DP3g95maf05.recode.vcf
294#21	This will output several lines of INFO tags, I have highlighted a few below:
295#21	##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
296#21	##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
297#21	INFO=<ID=QR,Number=1,Type=Integer,Description="Reference allele quality sum in phred">
298#21	##INFO=<ID=QA,Number=A,Type=Integer,Description="Alternate allele quality sum in phred">
299#21	##INFO=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
300#21	##INFO=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
301#21	##INFO=<ID=SAF,Number=A,Type=Integer,Description="Number of alternate observations on the forward strand">
302#21	##INFO=<ID=SAR,Number=A,Type=Integer,Description="Number of alternate observations on the reverse strand">
303#21	##INFO=<ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">
304#21	##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
305#21	##INFO=<ID=CIGAR,Number=A,Type=String,Description="The extended CIGAR representation of each alternate allele, with the exception that '=' is replaced by 'M' to ease VCF parsing.  Note that INDEL alleles do not have the first matched base (which is provided by default, per the spec) referred to by the CIGAR.">
306#21	##INFO=<ID=MQM,Number=A,Type=Float,Description="Mean mapping quality of observed alternate alleles">
307#21	##INFO=<ID=MQMR,Number=1,Type=Float,Description="Mean mapping quality of observed reference alleles">
308#21	##INFO=<ID=PAIRED,Number=A,Type=Float,Description="Proportion of observed alternate alleles which are supported by properly paired read fragments">
309#21	##INFO=<ID=PAIREDR,Number=1,Type=Float,Description="Proportion of observed reference alleles which are supported by properly paired read fragments">
310#21	##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
311#21	##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
312#21	##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
313#21	##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
314#21	##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
315#21	##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
316#21	##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
317#21	##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
318#22
319#22	The first filter we will apply will be on allele balance.  Allele balance is:
320#22	a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous
321#22	Because RADseq targets specific locations of the genome, we expect that the allele balance in our data (for real loci) should be close to 0.5
322#22	We can use the vcffilter program from vcflib.  (https://github.com/ekg/vcflib)
323#22	Typing it with no parameters will give you the usage.
324#c#22	$vcffilter
325#22	usage: vcffilter [options] <vcf file>
326#22
327#22	options:
328#22	    -f, --info-filter     specifies a filter to apply to the info fields of records,
329#22	                          removes alleles which do not pass the filter
330#22	    -g, --genotype-filter specifies a filter to apply to the genotype fields of records
331#22	    -k, --keep-info       used in conjunction with '-g', keeps variant info, but removes genotype
332#22	    -s, --filter-sites    filter entire records, not just alleles
333#22	    -t, --tag-pass        tag vcf records as positively filtered with this tag, print all records
334#22	    -F, --tag-fail        tag vcf records as negatively filtered with this tag, print all records
335#22	    -A, --append-filter   append the existing filter tag, don't just replace it
336#22	    -a, --allele-tag      apply -t on a per-allele basis.  adds or sets the corresponding INFO field tag
337#22	    -v, --invert          inverts the filter, e.g. grep -v
338#22	    -o, --or              use logical OR instead of AND to combine filters
339#22	    -r, --region          specify a region on which to target the filtering, requires a BGZF
340#22	                          compressed file which has been indexed with tabix.  any number of
341#22	                          regions may be specified.
342#23
343#23	Let's use our first filter
344#c#23	$vcffilter -s -f "AB > 0.25 & AB < 0.75 | AB < 0.01" DP3g95p5maf05.recode.vcf > DP3g95p5maf05.fil1.vcf
345#23	vcffilter works with simple conditional statements, so this filters out loci with an allele balance below 0.25 and above 0.75.  However, it does include those that are close to zero.
346#23	The last condition is to catch loci that are fixed variants (all individuals are homozygous for one of the two variants).
347#23	The -s tells the filter to apply to sites, not just alleles
348#24	To see how many loci are now in the VCF file, you could feed it into VCFtools or you can just use a simple mawk statement
349#c#24	$mawk '!/#/' DP3g95p5maf05.recode.vcf | wc -l
350#24	12754
351#c#24	$mawk '!/#/' DP3g95p5maf05.fil1.vcf | wc -l
352#24	9678
353#24	You'll notice that we've filtered a lot of loci.  In my experience though, I find that most of these tend to be errors of some kind.
354#24	However, this will be data dependent.  I encourage you to explore your own data sets.
355#25
356#25	The next filter we will apply filters out sites that have reads from both strands.  For GWAS and even RNAseq, this can be a good thing.
357#25	Unless you are using super small genomic fragment or really long reads (MiSeq).  A SNP should be covered only by forward or only reverse reads.
358#c#25	$vcffilter -f "SAF / SAR > 100 & SRF / SRR > 100 | SAR / SAF > 100 & SRR / SRF > 100" -s DP3g95p5maf05.fil1.vcf > DP3g95p5maf05.fil2.vcf
359#25	The filter is based on proportions, so that a few extraneous reads won't remove an entire locus.  In plain english, it's keeping loci that have over 100 times
360#25	more  forward alternate reads than reverse alternate reads and 100 times more forward reference reads than reverse reference reads along with the reciprocal.
361#26
362#c#26	$mawk '!/#/' DP3g95p5maf05.fil2.vcf | wc -l
363#26	9491
364#26	That only removes a small proportion of loci, but these loci are likely to be paralogs, microbe contamination, or weird PCR chimeras.
365#27
366#27	The program SAMtools is a great way to visualize alignments right from the terminal.
367#27
368#c#27	$samtools tview BR_006-RG.bam reference.fasta -p E28188_L151
369#27
370#27	         11        21        31        41        51        61        71        81        91        101       111       121	 131
371#27	AATTCTCAGAGCTAGAGTGGGGACGGCAGTTGGTAGAGGGTACAGCAGTTCTAAAAACATGTAGAAATTTTCTCTTCAACTCGCTCCTACGGCCACAGCGTTCACTCCACATACACAAATTGTACACCAAAACATAGGAAAAG
372#27	...........S...........Y.K......S.........G.......K.........S............................Y........Y....W.........................M...G.........
373#27	..........................................G.......G......................................T.....
374#27	..........................................G.......T............................................
375#27	...........G...........T.T......C.........G.................C...............................
376#27	..........................................G.......T....................................G.......
377#27	..........................................G.......T............................................
378#27	...........G...........T.T......C.........G.................C..................................
379#27	..........................................G.......T............................................
380#27	...........G.................C............G.......G.................******........A...............T..
381#27	                                            ,,,,,,g,,,,,,,,,,,,,,,,,******,,,,,,,,a,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,
382#27	                                                  ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,g,,,,,,,,,
383#27	                                                  t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
384#27	                                                  ,,,,,,,,,,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,
385#27	                                                  g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
386#27	                                                  t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,
387#27	                                                  g,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
388#27	                                                  ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,g,,,,,,,,,
389#27	                                                  ,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,g,,,,,,,,,
390#27	                                                  t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
391#27	                                                  g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,a,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
392#27	                                                  g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
393#27	                                                  ,,,,,,,,,,c,g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,
394#27	                                                  g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
395#27	                                                  t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
396#27	                                                  g,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
397#27	                                                  g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
398#27
399#27	As you can see this is a mess.  There appear to be  over two haplotypes here.
400#27	For more info on how to use samtools tview press the question mark while you are in the window.
401#28
402#28	The next filter looks at the ratio of mapping qualities between reference and alternate alleles
403#c#28	$vcffilter -f "MQM / MQMR > 0.9 & MQM / MQMR < 1.05" DP3g95p5maf05.fil2.vcf > DP3g95p5maf05.fil3.vcf
404#28	The rationale here is that, again, because RADseq loci and alleles all should start from the same genomic location there should not be large discrepancy
405#28	between the mapping qualities of two alleles.
406#c#29	$mawk '!/#/' DP3g95p5maf05.fil3.vcf | wc -l
407#29	9229
408#29	This filters away less than 3% of the variants, but likely need to be filtered.  Let's take a look at one
409#c#30	$samtools tview BR_004-RG.bam reference.fasta -p E20_L173
410#30	1         11        21        31        41        51        61        71        81        91        101       111	121	  131
411#30	NAATTCATCTGTTGCAGGCAGCTCACACTTGCAGCCTCGGCTCGCACCAGCAGAGCAGCCGTAGAATACTTAGTTTAATAGAATGGCTTGGCATTTNNNNNNNNNNCATGAGGTTGTTATTCTCAGAAGACTAATCACAGACA
412#30	 .......Y.........YM....WS...Y....S...R....R....................................................C         ....................G................
413#30	 .......T.........TC....TG...C....G...A....A..................................                            ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,
414#30	 .......T.........TC.....G...C....G...A....A..................................                            ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,
415#30	 .......T.........TC....TG...C....G...A....A..................................                            ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,
416#30	 .......T.........TC.....G...C....G...A....A..................................                            ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,
417#30	 .......T.........TC.....G...C....G...A....A..................................                            ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,
418#30	 .......T.........TC....TG...C....G...A....A..................................
419#30	 .......T.........TC.....G...C....G...A....A..................................
420#30	 .......T.........TC.....G...C....G...A....A..................................
421#30	 .......T.........TC....TG...C....G...A....A..................................
422#30	 .......T.........TC....TG...C....G...A....A..................................
423#30	 ..........C....................................................................................C
424#30	 .......T.........TC.....G...C....G...A....A..................................
425#30	 ..........C....................................................................................C
426#30	 .......T.........TC.....G...C....G...A....A..................................
427#30	 ......GT.........TC.....G...C....G...A....A..................................
428#30	 .......T.........TC....TG...C....G...A....A..................................
429#30
430#30	There is a large amount of clipping going on here for the variant alleles likely why the mapping quality is low for them.  You can also see
431#30	that there are three different alleles present here.  Press SHIFT+L to scroll further down the alignment.  You can see that some of the polymorphism
432#30	is also link to a cut site variant.  All things that should be avoided.
433#31
434#31	Yet another filter that can be applied is whether or not their is a discrepancy in the properly paired status of for reads supporting reference or alternate alleles.
435#c#31	$vcffilter -f "PAIRED > 0.05 & PAIREDR > 0.05 & PAIREDR / PAIRED < 1.75 & PAIREDR / PAIRED > 0.25 | PAIRED < 0.05 & PAIREDR < 0.05" -s DP3g95p5maf05.fil3.vcf > DP3g95p5maf05.fil4.vcf
436#31	Since de novo assembly is not perfect, some loci will only have unpaired reads mapping to them.  This is not a problem.  The problem occurs when all the reads supporting the reference allele
437#31	are paired but not supporting the alternate allele.  That is indicative of a problem.
438#c#32	$mawk '!/#/' DP3g95p5maf05.fil4.vcf | wc -l
439#32	9166
440#32 Our loci count keeps dwindling, but our signal to noise ration keeps increasing.  Let's look at an example of what we filtered.
441#c#33	$samtools tview BR_006-RG.bam reference.fasta -p E4407_L138
442#33	This output doesn't paste well to the terminal, but you can see the clear discrepancy between mapping status and allele status.  This could be indicative of cut site polymorphism or paralogs.
443#34	The next filter we will apply is to look at the ration of locus quality score to depth
444#34	Heng Li found some interesting results about how quality score and locus depth are related to each other in real and spurious variant calls
445#34	See his preprint here (http://arxiv.org/pdf/1404.0929.pdf)
446#34	Also see this great blog post about it here (http://bcb.io/2014/05/12/wgs-trio-variant-evaluation/)  I REALLY REALLY recommend following that blog.  Brad Chapman's group is really good.
447#34	In short, with whole genome samples, it was found that high coverage can lead to inflated locus quality scores.  Heng proposed that for read depths greater than the mean depth plus 2-3 times
448#34 the square root of mean depth that the quality score will be twice as large as the depth in real variants and below that value for false variants.
449#34	I actually found that this is a little too conservative for RADseq data, likely because the reads aren't randomly distributed across contigs.  I implement two filters based on this idea.
450#34 the first is removing any locus that has a quality score below 1/4 of the depth.
451#c#34	$vcffilter -f "QUAL / DP > 0.25" DP3g95p5maf05.fil4.vcf > DP3g95p5maf05.fil5.vcf
452#35	The second is more complicated.  The first step is to create a list of the depth of each locus
453#c#35	$cut -f8 DP3g95p5maf05.fil5.vcf | grep -oe "DP=[0-9]*" | sed -s 's/DP=//g' > DP3g95p5maf05.fil5.DEPTH
454#35	Who can talk us through this line of code?
455#36	The second step is to create a list of quality scores.
456#c#36	$mawk '!/#/' DP3g95p5maf05.fil5.vcf | cut -f1,2,6 > DP3g95p5maf05.fil5.vcf.loci.qual
457#37	Next step is to calculate the mean depth
458#c#37	$mawk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }' DP3g95p5maf05.fil5.DEPTH
459#37	1952.82
460#38	Now the the mean plus 3X the square of the mean
461#c#38	$python -c "print int(1952+3*(1952**0.5))"
462#38	2084
463#39	Next we paste the depth and quality files together and find the loci above the cutoff that do not have quality scores 2 times the depth
464#c#39	$paste DP3g95p5maf05.fil5.vcf.loci.qual DP3g95p5maf05.fil5.DEPTH | mawk -v x=2084 '$4 > x' | mawk '$3 < 2 * $4' > DP3g95p5maf05.fil5.lowQDloci
465#40	Now we can remove those sites and recalculate the depth across loci with VCFtools
466#c#40	$vcftools --vcf DP3g95p5maf05.fil5.vcf --site-depth --exclude-positions DP3g95p5maf05.fil5.lowQDloci --out DP3g95p5maf05.fil5
467#41
468#41	Now let's take VCFtools output and cut it to only the depth scores
469#c#41	$cut -f3 DP3g95p5maf05.fil5.ldepth > DP3g95p5maf05.fil5.site.depth
470#42
471#42	Now let's calculate the average depth by dividing the above file by the number of individuals 31
472#c#42	$mawk '!/D/' DP3g95p5maf05.fil5.site.depth | mawk -v x=31 '{print $1/x}' > meandepthpersite
473#43	Let's plot the data as a histogram
474#c#43	$gnuplot << \EOF
475#c#43	$set terminal dumb size 120, 30
476#c#43	$set autoscale
477#c#43	$set xrange [10:150]
478#c#43	$unset label
479#c#43	$set title "Histogram of mean depth per site"
480#c#43	$set ylabel "Number of Occurrences"
481#c#43	$set xlabel "Mean Depth"
482#c#43	$binwidth=1
483#c#43	$bin(x,width)=width*floor(x/width) + binwidth/2.0
484#c#43	$set xtics 5
485#c#43	$plot 'meandepthpersite' using (bin($1,binwidth)):(1.0) smooth freq with boxes
486#c#43	$pause -1
487#c#43	$EOF
488#43
489#43	                                              Histogram of mean depth per site
490#43	  Number of Occurrences
491#43	    250 ++--+---+---+---+--+---+---+---+---+---+---+---+---+---+--+---+---+---+---+---+---+---+---+--+---+---+---+--++
492#43	        +   +   +   +   +  +   +   +   +   +   +   +   +   +'meandepthpersite' using (bin($1,binwidth)):(1.0)+****** +
493#43	        |               **    **                                                                                     |
494#43	        |               ***   **                                                                                     |
495#43	        |               ********                                                                                     |
496#43	    200 ++              ********* *                                                                                 ++
497#43	        |           ** ********** *                                                                                  |
498#43	        |           ************* **  *                                                                              |
499#43	        |           ****************  *  *                                                                           |
500#43	        |           *******************  **                                                                          |
501#43	    150 ++          ******************* ***                                                                         ++
502#43	        |           *************************                                                                        |
503#43	        |           ***************************                                                                      |
504#43	        |       *******************************                                                                      |
505#43	    100 ++      **********************************                                                                  ++
506#43	        |       **********************************                                                                   |
507#43	        |       ************************************** *                                                             |
508#43	        |       ************************************** * **                                                          |
509#43	        |       *********************************************                                                        |
510#43	     50 ++      ************************************************                                                    ++
511#43	        |       ************************************************   **   **                                           |
512#43	        |       ***********************************************************  **                                      |
513#43	        |       ****************************************************************  **  **                 **          |
514#43	        +   +   ************************************************************************************** ******+********
515#43	      0 ++--+---******************************************************************************************************
516#43	        10  15  20  25  30 35  40  45  50  55  60  65  70  75  80 85  90  95 100 105 110 115 120 125130 135 140 145 150
517#43                                                         Mean Depth
518#43
519#44	Loci that have high mean depth are indicative of either paralogs or multicopy loci.  Either way we want to remove them.  Here, I'd
520#44	remove all loci above a mean depth of 102.5.  Now we can combine both filters to produce another VCF file
521#c#44	$vcftools --vcf  DP3g95p5maf05.fil5.vcf --recode-INFO-all --out DP3g95p5maf05.FIL --max-meanDP 102.5 --exclude-positions DP3g95p5maf05.fil5.lowQDloci --recode
522#44	In the end, VCFtools kept 8417 out of a possible 9164 Sites.
523#44	BTW, I've also written a script to automate the filterings steps described in steps 23-44.  It's called dDocent_filters.  It will go through the filtering steps and
524#44	recode a log file for you for each of the steps, including the depth histogram.
525#c#44	$curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/dDocent_filters
526#c#44	$chmod +x dDocent_filters
527#c#44	$./dDocent_filters
528#44	This script will automatically filter a FreeBayes generated VCF file using criteria related to site depth,
529#44	quality versus depth, strand representation, allelic balance at heterzygous individuals, and paired read representation.
530#44	The script assumes that loci and individuals with low call rates (or depth) have already been removed.
531#44
532#44	Contact Jon Puritz (jpuritz@gmail.com) for questions and see script comments for more details on particular filters
533#44
534#44	Usage is sh dDocent_filters.sh VCF_file Output_prefix
535#45
536#45	The next filter to apply is HWE.  Heng Li also found that HWE is another excellent filter to remove erroneous variant calls.
537#45	We don't want to apply it across the board, since population structure will create departures from HWE as well.  We need to apply this by population.
538#45 I've included a perl script written by Chris Hollenbeck, one of the PhD student's in my current lab that will do this for us.
539#c#45	$curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/filter_hwe_by_pop.pl
540#c#45	$chmod +x filter_hwe_by_pop.pl
541#c#45	$./filter_hwe_by_pop.pl
542#45 Usage:
543#45     filter_hwe_by_pop.pl -v <vcffile> -p <popmap> [options]
544#45
545#45     Options: -v <vcffile> input vcf file -p <popmap> tab-separated file of
546#45     samples and population designations -h [hwe] minimum Hardy-Weinberg
547#45     p-value cutoff for SNPs -c [cutoff] proportion of all populations that a
548#45     locus can be below HWE cutoff without being filtered -o [out] name of
549#45     outfile
550#45
551#45 Options:
552#45     -v, --vcffile
553#45             VCF input file
554#45
555#45     -p, --popmap
556#45             File with names of individuals and population designations, one
557#45             per line
558#45
559#45     -h, --hwe
560#45             Minimum cutoff for Hardy-Weinberg p-value (for test as
561#45             implemented in vcftools) [Default: 0.001]
562#45
563#45     -c, --cutoff
564#45             Proportion of all populations that a locus can be below HWE
565#45             cutoff without being filtered. For example, choosing 0.5 will
566#45             filter SNPs that are below the p-value threshold in 50% or more
567#45             of the populations. [Default: 0.25]
568#45
569#45     -o, --out
570#45             Name of outfile, by vcftools conventions (will be named
571#45             X.recode.vcf)
572#46	Let's filter our SNPs by population specific HWE
573#46 First, we need to convert our variant calls to SNPs
574#46 To do this we will use another command from vcflib called vcfallelicprimatives
575#c#46	$vcfallelicprimitives DP3g95p5maf05.FIL.recode.vcf --keep-info --keep-geno > DP3g95p5maf05.prim.vcf
576#46	This will decompose complex variant calls into phased SNP and INDEL genotypes and keep the INFO flags for loci and genotypes.
577#47	Next, we can feed this VCF file into VCFtools to remove indels.
578#c#47	$vcftools --vcf DP3g95p5maf05.prim.vcf --remove-indels --recode --recode-INFO-all --out SNP.DP3g95p5maf05
579#47	We now have 8379 SNP calls in our new VCF.
580#48
581#48 Now, let's apply the HWE filter.
582#c#48	$perl filter_hwe_by_pop.pl -v SNP.DP3g95p5maf05.recode.vcf -p popmap -o SNP.DP3g95p5maf05.HWE -h 0.01
583#49
584#49 Processing population: BR (20 inds)
585#49 Processing population: WL (20 inds)
586#49 Outputting results of HWE test for filtered loci to 'filtered.hwe'
587#49 Kept 8176 of a possible 8379 loci (filtered 203 loci)
588#49
589#49 Note, I would not normally use such a high `-h` value.  It's purely for this example.
590#50
591#50 We have now created a thoroughly filtered VCF, and we should have confidence in these SNP calls.
592#50	However, our lab is currently developing one more script, called rad_haplotyper.
593#50	This tool takes a VCF file of SNPs and will parse through BAM files looking to link SNPs into haplotypes along paired reads.
594#c#50	$curl -L -O https://raw.githubusercontent.com/jpuritz/WinterSchool.2016/master/rad_haplotyper.pl
595#c#50	$chmod +x rad_haplotyper.pl
596#50	Note, this script requires several Perl libraries.  See the README at https://github.com/chollenbeck/rad_haplotyper
597#50	It has a lot of options, let's take a look
598#c#50	$./rad_haplotyper.pl
599#50	Usage:
600#50	    perl rad_haplotyper.pl -v <vcffile> -r <reference> [options]
601#50
602#50	    Options: -v <vcffile> input vcf file
603#50
604#50	             -r     <reference>             reference genome
605#50
606#50	             -s     [samples]               optionally specify an individual sample to be haplotyped
607#50
608#50	             -u     [snp_cutoff]            remove loci with more than a specified number of SNPs
609#50
610#50	             -h     [hap_cutoff]            remove loci with more than a specified number of haplotypes relative to SNPs
611#50
612#50	             -m     [miss_cutoff]           cutoff for proportion of missing data for loci to be included in the output
613#50
614#50	             -mp    [max_paralog_inds]              cutoff for excluding possible paralogs
615#50
616#50	             -ml    [max_low_cov_inds]              cutoff for excluding loci with low coverage or genotyping errors
617#50
618#50	             -d     [depth]                 sampling depth used by the algorithm to build haplotypes
619#50
620#50	             -g     [genepop]               genepop file for population output
621#50
622#50	             -p     [popmap]                population map for organizing Genepop file
623#50
624#50	             -t     [tsvfile]               tsv file for linkage map output
625#50
626#50	             -a     [imafile]               IMa file output
627#50
628#50	             -p1    [parent1]               first parent in the mapping cross
629#50
630#50	             -p2    [parent2]               second parent in the mapping cross
631#50
632#50	             -x     [threads]               number of threads to use for the analysis
633#50
634#50	             -n                             use indels
635#50
636#50	             -e                             debug
637#50
638#50	Options:
639#50	    -v, --vcffile
640#50	            VCF input file
641#50
642#50	    -r, --reference
643#50	            Reference genome (FASTA format)
644#50
645#50	    -s, --samples
646#50	            Individual samples to use in the analysis - can be used multiple
647#50	            times for multiple individuals [Default: All]
648#50
649#50	    -u, --cutoff
650#50	            Excludes loci with more than the specified number of SNPs
651#50	            [Default: No filter]
652#50
653#50	    -h, --hap_count
654#50	            Excludes loci with more than the specified number of haplotypes
655#50	            relative to number of SNPs. Excluding forces other than mutation
656#50	            (i.e. recombination) the maximum number of haplotypes should be
657#50	            one more than the number of SNPs at the locus. The value
658#50	            provided is the number of haplotypes allowed in excess of the
659#50	            number of SNPs, which allows that mechanisms other than mutation
660#50	            may have influenced the number of haplotypes in the population.
661#50	            [Default: 100]
662#50
663#50	    -x, --threads
664#50	            Run in parallel across individuals with a specified number of
665#50	            threads
666#50
667#50	    -n, --indels
668#50	            Includes indels that are the only polymorphism at the locus
669#50	            (tag)
670#50
671#50	    -d, --depth
672#50	            Specify a depth of sampling reads for building haplotypes
673#50	            [Default: 20]
674#50
675#50	    -m, --miss_cutoff
676#50	            Proportion of missing data cutoff for removing loci from the
677#50	            final output. For example, to keep only loci with successful
678#50	            haplotype builds in 95% of individuals, enter 0.95. [Default:
679#50	            0.9]
680#50
681#50	    -mp, --max_paralog_inds
682#50	            Count cutoff for removing loci that are possible paralogs from
683#50	            the final output. The value is the maximum allowable number of
684#50	            individuals with more than the expected number of haplotypes
685#50	            [Default: No filter]
686#50
687#50	    -ml, --max_low_cov_inds
688#50	            Count cutoff for removing loci with low coverage or genotyping
689#50	            errors from the final output. The value is the maximum allowable
690#50	            number of individuals with less than the expected number of
691#50	            haplotypes [Default: No filter]
692#50
693#50	    -g, --genepop
694#50	            Writes a genepop file using haplotypes. Must provide the name of
695#50	            the genepop file.
696#50
697#50	    -a, --ima
698#50	            Writes a IMa file using haplotypes. Must provide the name of the
699#50	            IMa file.
700#50
701#50	    -p, --popmap
702#50	            Tab-separated file of individuals and their population
703#50	            designation, one per line (required for Genepop output)
704#50
705#50	    -t, --tsvfile
706#50	            Writes a tsv file using haplotypes - for mapping crosses only.
707#50	            Must provide the name of the tsv file.
708#50
709#50	    -p1, --parent1
710#50	            Parent 1 of the mapping cross (must be specified if writing a
711#50	            tsv file)
712#50
713#50	    -p2, --parent2
714#50	            Parent 2 of the mapping cross (must be specified if writing a
715#50	            tsv file)
716#50
717#50	    -e, --debug
718#50	            Output extra logs for debugging purposes
719#50
720#51 We don't have enough time to go into depth with all these options and this tool is still under development.
721#51	It also take some substantial resources to run. I will simulate running this for you.
722#c#51	$#rad_haplotyper.pl -v SNP.DP3g95p5maf05.HWE.recode.vcf -x 40 -mp 1 -u 20 -ml 4 -n -r reference.fasta
723#51	Note, this will not actually run.  It needs all the BAM files to proceed.
724#51
725#51	Removed 0 loci (0 SNPs) with more than 20 SNPs at a locus
726#51	Building haplotypes for BR_024
727#51	Building haplotypes for BR_028
728#51	Building haplotypes for WL_054
729#51	Building haplotypes for BR_016
730#51	Building haplotypes for BR_009
731#51	Building haplotypes for BR_006
732#51	Building haplotypes for BR_041
733#51	Building haplotypes for BR_040
734#51	Building haplotypes for BR_046
735#51	Building haplotypes for BR_031
736#51	Building haplotypes for BR_025
737#51	Building haplotypes for BR_002
738#51	Building haplotypes for WL_058
739#51	Building haplotypes for WL_057
740#51	Building haplotypes for WL_061
741#51	Building haplotypes for WL_069
742#51	Building haplotypes for WL_070
743#51	Building haplotypes for BR_048
744#51	Building haplotypes for WL_031
745#51	Building haplotypes for WL_056
746#51	Building haplotypes for BR_047
747#51	Building haplotypes for WL_079
748#51	Building haplotypes for WL_080
749#51	Building haplotypes for WL_032
750#51	Building haplotypes for WL_071
751#51	Building haplotypes for WL_081
752#51	Building haplotypes for BR_004
753#51	Building haplotypes for BR_021
754#51	Building haplotypes for BR_015
755#51	Building haplotypes for BR_043
756#51	Building haplotypes for WL_066
757#51	Filtered 26 loci below missing data cutoff
758#51	Filtered 66 possible paralogs
759#51	Filtered 17 loci with low coverage or genotyping errors
760#51	Filtered 0 loci with an excess of haplotypes
761#52
762#52	The script found another 109 loci to remove from our file.  Besides this output to the terminal, the script outputs a file called stats.out
763#52	Let's symlink that file to our current directory.
764#c#52	$head stats.out
765#52
766#52	Locus		Sites	Haplotypes	Inds_Haplotyped	Total_Inds	Prop_Haplotyped	Status	Poss_Paralog	Low_Cov/Geno_Err	Miss_Geno	Comment
767#52	E10001_L101	1		2			30				31			0.968			PASSED	0				0					1
768#52	E10003_L101	7		9			30				31			0.968			PASSED	1				0					0
769#52	E10004_L101	-		-			-				-			-				FILTERED0				0					1			Complex
770#52	E10008_L101	1		2			30				31			0.968			PASSED	0				0					1
771#52	E10013_L142	3		6			30				31			0.968			PASSED	0				0					1
772#52	E10014_L117	2		3			31				31			1.000			PASSED	0				0					0
773#52	E10024_L101	1		2			31				31			1.000			PASSED	0				0					0
774#52	E10029_L101	1		2			31				31			1.000			PASSED	0				0					0
775#53
776#53	We can use this file to create a list of loci to filter
777#c#53	$grep FILTERED stats.out | mawk '!/Complex/' | cut -f1 > loci.to.remove
778#54	Now that we have the list we can parse through the VCF file and remove the bad RAD loci
779#54	I've made a simple script to do this remove.bad.hap.loci.sh
780#c#54	$curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/remove.bad.hap.loci.sh
781#c#54	$chmod +x remove.bad.hap.loci.sh
782#c#54	$./remove.bad.hap.loci.sh loci.to.remove SNP.DP3g95p5maf05.HWE.recode.vcf
783#54	This produces a FINAL FINAL FINAL filtered VCF file	SNP.DP3g95p5maf05.HWE.filtered.vcf
784#c#55	$mawk '!/#/' SNP.DP3g95p5maf05.HWE.filtered.vcf | wc -l
785#55	We're left with 7,666 SNPs!
786#56	How many possible errors?
787#c#56	$ErrorCount.sh SNP.DP3g95p5maf05.HWE.filtered.vcf
788#56	Potential genotyping errors from genotypes from only 1 read range from 0 to 0.0
789#56	Potential genotyping errors from genotypes from only 2 reads range from 0 to 0.0
790#56	Potential genotyping errors from genotypes from only 3 reads range from 302 to 1014.72
791#56	Potential genotyping errors from genotypes from only 4 reads range from 162 to 822.232
792#56	Potential genotyping errors from genotypes from only 5 reads range from 88 to 669
793#56	31 number of individuals and 7666 equals 237646 total genotypes
794#56	Total genotypes not counting missing data 237081
795#56	Total potential error rate is between 0.00232831816974 and 0.0105700245908
796#56	SCORCHED EARTH SCENARIO
797#56	WHAT IF ALL LOW DEPTH HOMOZYGOTE GENOTYPES ARE ERRORS?????
798#56	The total SCORCHED EARTH error rate is 0.0330857386294.
799#56
800#56
801#56	SCORCHED EARTH SCENARIO
802#56	WHAT IF ALL LOW DEPTH HOMOZYGOTE GENOTYPES ARE ERRORS?????
803#56	The total SCORCHED EARTH error rate is 0.0286723046571.
804#57
805#57	Congrats!  You've finished the Filtering Tutorial
806
807
808#!/bin/bash
809if which FilterTut &>/dev/null; then
810    LOC=$(which FilterTut 2>/dev/null)
811else
812	LOC="./FilterTut"
813fi
814if [[ -z "$1" ]]; then
815head -13 $LOC
816else
817PATTERN=#$1[[:blank:]]
818PATTERN2=#c#$1
819PATTERN4=^$1
820grep $PATTERN $LOC | sed 's/'$PATTERN2'\t/ /g' | sed 's/#'$1'\s/ /g'
821fi
822