# dDocent Filtering Tutorial
Designed by Jon Puritz

# GOALS
1. To learn how to use VCFtools to filter a VCF file for missing data, genotype depth, locus quality score, minor allele frequency, and genotype call depth
2. To learn how to use vcflib to filter FreeBayes VCF files generated with RAD data
3. To learn how to filter a VCF file for HWE within populations
4. To learn how to decompose a VCF into SNPs and INDELs
5. To learn how to use a haplotyping script to further filter SNPs for paralogs and genotyping errors

***

# Tutorial

Welcome to the SNP filtering exercise. For the first part of the exercise, the filtering steps should work on almost any VCF file. For the second part of the exercise, we are going to assume you are working with a VCF file that was generated by FreeBayes. Note that other SNP callers can be configured to include the same annotations.

Let's find our way back to your original working directory and make a new filtering directory

```bash
mkdir filtering
cd filtering
```

Now, let's download some data to look at.

```bash
curl -L -o data.zip https://www.dropbox.com/sh/bf9jxviaoq57s5v/AAD2Kv5SPpHlZ7LC7sBz4va8a?dl=1
unzip data.zip
ll
```

```bash
total 165620
-rwxr--r--. 1 jpuritz users   109633 Mar  6 14:56 BR_004-RG.bam
-rwxr--r--. 1 jpuritz users   247496 Mar  6 14:57 BR_004-RG.bam.bai
-rwxr--r--. 1 jpuritz users   120045 Mar  6 15:14 BR_006-RG.bam
-rwxr--r--. 1 jpuritz users   247712 Mar  6 15:14 BR_006-RG.bam.bai
-rw-r--r--. 1 jpuritz users 78979977 Mar  6 16:15 data.zip
drwxr-xr-x. 2 jpuritz users       21 Mar  6 16:16 __MACOSX
-rwxr--r--. 1 jpuritz users      399 Mar  6 15:08 popmap
-rwxr--r--. 1 jpuritz users 68264393 Mar  6 13:40 raw.vcf.gz
-rwxr--r--. 1 jpuritz users  6804314 Mar  6 14:49 reference.fasta
-rwxr--r--. 1 jpuritz users  1085387 Mar  6 14:49 reference.fasta.amb
-rwxr--r--. 1 jpuritz users  1068884 Mar  6 14:49 reference.fasta.ann
-rwxr--r--. 1 jpuritz users  6379720 Mar  6 14:49 reference.fasta.bwt
-rwxr--r--. 1 jpuritz users   353544 Mar  6 14:49 reference.fasta.clstr
-rwxr--r--. 1 jpuritz users   976388 Mar  6 14:49 reference.fasta.fai
-rwxr--r--. 1 jpuritz users  1594913 Mar  6 14:49 reference.fasta.pac
-rwxr--r--. 1 jpuritz users  3189872 Mar  6 14:49 reference.fasta.sa
-rwxr--r--. 1 jpuritz users   137209 Mar  6 14:30 stats.out
```

To start, we are going to use the program VCFtools (http://vcftools.sourceforge.net) to filter our vcf file. The program has a binary executable as well as several perl scripts that are useful for filtering. I find it much more useful to use version 0.1.11, since it has more useful filtering commands (I think). Let's load that version.

This raw.vcf file is going to have a lot of erroneous variant calls and a lot of variants that are only present in one individual. To make this file more manageable, let's start by applying a three-step filter. We are going to keep only variants that have been successfully genotyped in 50% of individuals, have a minimum quality score of 30, and have a minor allele count of 3.

```
vcftools --gzvcf raw.vcf.gz --max-missing 0.5 --mac 3 --minQ 30 --recode --recode-INFO-all --out raw.g5mac3
```

In this code, we call vcftools and feed it a gzipped vcf file with the `--gzvcf` flag. The `--max-missing 0.5` option removes sites that were genotyped in less than 50% of individuals, and the `--mac 3` flag removes SNPs that have a minor allele count of less than 3. The count is taken across genotypes, so the minor allele has to be called in at least 1 homozygote and 1 heterozygote, or in 3 heterozygotes. An optional way to check these counts for yourself is shown below.
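If you want to see the per-site allele counts that `--mac` operates on, VCFtools can report them directly. This is just an optional sanity check, not part of the filtering pipeline; `mac.check` is an arbitrary output prefix I chose, and the `.frq.count` suffix is the VCFtools naming convention.

```bash
# Optional: report per-site allele counts (the quantity --mac filters on).
# Writes mac.check.frq.count with one row per site: CHROM, POS, N_ALLELES,
# N_CHR, then each allele with its observed count across called genotypes.
vcftools --gzvcf raw.vcf.gz --counts --out mac.check
head -3 mac.check.frq.count
```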
Back to the filtering command: the `--recode` flag tells the program to write a new vcf file with the filters applied, and `--recode-INFO-all` keeps all the INFO flags from the old vcf file in the new one. Lastly, `--out` designates the name of the output.

The output will scroll through a lot of lines, but should end like:

```
After filtering, kept 40 out of 40 Individuals
After filtering, kept 78434 out of a possible 147540 Sites
Outputting VCF file... Done
Run Time = 40.00 seconds
```

Those simple filters got rid of about 50% of the data and will make the next filtering steps run much faster. We now have a filtered VCF called raw.g5mac3.recode.vcf. There is also a log file generated called raw.g5mac3.log.

The next filter we will apply is a minimum depth for a genotype call

```bash
vcftools --vcf raw.g5mac3.recode.vcf --minDP 3 --recode --recode-INFO-all --out raw.g5mac3dp3
```

This command will recode genotypes that have less than 3 reads as missing. I'll give you a second to take a deep breath. Yes, we are keeping genotypes with as few as 3 reads. We talked about this in the lecture portion of this course, but the short answer is that sophisticated multisample variant callers like FreeBayes and GATK can confidently call genotypes with few reads because variants are assessed across all samples simultaneously. So, the genotype is based on those three reads AND prior information from all reads from all individuals. Relax. We will do plenty of other filtering steps!

Don't believe me, do you? I've made a script to help evaluate the potential errors.

```bash
curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/ErrorCount.sh
chmod +x ErrorCount.sh
./ErrorCount.sh raw.g5mac3dp3.recode.vcf
```

```bash
This script counts the number of potential genotyping errors due to low read depth
It report a low range, based on a 50% binomial probability of observing the second allele in a heterozygote and a high range based on a 25% probability.
Potential genotyping errors from genotypes from only 1 read range from 0 to 0.0
Potential genotyping errors from genotypes from only 2 reads range from 0 to 0.0
Potential genotyping errors from genotypes from only 3 reads range from 15986 to 53714.22
Potential genotyping errors from genotypes from only 4 reads range from 6230 to 31502.04
Potential genotyping errors from genotypes from only 5 reads range from 2493 to 18914
40 number of individuals and 78434 equals 3137360 total genotypes
Total genotypes not counting missing data 2380094
Total potential error rate is between 0.0103815227466 and 0.0437504821238
SCORCHED EARTH SCENARIO
WHAT IF ALL LOW DEPTH HOMOZYGOTE GENOTYPES ARE ERRORS?????
The total SCORCHED EARTH error rate is 0.129149100834.
```

Right now, the maximum error rate for our VCF file due to genotypes with 5 or fewer reads is less than 5%. See, nothing to worry about.

The next step is to get rid of individuals that did not sequence well. We can do this by assessing individual levels of missing data.

```bash
vcftools --vcf raw.g5mac3dp3.recode.vcf --missing-indv
```

This will create an output called out.imiss. Let's examine it.
```bash cat out.imiss ``` ``` INDV N_DATA N_GENOTYPES_FILTERED N_MISS F_MISS BR_002 78434 0 13063 0.166548 BR_004 78434 0 16084 0.205064 BR_006 78434 0 25029 0.319109 BR_009 78434 0 30481 0.38862 BR_013 78434 0 69317 0.883762 BR_015 78434 0 8861 0.112974 BR_016 78434 0 29789 0.379797 BR_021 78434 0 17422 0.222123 BR_023 78434 0 43913 0.559872 BR_024 78434 0 24220 0.308795 BR_025 78434 0 21998 0.280465 BR_028 78434 0 26786 0.34151 BR_030 78434 0 74724 0.952699 BR_031 78434 0 26488 0.337711 BR_040 78434 0 19492 0.248515 BR_041 78434 0 17107 0.218107 BR_043 78434 0 16384 0.208889 BR_046 78434 0 28770 0.366805 BR_047 78434 0 13258 0.169034 BR_048 78434 0 24505 0.312428 WL_031 78434 0 22566 0.287707 WL_032 78434 0 22604 0.288191 WL_054 78434 0 32902 0.419486 WL_056 78434 0 34106 0.434837 WL_057 78434 0 37556 0.478823 WL_058 78434 0 31448 0.400949 WL_061 78434 0 35671 0.45479 WL_064 78434 0 47816 0.609634 WL_066 78434 0 10062 0.128286 WL_067 78434 0 47940 0.611215 WL_069 78434 0 38260 0.487799 WL_070 78434 0 21188 0.270138 WL_071 78434 0 16692 0.212816 WL_072 78434 0 46347 0.590904 WL_076 78434 0 78178 0.996736 WL_077 78434 0 55193 0.703687 WL_078 78434 0 54400 0.693577 WL_079 78434 0 19457 0.248068 WL_080 78434 0 30076 0.383456 WL_081 78434 0 30334 0.386746 ``` You can see that some individuals have as high as 99.6% missing data. We definitely want to filter those out. Let's take a look at a histogram ```bash mawk '!/IN/' out.imiss | cut -f5 > totalmissing gnuplot << \EOF set terminal dumb size 120, 30 set autoscale unset label set title "Histogram of % missing data per individual" set ylabel "Number of Occurrences" set xlabel "% of missing data" #set yr [0:100000] binwidth=0.01 bin(x,width)=width*floor(x/width) + binwidth/2.0 plot 'totalmissing' using (bin($1,binwidth)):(1.0) smooth freq with boxes pause -1 EOF ``` ```bash Histogram of % missing data per individual Number of Occurrences 3 ++----------+---------***---------***-----------+------------+-----------+-----------+-----------+----------++ + + * * * * + 'totalmissing' using (bin($1,binwidth)):(1.0) ****** + | * * * * | | * * * * | | * * * * | | * * * * | 2.5 ++ * * * * ++ | * * * * | | * * * * | | * * * * | | * * * * | | * * * * | 2 ++ *************** * **** * * ++ | * * * ** * * ** * * * | | * * * ** * * ** * * * | | * * * ** * * ** * * * | | * * * ** * * ** * * * | | * * * ** * * ** * * * | 1.5 ++ * * * ** * * ** * * * ++ | * * * ** * * ** * * * | | * * * ** * * ** * * * | | * * * ** * * ** * * * | | * * * ** * * ** * * * | + * * +* ** * * ** * * * + + + + + + 1 +************************************************************************************************************* 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 % of missing data ``` Most of the individuals have less than 0.5 missing data. That is probably a good cutoff to use for the moment. Now we need to create a list of individuals with more than 50% missing data. Anyone have any ideas? We can use mawk to do it. ```bash mawk '$5 > 0.5' out.imiss | cut -f1 > lowDP.indv ``` Take a moment to think about what this code is doing. Now that we have a list of individuals to remove, we can feed that directly into VCFtools for filtering. ```bash vcftools --vcf raw.g5mac3dp3.recode.vcf --remove lowDP.indv --recode --recode-INFO-all --out raw.g5mac3dplm ``` As you can see from the output, this removed 9 individuals. I've included a script called filter_missing_ind.sh that will automate this process for you in the future. Try it out. 
```bash
curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/filter_missing_ind.sh
chmod +x filter_missing_ind.sh
./filter_missing_ind.sh raw.g5mac3dp3.recode.vcf raw.g5mac3dplm
```

The command always follows the structure: `filter_missing_ind.sh vcf_to_filter name_prefix_for_new_vcf`

The script prints out a histogram like the one above and also calculates the 85th percentile for missing data. For this exercise, enter "no" at the prompt.

Now that we have removed poor coverage individuals, we can restrict the data to variants called in a high percentage of individuals, apply a minor allele frequency filter, and filter by mean depth of genotypes

```bash
vcftools --vcf raw.g5mac3dplm.recode.vcf --max-missing 0.95 --maf 0.05 --recode --recode-INFO-all --out DP3g95maf05 --min-meanDP 20
```

This leaves us with 12,754 loci in our filtered vcf file. This applied a genotype call rate (95%) across all individuals. With two localities, this is sufficient, but when you have multiple localities being sampled, you are also going to want to filter by a population-specific call rate. VCFtools won't calculate this directly, but there is an easy workaround. First, we need a file to define localities (populations). Most programs want the file to have two tab-separated columns: the first with the sample name, the second with the population assignment. I've already made one for this exercise.

```
cat popmap
```

```
BR_002	BR
BR_004	BR
BR_006	BR
BR_009	BR
BR_013	BR
BR_015	BR
BR_016	BR
BR_021	BR
BR_023	BR
BR_024	BR
BR_025	BR
BR_028	BR
BR_030	BR
BR_031	BR
BR_040	BR
BR_041	BR
BR_043	BR
BR_046	BR
BR_047	BR
BR_048	BR
WL_031	WL
WL_032	WL
WL_054	WL
WL_056	WL
WL_057	WL
WL_058	WL
WL_061	WL
WL_064	WL
WL_066	WL
WL_067	WL
WL_069	WL
WL_070	WL
WL_071	WL
WL_072	WL
WL_076	WL
WL_077	WL
WL_078	WL
WL_079	WL
WL_080	WL
WL_081	WL
```

Now we need to create two lists that have just the individual names for each population

```bash
mawk '$2 == "BR"' popmap > 1.keep && mawk '$2 == "WL"' popmap > 2.keep
```

The above line uses && to chain two commands together, running the second as soon as the first completes successfully. Next, we use VCFtools to estimate missing data for loci in each population

```bash
vcftools --vcf DP3g95maf05.recode.vcf --keep 1.keep --missing-site --out 1
vcftools --vcf DP3g95maf05.recode.vcf --keep 2.keep --missing-site --out 2
```

This will generate files named 1.lmiss and 2.lmiss. They follow this format

```
head -3 1.lmiss
```

```bash
CHR	POS	N_DATA	N_GENOTYPE_FILTERED	N_MISS	F_MISS
E1_L101	9	34	0	0	0
E1_L101	15	34	0	0	0
```

I added extra tabs to make this easier to read, but what we are interested in is the last column, which is the proportion of missing data for that locus. We can combine the two files and make a list of loci above the threshold of 10% missing data to remove. Note that this is double the overall rate of missing data we just allowed (5%).

```bash
cat 1.lmiss 2.lmiss | mawk '!/CHR/' | mawk '$6 > 0.1' | cut -f1,2 >> badloci
```

Who can walk us through that line of code?

We then feed this file back into VCFtools to remove any of the loci

```bash
vcftools --vcf DP3g95maf05.recode.vcf --exclude-positions badloci --recode --recode-INFO-all --out DP3g95p5maf05
```

Again, we only had two populations, so our overall filter caught all of these. However, this will not be the case in multi-locality studies.

I have also made a script to automate this process, called pop_missing_filter.sh. Executing it with no parameters will give you the usage.
```bash
curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/pop_missing_filter.sh
chmod +x pop_missing_filter.sh
./pop_missing_filter.sh
```

```
Usage is pop_missing_filter vcffile popmap percent_missing_per_pop number_of_pops_for_cutoff name_for_output
```

***

**From this point forward, the filtering steps assume that the vcf file was generated by FreeBayes**

Note that other SNP callers can be configured to include similar annotations. FreeBayes outputs a lot of information about a locus in the VCF file; using this information and the properties of RADseq, we can add some sophisticated filters to the data.

Let's take a look at the header of our VCF file and take a quick look at all the information.

```bash
mawk '/#/' DP3g95maf05.recode.vcf
```

This will output many lines of INFO and FORMAT tags. A few of the important ones (the tags used by the filters below) are:

```
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
##INFO=<ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">
##INFO=<ID=SAF,Number=A,Type=Integer,Description="Number of alternate observations on the forward strand">
##INFO=<ID=SAR,Number=A,Type=Integer,Description="Number of alternate observations on the reverse strand">
##INFO=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
##INFO=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
##INFO=<ID=MQM,Number=A,Type=Float,Description="Mean mapping quality of observed alternate alleles">
##INFO=<ID=MQMR,Number=1,Type=Float,Description="Mean mapping quality of observed reference alleles">
##INFO=<ID=PAIRED,Number=A,Type=Float,Description="Proportion of observed alternate alleles which are supported by properly paired read fragments">
##INFO=<ID=PAIREDR,Number=1,Type=Float,Description="Proportion of observed reference alleles which are supported by properly paired read fragments">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
```

The first filter we will apply will be on allele balance. Allele balance is: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous. Because RADseq targets specific locations of the genome, we expect that the allele balance in our data (for real loci) should be close to 0.5.

We can use the vcffilter program from vcflib (https://github.com/ekg/vcflib). Typing it with no parameters will give you the usage.

```bash
vcffilter
```

```
usage: vcffilter [options] <vcf file>

options:
    -f, --info-filter     specifies a filter to apply to the info fields of records, removes alleles which do not pass the filter
    -g, --genotype-filter specifies a filter to apply to the genotype fields of records
    -k, --keep-info       used in conjunction with '-g', keeps variant info, but removes genotype
    -s, --filter-sites    filter entire records, not just alleles
    -t, --tag-pass        tag vcf records as positively filtered with this tag, print all records
    -F, --tag-fail        tag vcf records as negatively filtered with this tag, print all records
    -A, --append-filter   append the existing filter tag, don't just replace it
    -a, --allele-tag      apply -t on a per-allele basis. adds or sets the corresponding INFO field tag
    -v, --invert          inverts the filter, e.g. grep -v
    -o, --or              use logical OR instead of AND to combine filters
    -r, --region          specify a region on which to target the filtering, requires a BGZF compressed file which has been indexed with tabix. any number of regions may be specified.
```

Let's use our first filter

```bash
vcffilter -s -f "AB > 0.25 & AB < 0.75 | AB < 0.01" DP3g95p5maf05.recode.vcf > DP3g95p5maf05.fil1.vcf
```

vcffilter works with simple conditional statements, so this filters out loci with an allele balance below 0.25 or above 0.75. However, it does keep loci with an allele balance close to zero: the last condition catches loci that are fixed variants (all individuals are homozygous for one of the two variants). The `-s` tells the filter to apply to sites, not just alleles.

To see how many loci are now in the VCF file, you could feed it into VCFtools, or you can just use a simple mawk statement

```bash
mawk '!/#/' DP3g95p5maf05.recode.vcf | wc -l
12754
```

```bash
mawk '!/#/' DP3g95p5maf05.fil1.vcf | wc -l
9678
```

You'll notice that we've filtered a lot of loci. In my experience though, I find that most of these tend to be errors of some kind. However, this will be data dependent. I encourage you to explore your own data sets.
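Before moving on, it can be instructive to look at the AB values themselves. This is an optional sketch in the same spirit as the DP extraction used later in the tutorial; the file name `AB.kept` and the exact grep pattern are my own additions, and the pattern only grabs the first AB value at multiallelic sites.

```bash
# Pull the AB values out of the INFO field of the kept loci and average them;
# after this filter, values should cluster near the expected 0.5 (or near 0
# for the fixed variants we deliberately kept with "AB < 0.01")
mawk '!/#/' DP3g95p5maf05.fil1.vcf | cut -f8 | grep -oe "AB=[0-9.]*" | sed 's/AB=//g' > AB.kept
mawk '{ sum += $1; n++ } END { if (n > 0) print "Mean AB of kept loci:", sum / n }' AB.kept
```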
The next filter we will apply removes sites that have reads from both strands. For GWAS and even RNAseq data, coverage from both strands can be a good thing. With RADseq, however, unless you are using super small genomic fragments or really long reads (MiSeq), a SNP should be covered only by forward or only by reverse reads.

```bash
vcffilter -f "SAF / SAR > 100 & SRF / SRR > 100 | SAR / SAF > 100 & SRR / SRF > 100" -s DP3g95p5maf05.fil1.vcf > DP3g95p5maf05.fil2.vcf
```

The filter is based on proportions, so that a few extraneous reads won't remove an entire locus. In plain English, it keeps loci that have over 100 times more forward alternate reads than reverse alternate reads and over 100 times more forward reference reads than reverse reference reads, along with the reciprocal.

```bash
mawk '!/#/' DP3g95p5maf05.fil2.vcf | wc -l
9491
```

That only removes a small proportion of loci, but these loci are likely to be paralogs, microbial contamination, or weird PCR chimeras. The program SAMtools is a great way to visualize alignments right from the terminal.

```bash
samtools tview BR_006-RG.bam reference.fasta -p E28188_L151
```

```bash
        11        21        31        41        51        61        71        81        91        101       111       121       131
AATTCTCAGAGCTAGAGTGGGGACGGCAGTTGGTAGAGGGTACAGCAGTTCTAAAAACATGTAGAAATTTTCTCTTCAACTCGCTCCTACGGCCACAGCGTTCACTCCACATACACAAATTGTACACCAAAACATAGGAAAAG
...........S...........Y.K......S.........G.......K.........S............................Y........Y....W.........................M...G.........
..........................................G.......G......................................T.....
..........................................G.......T............................................
...........G...........T.T......C.........G.................C...............................
..........................................G.......T....................................G.......
..........................................G.......T............................................
...........G...........T.T......C.........G.................C..................................
..........................................G.......T............................................
...........G.................C............G.......G.................******........A...............T..
,,,,,,g,,,,,,,,,,,,,,,,,******,,,,,,,,a,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,g,,,,,,,,,
t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
,,,,,,,,,,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,
g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,
g,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,g,,,,,,,,,
,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,g,,,,,,,,,
t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,a,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
,,,,,,,,,,c,g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,
g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
g,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,,
```

As you can see, this is a mess. There appear to be more than two haplotypes here. For more info on how to use samtools tview, press the question mark while you are in the window.

The next filter looks at the ratio of mapping qualities between reference and alternate alleles

```bash
vcffilter -f "MQM / MQMR > 0.9 & MQM / MQMR < 1.05" DP3g95p5maf05.fil2.vcf > DP3g95p5maf05.fil3.vcf
```

The rationale here is that, again, because RADseq loci and alleles all should start from the same genomic location, there should not be a large discrepancy between the mapping qualities of the two alleles.

```bash
mawk '!/#/' DP3g95p5maf05.fil3.vcf | wc -l
9229
```

This filters away less than 3% of the variants, but those variants likely need to be filtered. Let's take a look at one

```bash
samtools tview BR_004-RG.bam reference.fasta -p E20_L173
```

```bash
 1        11        21        31        41        51        61        71        81        91        101       111       121       131
NAATTCATCTGTTGCAGGCAGCTCACACTTGCAGCCTCGGCTCGCACCAGCAGAGCAGCCGTAGAATACTTAGTTTAATAGAATGGCTTGGCATTTNNNNNNNNNNCATGAGGTTGTTATTCTCAGAAGACTAATCACAGACA
.......Y.........YM....WS...Y....S...R....R....................................................C
....................G................
.......T.........TC....TG...C....G...A....A..................................
,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,
.......T.........TC.....G...C....G...A....A..................................
,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,
.......T.........TC....TG...C....G...A....A..................................
,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,
.......T.........TC.....G...C....G...A....A..................................
,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,
.......T.........TC.....G...C....G...A....A..................................
,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,
.......T.........TC....TG...C....G...A....A..................................
.......T.........TC.....G...C....G...A....A..................................
.......T.........TC.....G...C....G...A....A..................................
.......T.........TC....TG...C....G...A....A..................................
.......T.........TC....TG...C....G...A....A..................................
..........C....................................................................................C
.......T.........TC.....G...C....G...A....A..................................
..........C....................................................................................C
.......T.........TC.....G...C....G...A....A..................................
......GT.........TC.....G...C....G...A....A..................................
.......T.........TC....TG...C....G...A....A..................................
```

There is a large amount of clipping going on here for the variant alleles, which is likely why their mapping quality is low. You can also see that there are three different alleles present here. Press SHIFT+L to scroll further down the alignment. You can see that some of the polymorphism is also linked to a cut site variant. All things that should be avoided.

Yet another filter that can be applied checks whether there is a discrepancy in the properly paired status of reads supporting the reference versus the alternate alleles.

```bash
vcffilter -f "PAIRED > 0.05 & PAIREDR > 0.05 & PAIREDR / PAIRED < 1.75 & PAIREDR / PAIRED > 0.25 | PAIRED < 0.05 & PAIREDR < 0.05" -s DP3g95p5maf05.fil3.vcf > DP3g95p5maf05.fil4.vcf
```

Since de novo assembly is not perfect, some loci will only have unpaired reads mapping to them. This is not a problem. The problem occurs when all the reads supporting the reference allele are properly paired but the reads supporting the alternate allele are not. That is indicative of a problem.

```bash
mawk '!/#/' DP3g95p5maf05.fil4.vcf | wc -l
9166
```

Our loci count keeps dwindling, but our signal-to-noise ratio keeps increasing. Let's look at an example of what we filtered.

```bash
samtools tview BR_006-RG.bam reference.fasta -p E4407_L138
```

This output doesn't paste well to the terminal, but you can see the clear discrepancy between properly paired status and allele status. This could be indicative of cut site polymorphism or paralogs.

The next filter we will apply looks at the ratio of locus quality score to depth. Heng Li found some interesting results about how quality score and locus depth are related to each other in real and spurious variant calls. See his preprint here (http://arxiv.org/pdf/1404.0929.pdf). Also see this great blog post about it here (http://bcb.io/2014/05/12/wgs-trio-variant-evaluation/). I **REALLY** recommend following that blog. Brad Chapman's group is really good.

In short, with whole genome samples, it was found that high coverage can lead to inflated locus quality scores. Heng proposed that, for sites with read depth greater than the mean depth plus 2-3 times the square root of the mean depth, real variants should have a quality score about twice as large as the depth, while false variants fall below that value. I actually found that this is a little too conservative for RADseq data, likely because the reads aren't randomly distributed across contigs. I implement two filters based on this idea. The first removes any locus that has a quality score below 1/4 of its depth.

```bash
vcffilter -f "QUAL / DP > 0.25" DP3g95p5maf05.fil4.vcf > DP3g95p5maf05.fil5.vcf
```

The second is more complicated.
The first step is to create a list of the depth of each locus

```bash
cut -f8 DP3g95p5maf05.fil5.vcf | grep -oe "DP=[0-9]*" | sed -s 's/DP=//g' > DP3g95p5maf05.fil5.DEPTH
```

The second step is to create a list of quality scores.

```bash
mawk '!/#/' DP3g95p5maf05.fil5.vcf | cut -f1,2,6 > DP3g95p5maf05.fil5.vcf.loci.qual
```

The next step is to calculate the mean depth

```bash
mawk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }' DP3g95p5maf05.fil5.DEPTH
1952.82
```

Now calculate the mean plus 3X the square root of the mean

```bash
python -c "print int(1952+3*(1952**0.5))"
2084
```

Next we paste the depth and quality files together and find the loci above the depth cutoff that do not have quality scores at least 2 times the depth

```bash
paste DP3g95p5maf05.fil5.vcf.loci.qual DP3g95p5maf05.fil5.DEPTH | mawk -v x=2084 '$4 > x' | mawk '$3 < 2 * $4' > DP3g95p5maf05.fil5.lowQDloci
```

Now we can remove those sites and recalculate the depth across loci with VCFtools

```bash
vcftools --vcf DP3g95p5maf05.fil5.vcf --site-depth --exclude-positions DP3g95p5maf05.fil5.lowQDloci --out DP3g95p5maf05.fil5
```

Now let's take the VCFtools output and cut it down to just the depth scores

```bash
cut -f3 DP3g95p5maf05.fil5.ldepth > DP3g95p5maf05.fil5.site.depth
```

Now let's calculate the average depth per site by dividing each total depth by the number of individuals (31)

```bash
mawk '!/D/' DP3g95p5maf05.fil5.site.depth | mawk -v x=31 '{print $1/x}' > meandepthpersite
```

Let's plot the data as a histogram

```bash
gnuplot << \EOF
set terminal dumb size 120, 30
set autoscale
set xrange [10:150]
unset label
set title "Histogram of mean depth per site"
set ylabel "Number of Occurrences"
set xlabel "Mean Depth"
binwidth=1
bin(x,width)=width*floor(x/width) + binwidth/2.0
set xtics 5
plot 'meandepthpersite' using (bin($1,binwidth)):(1.0) smooth freq with boxes
pause -1
EOF
```

```bash
                                            Histogram of mean depth per site

  Number of Occurrences
  250 ++--+---+---+---+--+---+---+---+---+---+---+---+---+---+--+---+---+---+---+---+---+---+---+--+---+---+---+--++
      +   +   +   +   +  +   +   +   +   +  'meandepthpersite' using (bin($1,binwidth)):(1.0)  ******               +
      |                    **    **                                                                                 |
      |                   ***    **                                                                                 |
      |                  ********                                                                                   |
  200 ++                 ********* *                                                                               ++
      |                **  ********** *                                                                             |
      |              *************  ** *                                                                            |
      |             ****************  * *                                                                           |
      |            *******************  **                                                                          |
  150 ++           *******************   ***                                                                      ++
      |          *************************                                                                          |
      |         ***************************                                                                         |
      |        *******************************                                                                      |
  100 ++      **********************************                                                                  ++
      |       **********************************                                                                    |
      |      **************************************  *                                                              |
      |      ************************************** *  **                                                           |
      |     *********************************************                                                           |
   50 ++    ************************************************                                                      ++
      |    ************************************************  **  **                                                 |
      |    ***********************************************************  **                                          |
      |   ****************************************************************  **   **  **                             |
      +   + ************************************************************************************** ******+********  +
    0 ++--+---****************************************************************************************************++
      10  15  20  25  30  35 40  45  50  55  60  65  70  75  80 85  90  95 100 105 110 115 120 125130 135 140 145 150
                                                      Mean Depth
```

Loci that have high mean depth are indicative of either paralogs or multicopy loci. Either way, we want to remove them. Here, I'd remove all loci above a mean depth of 102.5.
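Eyeballing the histogram works, but you can also derive a cutoff numerically from the same file. This is just a sketch of one possible convention (mean plus one standard deviation); it is my own assumption, not the tutorial's rule, so adjust it to your data.

```bash
# Compute mean + 1 SD of mean depth per site as a candidate cutoff
mawk '{ sum += $1; sumsq += $1 * $1; n++ } END { if (n > 1) print "Candidate cutoff:", sum / n + sqrt((sumsq - sum * sum / n) / (n - 1)) }' meandepthpersite
```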
Now we can combine both filters to produce another VCF file

```bash
vcftools --vcf DP3g95p5maf05.fil5.vcf --recode-INFO-all --out DP3g95p5maf05.FIL --max-meanDP 102.5 --exclude-positions DP3g95p5maf05.fil5.lowQDloci --recode
```

In the end, VCFtools kept 8417 out of a possible 9164 sites.

BTW, I've also written a script to automate all of the filtering steps described above. It's called dDocent_filters. It will go through the filtering steps and record a log file for each of the steps, including the depth histogram.

```bash
curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/dDocent_filters
chmod +x dDocent_filters
./dDocent_filters
```

```
This script will automatically filter a FreeBayes generated VCF file using criteria related to site depth,
quality versus depth, strand representation, allelic balance at heterzygous individuals, and paired read representation.
The script assumes that loci and individuals with low call rates (or depth) have already been removed.

Contact Jon Puritz (jpuritz@gmail.com) for questions and see script comments for more details on particular filters

Usage is sh FB_filters.sh VCF_file Output_prefix
```

The next filter to apply is HWE. Heng Li also found that HWE is another excellent filter to remove erroneous variant calls. We don't want to apply it across the board, though, since population structure will create departures from HWE as well. We need to apply this by population. I've included a perl script written by Chris Hollenbeck, one of the PhD students in my current lab, that will do this for us.

```bash
curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/filter_hwe_by_pop.pl
chmod +x filter_hwe_by_pop.pl
```

```bash
./filter_hwe_by_pop.pl

Usage:
    filter_hwe_by_pop.pl -v <vcffile> -p <popmap> [options]

    Options: -v <vcffile> input vcf file -p <popmap> tab-separated file of
    samples and population designations -h [hwe] minimum Hardy-Weinberg
    p-value cutoff for SNPs -c [cutoff] proportion of all populations that a
    locus can be below HWE cutoff without being filtered -o [out] name of
    outfile

Options:
    -v, --vcffile
            VCF input file
    -p, --popmap
            File with names of individuals and population designations, one
            per line
    -h, --hwe
            Minimum cutoff for Hardy-Weinberg p-value (for test as
            implemented in vcftools) [Default: 0.001]
    -c, --cutoff
            Proportion of all populations that a locus can be below HWE
            cutoff without being filtered. For example, choosing 0.5 will
            filter SNPs that are below the p-value threshold in 50% or more
            of the populations. [Default: 0.25]
    -o, --out
            Name of outfile, by vcftools conventions (will be named
            X.recode.vcf)
```

Let's filter our SNPs by population-specific HWE. First, we need to convert our variant calls to SNPs. To do this we will use another command from vcflib called vcfallelicprimitives

```bash
vcfallelicprimitives DP3g95p5maf05.FIL.recode.vcf --keep-info --keep-geno > DP3g95p5maf05.prim.vcf
```

This will decompose complex variant calls into phased SNP and INDEL genotypes and keep the INFO flags for loci and genotypes. Next, we can feed this VCF file into VCFtools to remove indels.

```bash
vcftools --vcf DP3g95p5maf05.prim.vcf --remove-indels --recode --recode-INFO-all --out SNP.DP3g95p5maf05
```

We now have 8379 SNP calls in our new VCF.
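As a quick check, you can count the records directly with the same mawk idiom we've been using; the count should match what VCFtools just reported.

```bash
# Count non-header lines; should print 8379, the number of SNPs kept above
mawk '!/#/' SNP.DP3g95p5maf05.recode.vcf | wc -l
```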
Now, let's apply the HWE filter

```bash
./filter_hwe_by_pop.pl -v SNP.DP3g95p5maf05.recode.vcf -p popmap -o SNP.DP3g95p5maf05.HWE -h 0.01
```

```bash
Processing population: BR (20 inds)
Processing population: WL (20 inds)
Outputting results of HWE test for filtered loci to 'filtered.hwe'
Kept 8176 of a possible 8379 loci (filtered 203 loci)
```

*Note, I would not normally use such a high `-h` value. It's purely for this example. Typically, errors would have a low p-value and would be present in many populations.*

We have now created a thoroughly filtered VCF, and we should have confidence in these SNP calls. However, our lab is currently developing one more script, called rad_haplotyper. This tool takes a VCF file of SNPs and will parse through BAM files looking to link SNPs into haplotypes along paired reads.

```bash
curl -L -O https://raw.githubusercontent.com/jpuritz/WinterSchool.2016/master/rad_haplotyper.pl
chmod +x rad_haplotyper.pl
```

Note, this script requires several Perl libraries. See the README [here](https://github.com/jpuritz/NTU-Workshop/blob/master/scripts/README.md)

It has a lot of options; let's take a look.

```bash
perl rad_haplotyper.pl

Usage:
    perl rad_haplotyper.pl -v <vcffile> -r <reference> [options]

    Options: -v <vcffile> input vcf file -r <reference> reference genome
    -s [samples] optionally specify an individual sample to be haplotyped
    -u [snp_cutoff] remove loci with more than a specified number of SNPs
    -h [hap_cutoff] remove loci with more than a specified number of
    haplotypes relative to SNPs -m [miss_cutoff] cutoff for proportion of
    missing data for loci to be included in the output
    -mp [max_paralog_inds] cutoff for excluding possible paralogs
    -ml [max_low_cov_inds] cutoff for excluding loci with low coverage or
    genotyping errors -d [depth] sampling depth used by the algorithm to
    build haplotypes -g [genepop] genepop file for population output
    -p [popmap] population map for organizing Genepop file -t [tsvfile] tsv
    file for linkage map output -a [imafile] IMa file output -p1 [parent1]
    first parent in the mapping cross -p2 [parent2] second parent in the
    mapping cross -x [threads] number of threads to use for the analysis
    -n use indels -e debug

Options:
    -v, --vcffile
            VCF input file
    -r, --reference
            Reference genome (FASTA format)
    -s, --samples
            Individual samples to use in the analysis - can be used multiple
            times for multiple individuals [Default: All]
    -u, --cutoff
            Excludes loci with more than the specified number of SNPs
            [Default: No filter]
    -h, --hap_count
            Excludes loci with more than the specified number of haplotypes
            relative to number of SNPs. Excluding forces other than mutation
            (i.e. recombination) the maximum number of haplotypes should be
            one more than the number of SNPs at the locus. The value
            provided is the number of haplotypes allowed in excess of the
            number of SNPs, which allows that mechanisms other than mutation
            may have influenced the number of haplotypes in the population.
            [Default: 100]
    -x, --threads
            Run in parallel across individuals with a specified number of
            threads
    -n, --indels
            Includes indels that are the only polymorphism at the locus
            (tag)
    -d, --depth
            Specify a depth of sampling reads for building haplotypes
            [Default: 20]
    -m, --miss_cutoff
            Proportion of missing data cutoff for removing loci from the
            final output. For example, to keep only loci with successful
            haplotype builds in 95% of individuals, enter 0.95. [Default:
            0.9]
    -mp, --max_paralog_inds
            Count cutoff for removing loci that are possible paralogs from
            the final output.
            The value is the maximum allowable number of individuals with
            more than the expected number of haplotypes [Default: No filter]
    -ml, --max_low_cov_inds
            Count cutoff for removing loci with low coverage or genotyping
            errors from the final output. The value is the maximum allowable
            number of individuals with less than the expected number of
            haplotypes [Default: No filter]
    -g, --genepop
            Writes a genepop file using haplotypes. Must provide the name of
            the genepop file.
    -a, --ima
            Writes a IMa file using haplotypes. Must provide the name of the
            IMa file.
    -p, --popmap
            Tab-separated file of individuals and their population
            designation, one per line (required for Genepop output)
    -t, --tsvfile
            Writes a tsv file using haplotypes - for mapping crosses only.
            Must provide the name of the tsv file.
    -p1, --parent1
            Parent 1 of the mapping cross (must be specified if writing a
            tsv file)
    -p2, --parent2
            Parent 2 of the mapping cross (must be specified if writing a
            tsv file)
    -e, --debug
            Output extra logs for debugging purposes
```

We don't have enough time to go into depth with all these options, and this tool is still under development. It also takes some substantial resources to run. I will simulate running this for you.

```bash
#rad_haplotyper.pl -v SNP.DP3g95p5maf05.HWE.recode.vcf -x 40 -mp 1 -u 20 -ml 4 -n -r reference.fasta
```

Note, this will not actually run. It needs all the BAM files to proceed. Here's the simulated output:

```bash
Removed 0 loci (0 SNPs) with more than 20 SNPs at a locus
Building haplotypes for BR_024
Building haplotypes for BR_028
Building haplotypes for WL_054
Building haplotypes for BR_016
Building haplotypes for BR_009
Building haplotypes for BR_006
Building haplotypes for BR_041
Building haplotypes for BR_040
Building haplotypes for BR_046
Building haplotypes for BR_031
Building haplotypes for BR_025
Building haplotypes for BR_002
Building haplotypes for WL_058
Building haplotypes for WL_057
Building haplotypes for WL_061
Building haplotypes for WL_069
Building haplotypes for WL_070
Building haplotypes for BR_048
Building haplotypes for WL_031
Building haplotypes for WL_056
Building haplotypes for BR_047
Building haplotypes for WL_079
Building haplotypes for WL_080
Building haplotypes for WL_032
Building haplotypes for WL_071
Building haplotypes for WL_081
Building haplotypes for BR_004
Building haplotypes for BR_021
Building haplotypes for BR_015
Building haplotypes for BR_043
Building haplotypes for WL_066
Filtered 26 loci below missing data cutoff
Filtered 66 possible paralogs
Filtered 17 loci with low coverage or genotyping errors
Filtered 0 loci with an excess of haplotypes
```

The script found another 109 loci to remove from our file.
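The four "Filtered" lines account for that total; a quick sanity check in the shell:

```bash
# 26 missing-data loci + 66 possible paralogs + 17 low-coverage/error loci + 0 excess-haplotype loci
echo $((26 + 66 + 17 + 0))
# 109
```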
Besides this output to the terminal, the script outputs a file called stats.out

```bash
head stats.out

Locus		Sites	Haplotypes	Inds_Haplotyped	Total_Inds	Prop_Haplotyped	Status	Poss_Paralog	Low_Cov/Geno_Err	Miss_Geno	Comment
E10001_L101	1	2	30	31	0.968	PASSED	0	0	1
E10003_L101	7	9	30	31	0.968	PASSED	1	0	0
E10004_L101	-	-	-	-	-	FILTERED	0	0	1	Complex
E10008_L101	1	2	30	31	0.968	PASSED	0	0	1
E10013_L142	3	6	30	31	0.968	PASSED	0	0	1
E10014_L117	2	3	31	31	1.000	PASSED	0	0	0
E10024_L101	1	2	31	31	1.000	PASSED	0	0	0
E10029_L101	1	2	31	31	1.000	PASSED	0	0	0
```

We can use this file to create a list of loci to filter

```bash
grep FILTERED stats.out | mawk '!/Complex/' | cut -f1 > loci.to.remove
```

Now that we have the list, we can parse through the VCF file and remove the bad RAD loci. I've made a simple script to do this, called remove.bad.hap.loci.sh

```bash
curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/remove.bad.hap.loci.sh
chmod +x remove.bad.hap.loci.sh
./remove.bad.hap.loci.sh loci.to.remove SNP.DP3g95p5maf05.HWE.recode.vcf
```

This produces a FINAL FINAL FINAL filtered VCF file: SNP.DP3g95p5maf05.HWE.filtered.vcf

```bash
mawk '!/#/' SNP.DP3g95p5maf05.HWE.filtered.vcf | wc -l
```

We're left with 7,666 SNPs! How many possible errors?

```bash
./ErrorCount.sh SNP.DP3g95p5maf05.HWE.filtered.vcf
```

```bash
This script counts the number of potential genotyping errors due to low read depth
It report a low range, based on a 50% binomial probability of observing the second allele in a heterozygote and a high range based on a 25% probability.
Potential genotyping errors from genotypes from only 1 read range from 0 to 0.0
Potential genotyping errors from genotypes from only 2 reads range from 0 to 0.0
Potential genotyping errors from genotypes from only 3 reads range from 302 to 1014.72
Potential genotyping errors from genotypes from only 4 reads range from 162 to 822.232
Potential genotyping errors from genotypes from only 5 reads range from 88 to 669
31 number of individuals and 7666 equals 237646 total genotypes
Total genotypes not counting missing data 237081
Total potential error rate is between 0.00232831816974 and 0.0105700245908
SCORCHED EARTH SCENARIO
WHAT IF ALL LOW DEPTH HOMOZYGOTE GENOTYPES ARE ERRORS?????
The total SCORCHED EARTH error rate is 0.0330857386294.
```

**Congrats! You've finished the Filtering Tutorial**