1#dDocent Filtering Tutorial 2Designed by Jon Puritz 3 4# GOALS 51. To learn how to use VCFtools to filter a VCF file for missing data, genotype depth, locus quality score, minor allele frequency, and genotype call depth 62. To learn how to use vcflib to filter FreeBayes VCF files generated with RAD data 73. To filter a VCF file for HWE within populations 84. How to decompose a VCF into SNPs and INDELs and 95. How to use a haplotyping script to further filter SNPs for paralogs and genotyping errors. 10 11*** 12 13# Tutorial 14Welcome to the SNP filtering exercise. For the first part of the exercise, the filtering steps should work on almost any VCF file. 15For the second part of the exercise, we are going to assume you are working with a VCF file that was generated by 16FreeBayes. Note that other SNP callers can be configured to include the same annotations. 17Let's find our way back to your original working directory and make a new filtering directory 18 19```bash 20mkdir filtering 21cd filtering 22``` 23 24Now, let's download some data to look at. 25 26```bash 27curl -L -o data.zip https://www.dropbox.com/sh/bf9jxviaoq57s5v/AAD2Kv5SPpHlZ7LC7sBz4va8a?dl=1 28unzip data.zip 29ll 30``` 31 32```bash 33total 165620 34-rwxr--r--. 1 jpuritz users 109633 Mar 6 14:56 BR_004-RG.bam 35-rwxr--r--. 1 jpuritz users 247496 Mar 6 14:57 BR_004-RG.bam.bai 36-rwxr--r--. 1 jpuritz users 120045 Mar 6 15:14 BR_006-RG.bam 37-rwxr--r--. 1 jpuritz users 247712 Mar 6 15:14 BR_006-RG.bam.bai 38-rw-r--r--. 1 jpuritz users 78979977 Mar 6 16:15 data.zip 39drwxr-xr-x. 2 jpuritz users 21 Mar 6 16:16 __MACOSX 40-rwxr--r--. 1 jpuritz users 399 Mar 6 15:08 popmap 41-rwxr--r--. 1 jpuritz users 68264393 Mar 6 13:40 raw.vcf.gz 42-rwxr--r--. 1 jpuritz users 6804314 Mar 6 14:49 reference.fasta 43-rwxr--r--. 1 jpuritz users 1085387 Mar 6 14:49 reference.fasta.amb 44-rwxr--r--. 1 jpuritz users 1068884 Mar 6 14:49 reference.fasta.ann 45-rwxr--r--. 1 jpuritz users 6379720 Mar 6 14:49 reference.fasta.bwt 46-rwxr--r--. 1 jpuritz users 353544 Mar 6 14:49 reference.fasta.clstr 47-rwxr--r--. 1 jpuritz users 976388 Mar 6 14:49 reference.fasta.fai 48-rwxr--r--. 1 jpuritz users 1594913 Mar 6 14:49 reference.fasta.pac 49-rwxr--r--. 1 jpuritz users 3189872 Mar 6 14:49 reference.fasta.sa 50-rwxr--r--. 1 jpuritz users 137209 Mar 6 14:30 stats.out 51``` 52 53To start, we are going to use the program VCFtools (http://vcftools.sourceforge.net) to filter our vcf file. This program has a binary executable 54and has several perl scripts as well that are useful for filtering. 55I find it much more useful to use version 0.1.11, since it has more useful filtering commands (I think). Let's load that version 56This raw.vcf file is going to have a lot of erroneous variant calls and a lot of variants that are only present in one individual. 57To make this file more manageable, let's start by applying three step filter. We are going to only keep variants that have been successfully genotyped in 5850% of individuals, a minimum quality score of 30, and a minor allele count of 3. 59 60``` 61vcftools --gzvcf raw.vcf.gz --max-missing 0.5 --mac 3 --minQ 30 --recode --recode-INFO-all --out raw.g5mac3 62``` 63In this code, we call vcftools, feed it a vcf file after the `--vcf` flag, `--max-missing 0.5` tells it to filter genotypes called below 50% (across all individuals) 64the `--mac 3` flag tells it to filter SNPs that have a minor allele count less than 3. 65This is relative to genotypes, so it has to be called in at least 1 homozygote and 1 heterozygote or 3 heterozygotes. 66The `--recode` flag tells the program to write a new vcf file with the filters, `--recode-INFO-all` keeps all the INFO flags from the old vcf file in the new one. 67Lastly, `--out` designates the name of the output 68The output will scroll through a lot of lines, but should end like: 69 70``` 71After filtering, kept 40 out of 40 Individuals 72After filtering, kept 78434 out of a possible 147540 Sites 73Outputting VCF file... Done 74Run Time = 40.00 seconds 75``` 76 77Those two simple filters got rid of 50% of the data and will make the next filtering steps run much faster. 78 79We now have a filtered VCF called raw.g5mac3.recode.vcf. There is also a logfile generated called raw.g5mac3.log 80The next filter we will apply is a minimum depth for a genotype call and a minimum mean depth 81 82```bash 83vcftools --vcf raw.g5mac3.recode.vcf --minDP 3 --recode --recode-INFO-all --out raw.g5mac3dp3 84``` 85This command will recode genotypes that have less than 3 reads. 86I'll give you a second to take a deep breath. 87Yes, we are keeping genotypes with as few as 3 reads. We talked about this in the lecture portion of this course, but the short answer is that 88sophisticated multisample variant callers like FreeBayes and GATK can confidently call genotypes with few reads because variants are assessed across all 89samples simultaneously. So, the genotype is based on three reads AND prior information from all reads from all individuals. Relax. We will do plenty of other filtering steps! 90 91Don't believe me do you? I've made a script to help evaluate the potential errors. 92 93```bash 94curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/ErrorCount.sh 95chmod +x ErrorCount.sh 96./ErrorCount.sh raw.g5mac3dp3.recode.vcf 97``` 98 99```bash 100This script counts the number of potential genotyping errors due to low read depth 101It report a low range, based on a 50% binomial probability of observing the second allele in a heterozygote and a high range based on a 25% probability. 102Potential genotyping errors from genotypes from only 1 read range from 0 to 0.0 103Potential genotyping errors from genotypes from only 2 reads range from 0 to 0.0 104Potential genotyping errors from genotypes from only 3 reads range from 15986 to 53714.22 105Potential genotyping errors from genotypes from only 4 reads range from 6230 to 31502.04 106Potential genotyping errors from genotypes from only 5 reads range from 2493 to 18914 10740 number of individuals and 78434 equals 3137360 total genotypes 108Total genotypes not counting missing data 2380094 109Total potential error rate is between 0.0103815227466 and 0.0437504821238 110SCORCHED EARTH SCENARIO 111WHAT IF ALL LOW DEPTH HOMOZYGOTE GENOTYPES ARE ERRORS????? 112The total SCORCHED EARTH error rate is 0.129149100834. 113``` 114Right now, the maximum error rate for our VCF file because of genotypes less than 5 reads is less than 5%. See, nothing to worry about. 115 116The next step is to get rid of individuals that did not sequence well. We can do this by assessing individual levels of missing data. 117 118```bash 119vcftools --vcf raw.g5mac3dp3.recode.vcf --missing-indv 120``` 121This will create an output called out.imiss. Let's examine it. 122 123```bash 124cat out.imiss 125``` 126``` 127INDV N_DATA N_GENOTYPES_FILTERED N_MISS F_MISS 128BR_002 78434 0 13063 0.166548 129BR_004 78434 0 16084 0.205064 130BR_006 78434 0 25029 0.319109 131BR_009 78434 0 30481 0.38862 132BR_013 78434 0 69317 0.883762 133BR_015 78434 0 8861 0.112974 134BR_016 78434 0 29789 0.379797 135BR_021 78434 0 17422 0.222123 136BR_023 78434 0 43913 0.559872 137BR_024 78434 0 24220 0.308795 138BR_025 78434 0 21998 0.280465 139BR_028 78434 0 26786 0.34151 140BR_030 78434 0 74724 0.952699 141BR_031 78434 0 26488 0.337711 142BR_040 78434 0 19492 0.248515 143BR_041 78434 0 17107 0.218107 144BR_043 78434 0 16384 0.208889 145BR_046 78434 0 28770 0.366805 146BR_047 78434 0 13258 0.169034 147BR_048 78434 0 24505 0.312428 148WL_031 78434 0 22566 0.287707 149WL_032 78434 0 22604 0.288191 150WL_054 78434 0 32902 0.419486 151WL_056 78434 0 34106 0.434837 152WL_057 78434 0 37556 0.478823 153WL_058 78434 0 31448 0.400949 154WL_061 78434 0 35671 0.45479 155WL_064 78434 0 47816 0.609634 156WL_066 78434 0 10062 0.128286 157WL_067 78434 0 47940 0.611215 158WL_069 78434 0 38260 0.487799 159WL_070 78434 0 21188 0.270138 160WL_071 78434 0 16692 0.212816 161WL_072 78434 0 46347 0.590904 162WL_076 78434 0 78178 0.996736 163WL_077 78434 0 55193 0.703687 164WL_078 78434 0 54400 0.693577 165WL_079 78434 0 19457 0.248068 166WL_080 78434 0 30076 0.383456 167WL_081 78434 0 30334 0.386746 168``` 169You can see that some individuals have as high as 99.6% missing data. We definitely want to filter those out. Let's take a look at a histogram 170 171```bash 172mawk '!/IN/' out.imiss | cut -f5 > totalmissing 173gnuplot << \EOF 174set terminal dumb size 120, 30 175set autoscale 176unset label 177set title "Histogram of % missing data per individual" 178set ylabel "Number of Occurrences" 179set xlabel "% of missing data" 180#set yr [0:100000] 181binwidth=0.01 182bin(x,width)=width*floor(x/width) + binwidth/2.0 183plot 'totalmissing' using (bin($1,binwidth)):(1.0) smooth freq with boxes 184pause -1 185EOF 186``` 187 188```bash 189 Histogram of % missing data per individual 190Number of Occurrences 191 3 ++----------+---------***---------***-----------+------------+-----------+-----------+-----------+----------++ 192 + + * * * * + 'totalmissing' using (bin($1,binwidth)):(1.0) ****** + 193 | * * * * | 194 | * * * * | 195 | * * * * | 196 | * * * * | 197 2.5 ++ * * * * ++ 198 | * * * * | 199 | * * * * | 200 | * * * * | 201 | * * * * | 202 | * * * * | 203 2 ++ *************** * **** * * ++ 204 | * * * ** * * ** * * * | 205 | * * * ** * * ** * * * | 206 | * * * ** * * ** * * * | 207 | * * * ** * * ** * * * | 208 | * * * ** * * ** * * * | 209 1.5 ++ * * * ** * * ** * * * ++ 210 | * * * ** * * ** * * * | 211 | * * * ** * * ** * * * | 212 | * * * ** * * ** * * * | 213 | * * * ** * * ** * * * | 214 + * * +* ** * * ** * * * + + + + + + 215 1 +************************************************************************************************************* 216 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 217 % of missing data 218``` 219Most of the individuals have less than 0.5 missing data. That is probably a good cutoff to use for the moment. 220 221Now we need to create a list of individuals with more than 50% missing data. Anyone have any ideas? 222 223 224We can use mawk to do it. 225 226```bash 227mawk '$5 > 0.5' out.imiss | cut -f1 > lowDP.indv 228``` 229Take a moment to think about what this code is doing. 230 231Now that we have a list of individuals to remove, we can feed that directly into VCFtools for filtering. 232 233```bash 234vcftools --vcf raw.g5mac3dp3.recode.vcf --remove lowDP.indv --recode --recode-INFO-all --out raw.g5mac3dplm 235``` 236As you can see from the output, this removed 9 individuals. 237 238I've included a script called filter_missing_ind.sh that will automate this process for you in the future. Try it out. 239 240```bash 241curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/filter_missing_ind.sh 242chmod +x filter_missing_ind.sh 243./filter_missing_ind.sh raw.g5mac3dp3.recode.vcf DP3g95maf05 244``` 245The command always follows the structure of filter_missing_ind.sh vcf_to_filter name_prefix_for_new_vcf 246The script prints out a histogram like the one above and also calculates the 85% for missing data. 247Enter "no" 248Now that we have removed poor coverage individuals, we can restrict the data to variants called in a high percentage of individuals and filter by mean depth of genotypes 249 250```bash 251vcftools --vcf raw.g5mac3dplm.recode.vcf --max-missing 0.95 --maf 0.05 --recode --recode-INFO-all --out DP3g95maf05 --min-meanDP 20 252``` 253This leaves us with about 12,754 loci in our filtered vcf file. 254 255This applied a genotype call rate (95%) across all individuals. With two localities, this is sufficient, but when you have multiple localities being sampled 256You are also going to want to filter by a population specific call rate. VCFtools won't calculate this directly, but it is an easy workaround. 257First we need a file to define localities (populations). Most programs want the file to have two tab separated columns. First with the sample name, second with population assignment. 258I've already made one for this exercise. 259 260``` 261cat popmap 262``` 263``` 264BR_002 BR 265BR_004 BR 266BR_006 BR 267BR_009 BR 268BR_013 BR 269BR_015 BR 270BR_016 BR 271BR_021 BR 272BR_023 BR 273BR_024 BR 274BR_025 BR 275BR_028 BR 276BR_030 BR 277BR_031 BR 278BR_040 BR 279BR_041 BR 280BR_043 BR 281BR_046 BR 282BR_047 BR 283BR_048 BR 284WL_031 WL 285WL_032 WL 286WL_054 WL 287WL_056 WL 288WL_057 WL 289WL_058 WL 290WL_061 WL 291WL_064 WL 292WL_066 WL 293WL_067 WL 294WL_069 WL 295WL_070 WL 296WL_071 WL 297WL_072 WL 298WL_076 WL 299WL_077 WL 300WL_078 WL 301WL_079 WL 302WL_080 WL 303WL_081 WL 304``` 305Now we need to create two lists that have just the individual names for each population 306 307```bash 308mawk '$2 == "BR"' popmap > 1.keep && mawk '$2 == "WL"' popmap > 2.keep 309``` 310The above line demonstrates the use of && to simultaneous execute two tasks. 311 312Next, we use VCFtools to estimate missing data for loci in each population 313 314```bash 315vcftools --vcf DP3g95maf05.recode.vcf --keep 1.keep --missing-site --out 1 316vcftools --vcf DP3g95maf05.recode.vcf --keep 2.keep --missing-site --out 2 317``` 318This will generate files named 1.lmiss and 2.lmiss 319 320They follow this format 321 322``` 323head -3 1.lmiss 324``` 325```bash 326CHR POS N_DATA N_GENOTYPE_FILTERED N_MISS F_MISS 327E1_L101 9 34 0 0 0 328E1_L101 15 34 0 0 0 329``` 330I added extra tabs to make this easier to read, but what we are interested in is that last column with is the percentage of missing data for that locus. 331We can combine the two files and make a list of loci about the threshold of 10% missing data to remove. Note this is double the overall rate of missing data. 332 333```bash 334cat 1.lmiss 2.lmiss | mawk '!/CHR/' | mawk '$6 > 0.1' | cut -f1,2 >> badloci 335``` 336Who can walk us through that line of code? 337 338We then feed this file back into VCFtools to remove any of the loci 339 340```bash 341vcftools --vcf DP3g95maf05.recode.vcf --exclude-positions badloci --recode --recode-INFO-all --out DP3g95p5maf05 342``` 343 344Again, we only had two populations so our overall filter caught all of these. However, this will not be the case in multi-locality studies 345I also have made a script to automate this process as well. It's called pop_missing_filter.sh 346Executing it with no parameters will give you the usage. 347 348```bash 349curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/pop_missing_filter.sh 350chmod +x pop_missing_filter.sh 351./pop_missing_filter.sh 352``` 353 354``` 355Usage is pop_missing_filter vcffile popmap percent_missing_per_pop number_of_pops_for_cutoff name_for_output 356``` 357*** 358**From this point forward, the filtering steps assume that the vcf file was generated by FreeBayes** 359Note that other SNP callers can be configured to include the similar annotations. 360 361FreeBayes outputs a lot of information about a locus in the VCF file, using this information and the properties of RADseq, we add some sophisticated filters to the data. 362Let's take a look at the header of our VCF file and take a quick look at all the information. 363 364```bash 365mawk '/#/' DP3g95maf05.recode.vcf 366``` 367This will output several lines of INFO tags, I have highlighted a few below: 368 369``` 370INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data"> 371INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus"> 372INFO=<ID=QR,Number=1,Type=Integer,Description="Reference allele quality sum in phred"> 373INFO=<ID=QA,Number=A,Type=Integer,Description="Alternate allele quality sum in phred"> 374INFO=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand"> 375INFO=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand"> 376INFO=<ID=SAF,Number=A,Type=Integer,Description="Number of alternate observations on the forward strand"> 377INFO=<ID=SAR,Number=A,Type=Integer,Description="Number of alternate observations on the reverse strand"> 378INFO=<ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous"> 379INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex."> 380INFO=<ID=CIGAR,Number=A,Type=String,Description="The extended CIGAR representation of each alternate allele, with the exception that '=' is replaced by 'M' to ease VCF parsing. Note that INDEL alleles do not have the first matched base (which is provided by default, per the spec) referred to by the CIGAR."> 381INFO=<ID=MQM,Number=A,Type=Float,Description="Mean mapping quality of observed alternate alleles"> 382INFO=<ID=MQMR,Number=1,Type=Float,Description="Mean mapping quality of observed reference alleles"> 383INFO=<ID=PAIRED,Number=A,Type=Float,Description="Proportion of observed alternate alleles which are supported by properly paired read fragments"> 384INFO=<ID=PAIREDR,Number=1,Type=Float,Description="Proportion of observed reference alleles which are supported by properly paired read fragments"> 385FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> 386FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype"> 387FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy"> 388FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> 389FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count"> 390FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations"> 391FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count"> 392FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations"> 393``` 394The first filter we will apply will be on allele balance. Allele balance is: 395a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous 396Because RADseq targets specific locations of the genome, we expect that the allele balance in our data (for real loci) should be close to 0.5 397We can use the vcffilter program from vcflib. (https://github.com/ekg/vcflib) 398Typing it with no parameters will give you the usage. 399 400```bash 401vcffilter 402``` 403``` 404usage: vcffilter [options] <vcf file> 405 406options: 407 -f, --info-filter specifies a filter to apply to the info fields of records, 408 removes alleles which do not pass the filter 409 -g, --genotype-filter specifies a filter to apply to the genotype fields of records 410 -k, --keep-info used in conjunction with '-g', keeps variant info, but removes genotype 411 -s, --filter-sites filter entire records, not just alleles 412 -t, --tag-pass tag vcf records as positively filtered with this tag, print all records 413 -F, --tag-fail tag vcf records as negatively filtered with this tag, print all records 414 -A, --append-filter append the existing filter tag, don't just replace it 415 -a, --allele-tag apply -t on a per-allele basis. adds or sets the corresponding INFO field tag 416 -v, --invert inverts the filter, e.g. grep -v 417 -o, --or use logical OR instead of AND to combine filters 418 -r, --region specify a region on which to target the filtering, requires a BGZF 419 compressed file which has been indexed with tabix. any number of 420 regions may be specified. 421``` 422Let's use our first filter 423 424```bash 425vcffilter -s -f "AB > 0.25 & AB < 0.75 | AB < 0.01" DP3g95p5maf05.recode.vcf > DP3g95p5maf05.fil1.vcf 426``` 427vcffilter works with simple conditional statements, so this filters out loci with an allele balance below 0.25 and above 0.75. However, it does include those that are close to zero. 428The last condition is to catch loci that are fixed variants (all individuals are homozygous for one of the two variants). 429The `-s` tells the filter to apply to sites, not just alleles 430To see how many loci are now in the VCF file, you could feed it into VCFtools or you can just use a simple mawk statement 431 432```bash 433mawk '!/#/' DP3g95p5maf05.recode.vcf | wc -l 43412754 435``` 436 437```bash 438mawk '!/#/' DP3g95p5maf05.fil1.vcf | wc -l 4399678 440``` 441You'll notice that we've filtered a lot of loci. In my experience though, I find that most of these tend to be errors of some kind. However, this will be data dependent. I encourage you to explore your own data sets. 442 443The next filter we will apply filters out sites that have reads from both strands. For GWAS and even RNAseq, this can be a good thing. 444Unless you are using super small genomic fragment or really long reads (MiSeq). A SNP should be covered only by forward or only reverse reads. 445 446```bash 447vcffilter -f "SAF / SAR > 100 & SRF / SRR > 100 | SAR / SAF > 100 & SRR / SRF > 100" -s DP3g95p5maf05.fil1.vcf > DP3g95p5maf05.fil2.vcf 448``` 449The filter is based on proportions, so that a few extraneous reads won't remove an entire locus. In plain english, it's keeping loci that have over 100 times more forward alternate reads than reverse alternate reads and 100 times more forward reference reads than reverse reference reads along with the reciprocal. 450 451```bash 452mawk '!/#/' DP3g95p5maf05.fil2.vcf | wc -l 4539491 454``` 455That only removes a small proportion of loci, but these loci are likely to be paralogs, microbe contamination, or weird PCR chimeras. 456 457The program SAMtools is a great way to visualize alignments right from the terminal. 458 459```bash 460samtools tview BR_006-RG.bam reference.fasta -p E28188_L151 461``` 462 463```bash 464 11 21 31 41 51 61 71 81 91 101 111 121 131 465AATTCTCAGAGCTAGAGTGGGGACGGCAGTTGGTAGAGGGTACAGCAGTTCTAAAAACATGTAGAAATTTTCTCTTCAACTCGCTCCTACGGCCACAGCGTTCACTCCACATACACAAATTGTACACCAAAACATAGGAAAAG 466...........S...........Y.K......S.........G.......K.........S............................Y........Y....W.........................M...G......... 467..........................................G.......G......................................T..... 468..........................................G.......T............................................ 469...........G...........T.T......C.........G.................C............................... 470..........................................G.......T....................................G....... 471..........................................G.......T............................................ 472...........G...........T.T......C.........G.................C.................................. 473..........................................G.......T............................................ 474...........G.................C............G.......G.................******........A...............T.. 475 ,,,,,,g,,,,,,,,,,,,,,,,,******,,,,,,,,a,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,, 476 ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,g,,,,,,,,, 477 t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,, 478 ,,,,,,,,,,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,, 479 g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,, 480 t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,, 481 g,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,, 482 ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,g,,,,,,,,, 483 ,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,g,,,,,,,,, 484 t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,, 485 g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,a,,,,,,,,,,,,,,,c,,,g,,,,,,,,, 486 g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,, 487 ,,,,,,,,,,c,g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,g,,,,,,,,, 488 g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,, 489 t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,, 490 g,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,, 491 g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,c,,,g,,,,,,,,, 492``` 493As you can see this is a mess. There appear to be over two haplotypes here. 494For more info on how to use samtools tview press the question mark while you are in the window. 495 496The next filter looks at the ratio of mapping qualities between reference and alternate alleles 497 498```bash 499vcffilter -f "MQM / MQMR > 0.9 & MQM / MQMR < 1.05" DP3g95p5maf05.fil2.vcf > DP3g95p5maf05.fil3.vcf 500``` 501The rationale here is that, again, because RADseq loci and alleles all should start from the same genomic location there should not be large discrepancy 502between the mapping qualities of two alleles. 503 504```bash 505mawk '!/#/' DP3g95p5maf05.fil3.vcf | wc -l 5069229 507``` 508This filters away less than 3% of the variants, but likely need to be filtered. Let's take a look at one 509 510```bash 511samtools tview BR_004-RG.bam reference.fasta -p E20_L173 512``` 513 514```bash 5151 11 21 31 41 51 61 71 81 91 101 111 121 131 516NAATTCATCTGTTGCAGGCAGCTCACACTTGCAGCCTCGGCTCGCACCAGCAGAGCAGCCGTAGAATACTTAGTTTAATAGAATGGCTTGGCATTTNNNNNNNNNNCATGAGGTTGTTATTCTCAGAAGACTAATCACAGACA 517 .......Y.........YM....WS...Y....S...R....R....................................................C ....................G................ 518 .......T.........TC....TG...C....G...A....A.................................. ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,, 519 .......T.........TC.....G...C....G...A....A.................................. ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,, 520 .......T.........TC....TG...C....G...A....A.................................. ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,, 521 .......T.........TC.....G...C....G...A....A.................................. ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,, 522 .......T.........TC.....G...C....G...A....A.................................. ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,, 523 .......T.........TC....TG...C....G...A....A.................................. 524 .......T.........TC.....G...C....G...A....A.................................. 525 .......T.........TC.....G...C....G...A....A.................................. 526 .......T.........TC....TG...C....G...A....A.................................. 527 .......T.........TC....TG...C....G...A....A.................................. 528 ..........C....................................................................................C 529 .......T.........TC.....G...C....G...A....A.................................. 530 ..........C....................................................................................C 531 .......T.........TC.....G...C....G...A....A.................................. 532 ......GT.........TC.....G...C....G...A....A.................................. 533 .......T.........TC....TG...C....G...A....A.................................. 534``` 535There is a large amount of clipping going on here for the variant alleles likely why the mapping quality is low for them. You can also see that there are three different alleles present here. Press SHIFT+L to scroll further down the alignment. You can see that some of the polymorphism is also link to a cut site variant. All things that should be avoided. 536 537Yet another filter that can be applied is whether or not their is a discrepancy in the properly paired status of for reads supporting reference or alternate alleles. 538 539```bash 540vcffilter -f "PAIRED > 0.05 & PAIREDR > 0.05 & PAIREDR / PAIRED < 1.75 & PAIREDR / PAIRED > 0.25 | PAIRED < 0.05 & PAIREDR < 0.05" -s DP3g95p5maf05.fil3.vcf > DP3g95p5maf05.fil4.vcf 541``` 542Since de novo assembly is not perfect, some loci will only have unpaired reads mapping to them. This is not a problem. The problem occurs when all the reads supporting the reference allele 543are paired but not supporting the alternate allele. That is indicative of a problem. 544 545```bash 546mawk '!/#/' DP3g95p5maf05.fil4.vcf | wc -l 5479166 548``` 549Our loci count keeps dwindling, but our signal to noise ration keeps increasing. Let's look at an example of what we filtered. 550 551```bash 552samtools tview BR_006-RG.bam reference.fasta -p E4407_L138 553``` 554This output doesn't paste well to the terminal, but you can see the clear discrepancy between mapping status and allele status. This could be indicative of cut site polymorphism or paralogs. 555The next filter we will apply is to look at the ration of locus quality score to depth 556Heng Li found some interesting results about how quality score and locus depth are related to each other in real and spurious variant calls 557See his preprint here (http://arxiv.org/pdf/1404.0929.pdf) 558Also see this great blog post about it here (http://bcb.io/2014/05/12/wgs-trio-variant-evaluation/) I **REALLY** recommend following that blog. Brad Chapman's group is really good. 559In short, with whole genome samples, it was found that high coverage can lead to inflated locus quality scores. Heng proposed that for read depths greater than the mean depth plus 2-3 times 560the square root of mean depth that the quality score will be twice as large as the depth in real variants and below that value for false variants. 561I actually found that this is a little too conservative for RADseq data, likely because the reads aren't randomly distributed across contigs. I implement two filters based on this idea. 562the first is removing any locus that has a quality score below 1/4 of the depth. 563 564```bash 565vcffilter -f "QUAL / DP > 0.25" DP3g95p5maf05.fil4.vcf > DP3g95p5maf05.fil5.vcf 566``` 567The second is more complicated. The first step is to create a list of the depth of each locus 568 569```bash 570cut -f8 DP3g95p5maf05.fil5.vcf | grep -oe "DP=[0-9]*" | sed -s 's/DP=//g' > DP3g95p5maf05.fil5.DEPTH 571``` 572The second step is to create a list of quality scores. 573 574```bash 575mawk '!/#/' DP3g95p5maf05.fil5.vcf | cut -f1,2,6 > DP3g95p5maf05.fil5.vcf.loci.qual 576``` 577Next step is to calculate the mean depth 578 579```bash 580mawk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }' DP3g95p5maf05.fil5.DEPTH 5811952.82 582``` 583Now the the mean plus 3X the square of the mean 584 585```bash 586python -c "print int(1952+3*(1952**0.5))" 5872084 588``` 589Next we paste the depth and quality files together and find the loci above the cutoff that do not have quality scores 2 times the depth 590 591```bash 592paste DP3g95p5maf05.fil5.vcf.loci.qual DP3g95p5maf05.fil5.DEPTH | mawk -v x=2084 '$4 > x' | mawk '$3 < 2 * $4' > DP3g95p5maf05.fil5.lowQDloci 593``` 594Now we can remove those sites and recalculate the depth across loci with VCFtools 595 596```bash 597vcftools --vcf DP3g95p5maf05.fil5.vcf --site-depth --exclude-positions DP3g95p5maf05.fil5.lowQDloci --out DP3g95p5maf05.fil5 598``` 599Now let's take VCFtools output and cut it to only the depth scores 600 601```bash 602cut -f3 DP3g95p5maf05.fil5.ldepth > DP3g95p5maf05.fil5.site.depth 603``` 604Now let's calculate the average depth by dividing the above file by the number of individuals 31 605 606```bash 607mawk '!/D/' DP3g95p5maf05.fil5.site.depth | mawk -v x=31 '{print $1/x}' > meandepthpersite 608``` 609Let's plot the data as a histogram 610 611```bash 612gnuplot << \EOF 613set terminal dumb size 120, 30 614set autoscale 615set xrange [10:150] 616unset label 617set title "Histogram of mean depth per site" 618set ylabel "Number of Occurrences" 619set xlabel "Mean Depth" 620binwidth=1 621bin(x,width)=width*floor(x/width) + binwidth/2.0 622set xtics 5 623plot 'meandepthpersite' using (bin($1,binwidth)):(1.0) smooth freq with boxes 624pause -1 625EOF 626``` 627 628```bash 629 Histogram of mean depth per site 630Number of Occurrences 631 250 ++--+---+---+---+--+---+---+---+---+---+---+---+---+---+--+---+---+---+---+---+---+---+---+--+---+---+---+--++ 632 + + + + + + + + + + + + + +'meandepthpersite' using (bin($1,binwidth)):(1.0)+****** + 633 | ** ** | 634 | *** ** | 635 | ******** | 636 200 ++ ********* * ++ 637 | ** ********** * | 638 | ************* ** * | 639 | **************** * * | 640 | ******************* ** | 641 150 ++ ******************* *** ++ 642 | ************************* | 643 | *************************** | 644 | ******************************* | 645 100 ++ ********************************** ++ 646 | ********************************** | 647 | ************************************** * | 648 | ************************************** * ** | 649 | ********************************************* | 650 50 ++ ************************************************ ++ 651 | ************************************************ ** ** | 652 | *********************************************************** ** | 653 | **************************************************************** ** ** ** | 654 + + ************************************************************************************** ******+******** 655 0 ++--+---****************************************************************************************************** 656 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125130 135 140 145 150 657 Mean Depth 658``` 659Loci that have high mean depth are indicative of either paralogs or multicopy loci. Either way we want to remove them. Here, I'd remove all loci above a mean depth of 102.5. Now we can combine both filters to produce another VCF file 660 661```bash 662vcftools --vcf DP3g95p5maf05.fil5.vcf --recode-INFO-all --out DP3g95p5maf05.FIL --max-meanDP 102.5 --exclude-positions DP3g95p5maf05.fil5.lowQDloci --recode 663``` 664In the end, VCFtools kept 8417 out of a possible 9164 Sites. 665BTW, I've also written a script to automate the filterings steps described in steps 23-44. It's called dDocent_filters. It will go through the filtering steps and 666recode a log file for you for each of the steps, including the depth histogram. 667 668```bash 669curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/dDocent_filters 670chmod +x dDocent_filters 671./dDocent_filters 672``` 673``` 674This script will automatically filter a FreeBayes generated VCF file using criteria related to site depth, 675quality versus depth, strand representation, allelic balance at heterzygous individuals, and paired read representation. 676The script assumes that loci and individuals with low call rates (or depth) have already been removed. 677 678Contact Jon Puritz (jpuritz@gmail.com) for questions and see script comments for more details on particular filters 679 680Usage is sh FB_filters.sh VCF_file Output_prefix 681``` 682 683The next filter to apply is HWE. Heng Li also found that HWE is another excellent filter to remove erroneous variant calls. 684We don't want to apply it across the board, since population structure will create departures from HWE as well. We need to apply this by population. 685I've included a perl script written by Chris Hollenbeck, one of the PhD student's in my current lab that will do this for us. 686 687```bash 688curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/filter_hwe_by_pop.pl 689chmod +x filter_hwe_by_pop.pl 690``` 691 692```bash 693./filter_hwe_by_pop.pl 694 Usage: 695 filter_hwe_by_pop.pl -v <vcffile> -p <popmap> [options] 696 697 Options: -v <vcffile> input vcf file -p <popmap> tab-separated file of 698 samples and population designations -h [hwe] minimum Hardy-Weinberg 699 p-value cutoff for SNPs -c [cutoff] proportion of all populations that a 700 locus can be below HWE cutoff without being filtered -o [out] name of 701 outfile 702 703 Options: 704 -v, --vcffile 705 VCF input file 706 707 -p, --popmap 708 File with names of individuals and population designations, one 709 per line 710 711 -h, --hwe 712 Minimum cutoff for Hardy-Weinberg p-value (for test as 713 implemented in vcftools) [Default: 0.001] 714 715 -c, --cutoff 716 Proportion of all populations that a locus can be below HWE 717 cutoff without being filtered. For example, choosing 0.5 will 718 filter SNPs that are below the p-value threshold in 50% or more 719 of the populations. [Default: 0.25] 720 721 -o, --out 722 Name of outfile, by vcftools conventions (will be named 723 X.recode.vcf) 724``` 725Let's filter our SNPs by population specific HWE 726First, we need to convert our variant calls to SNPs 727To do this we will use another command from vcflib called vcfallelicprimatives 728 729```bash 730vcfallelicprimitives DP3g95p5maf05.FIL.recode.vcf --keep-info --keep-geno > DP3g95p5maf05.prim.vcf 731``` 732This will decompose complex variant calls into phased SNP and INDEL genotypes and keep the INFO flags for loci and genotypes. 733Next, we can feed this VCF file into VCFtools to remove indels. 734 735```bash 736vcftools --vcf DP3g95p5maf05.prim.vcf --remove-indels --recode --recode-INFO-all --out SNP.DP3g95p5maf05 737``` 738We now have 8379 SNP calls in our new VCF. 739Now, let's apply the HWE filter 740 741```bash 742./filter_hwe_by_pop.pl -v SNP.DP3g95p5maf05.recode.vcf -p popmap -o SNP.DP3g95p5maf05.HWE -h 0.01 743``` 744 745```bash 746Processing population: BR (20 inds) 747Processing population: WL (20 inds) 748Outputting results of HWE test for filtered loci to 'filtered.hwe' 749Kept 8176 of a possible 8379 loci (filtered 203 loci) 750``` 751*Note, I would not normally use such a high `-h` value. It's purely for this example. Typically, errors would have a low p-value and would be present in many populations.* 752 753We have now created a thoroughly filtered VCF, and we should have confidence in these SNP calls. 754However, our lab is currently developing one more script, called rad_haplotyper. 755This tool takes a VCF file of SNPs and will parse through BAM files looking to link SNPs into haplotypes along paired reads. 756```bash 757curl -L -O https://raw.githubusercontent.com/jpuritz/WinterSchool.2016/master/rad_haplotyper.pl 758chmod +x rad_haplotyper.pl 759 760``` 761Note, this script requires several Perl libraries. See the README [here](https://github.com/jpuritz/NTU-Workshop/blob/master/scripts/README.md) 762It has a lot of options, let's take a look 763 764```bash 765perl rad_haplotyper.pl 766Usage: 767 perl rad_haplotyper.pl -v <vcffile> -r <reference> [options] 768 769 Options: -v <vcffile> input vcf file 770 771 -r <reference> reference genome 772 773 -s [samples] optionally specify an individual sample to be haplotyped 774 775 -u [snp_cutoff] remove loci with more than a specified number of SNPs 776 777 -h [hap_cutoff] remove loci with more than a specified number of haplotypes relative to SNPs 778 779 -m [miss_cutoff] cutoff for proportion of missing data for loci to be included in the output 780 781 -mp [max_paralog_inds] cutoff for excluding possible paralogs 782 783 -ml [max_low_cov_inds] cutoff for excluding loci with low coverage or genotyping errors 784 785 -d [depth] sampling depth used by the algorithm to build haplotypes 786 787 -g [genepop] genepop file for population output 788 789 -p [popmap] population map for organizing Genepop file 790 791 -t [tsvfile] tsv file for linkage map output 792 793 -a [imafile] IMa file output 794 795 -p1 [parent1] first parent in the mapping cross 796 797 -p2 [parent2] second parent in the mapping cross 798 799 -x [threads] number of threads to use for the analysis 800 801 -n use indels 802 803 -e debug 804 805Options: 806 -v, --vcffile 807 VCF input file 808 809 -r, --reference 810 Reference genome (FASTA format) 811 812 -s, --samples 813 Individual samples to use in the analysis - can be used multiple 814 times for multiple individuals [Default: All] 815 816 -u, --cutoff 817 Excludes loci with more than the specified number of SNPs 818 [Default: No filter] 819 820 -h, --hap_count 821 Excludes loci with more than the specified number of haplotypes 822 relative to number of SNPs. Excluding forces other than mutation 823 (i.e. recombination) the maximum number of haplotypes should be 824 one more than the number of SNPs at the locus. The value 825 provided is the number of haplotypes allowed in excess of the 826 number of SNPs, which allows that mechanisms other than mutation 827 may have influenced the number of haplotypes in the population. 828 [Default: 100] 829 830 -x, --threads 831 Run in parallel across individuals with a specified number of 832 threads 833 834 -n, --indels 835 Includes indels that are the only polymorphism at the locus 836 (tag) 837 838 -d, --depth 839 Specify a depth of sampling reads for building haplotypes 840 [Default: 20] 841 842 -m, --miss_cutoff 843 Proportion of missing data cutoff for removing loci from the 844 final output. For example, to keep only loci with successful 845 haplotype builds in 95% of individuals, enter 0.95. [Default: 846 0.9] 847 848 -mp, --max_paralog_inds 849 Count cutoff for removing loci that are possible paralogs from 850 the final output. The value is the maximum allowable number of 851 individuals with more than the expected number of haplotypes 852 [Default: No filter] 853 854 -ml, --max_low_cov_inds 855 Count cutoff for removing loci with low coverage or genotyping 856 errors from the final output. The value is the maximum allowable 857 number of individuals with less than the expected number of 858 haplotypes [Default: No filter] 859 860 -g, --genepop 861 Writes a genepop file using haplotypes. Must provide the name of 862 the genepop file. 863 864 -a, --ima 865 Writes a IMa file using haplotypes. Must provide the name of the 866 IMa file. 867 868 -p, --popmap 869 Tab-separated file of individuals and their population 870 designation, one per line (required for Genepop output) 871 872 -t, --tsvfile 873 Writes a tsv file using haplotypes - for mapping crosses only. 874 Must provide the name of the tsv file. 875 876 -p1, --parent1 877 Parent 1 of the mapping cross (must be specified if writing a 878 tsv file) 879 880 -p2, --parent2 881 Parent 2 of the mapping cross (must be specified if writing a 882 tsv file) 883 884 -e, --debug 885 Output extra logs for debugging purposes 886``` 887We don't have enough time to go into depth with all these options and this tool is still under development. 888It also take some substantial resources to run. I will simulate running this for you. 889 890```bash 891#rad_haplotyper.pl -v SNP.DP3g95p5maf05.HWE.recode.vcf -x 40 -mp 1 -u 20 -ml 4 -n -r reference.fasta 892``` 893Note, this will not actually run. It needs all the BAM files to proceed. 894 895Here's the simulated output 896 897```bash 898Removed 0 loci (0 SNPs) with more than 20 SNPs at a locus 899Building haplotypes for BR_024 900Building haplotypes for BR_028 901Building haplotypes for WL_054 902Building haplotypes for BR_016 903Building haplotypes for BR_009 904Building haplotypes for BR_006 905Building haplotypes for BR_041 906Building haplotypes for BR_040 907Building haplotypes for BR_046 908Building haplotypes for BR_031 909Building haplotypes for BR_025 910Building haplotypes for BR_002 911Building haplotypes for WL_058 912Building haplotypes for WL_057 913Building haplotypes for WL_061 914Building haplotypes for WL_069 915Building haplotypes for WL_070 916Building haplotypes for BR_048 917Building haplotypes for WL_031 918Building haplotypes for WL_056 919Building haplotypes for BR_047 920Building haplotypes for WL_079 921Building haplotypes for WL_080 922Building haplotypes for WL_032 923Building haplotypes for WL_071 924Building haplotypes for WL_081 925Building haplotypes for BR_004 926Building haplotypes for BR_021 927Building haplotypes for BR_015 928Building haplotypes for BR_043 929Building haplotypes for WL_066 930Filtered 26 loci below missing data cutoff 931Filtered 66 possible paralogs 932Filtered 17 loci with low coverage or genotyping errors 933Filtered 0 loci with an excess of haplotypes 934``` 935The script found another 109 loci to remove from our file. Besides this output to the terminal, the script outputs a file called stats.out 936 937```bash 938head stats.out 939Locus Sites Haplotypes Inds_Haplotyped Total_Inds Prop_Haplotyped Status Poss_Paralog Low_Cov/Geno_Err Miss_Geno Comment 940E10001_L101 1 2 30 31 0.968 PASSED 0 0 1 941E10003_L101 7 9 30 31 0.968 PASSED 1 0 0 942E10004_L101 - - - - - FILTERED0 0 1 Complex 943E10008_L101 1 2 30 31 0.968 PASSED 0 0 1 944E10013_L142 3 6 30 31 0.968 PASSED 0 0 1 945E10014_L117 2 3 31 31 1.000 PASSED 0 0 0 946E10024_L101 1 2 31 31 1.000 PASSED 0 0 0 947E10029_L101 1 2 31 31 1.000 PASSED 0 0 0 948``` 949We can use this file to create a list of loci to filter 950 951```bash 952grep FILTERED stats.out | mawk '!/Complex/' | cut -f1 > loci.to.remove 953``` 954Now that we have the list we can parse through the VCF file and remove the bad RAD loci 955I've made a simple script to do this remove.bad.hap.loci.sh 956 957```bash 958curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/remove.bad.hap.loci.sh 959chmod +x remove.bad.hap.loci.sh 960./remove.bad.hap.loci.sh loci.to.remove SNP.DP3g95p5maf05.HWE.recode.vcf 961``` 962This produces a FINAL FINAL FINAL filtered VCF file SNP.DP3g95p5maf05.HWE.filtered.vcf 963 964```bash 965mawk '!/#/' SNP.DP3g95p5maf05.HWE.filtered.vcf | wc -l 966``` 967We're left with 7,666 SNPs! 968How many possible errors? 969 970```bash 971./ErrorCount.sh SNP.DP3g95p5maf05.HWE.filtered.vcf 972``` 973 974```bash 975This script counts the number of potential genotyping errors due to low read depth 976It report a low range, based on a 50% binomial probability of observing the second allele in a heterozygote and a high range based on a 25% probability. 977Potential genotyping errors from genotypes from only 1 read range from 0 to 0.0 978Potential genotyping errors from genotypes from only 2 reads range from 0 to 0.0 979Potential genotyping errors from genotypes from only 3 reads range from 302 to 1014.72 980Potential genotyping errors from genotypes from only 4 reads range from 162 to 822.232 981Potential genotyping errors from genotypes from only 5 reads range from 88 to 669 98231 number of individuals and 7666 equals 237646 total genotypes 983Total genotypes not counting missing data 237081 984Total potential error rate is between 0.00232831816974 and 0.0105700245908 985SCORCHED EARTH SCENARIO 986WHAT IF ALL LOW DEPTH HOMOZYGOTE GENOTYPES ARE ERRORS????? 987The total SCORCHED EARTH error rate is 0.0330857386294. 988 989``` 990 991**Congrats! You've finished the Filtering Tutorial** 992