1## For the Impatient 2 3```sh 4# Download bwakit (or from <http://sourceforge.net/projects/bio-bwa/files/bwakit/> manually) 5wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.12_x64-linux.tar.bz2/download \ 6 | gzip -dc | tar xf - 7# Generate the GRCh38+ALT+decoy+HLA and create the BWA index 8bwa.kit/run-gen-ref hs38DH # download GRCh38 and write hs38DH.fa 9bwa.kit/bwa index hs38DH.fa # create BWA index 10# mapping 11bwa.kit/run-bwamem -o out -H hs38DH.fa read1.fq read2.fq | sh # skip "|sh" to show command lines 12``` 13 14This generates `out.aln.bam` as the final alignment, `out.hla.top` for best HLA 15genotypes on each gene and `out.hla.all` for other possible HLA genotypes. 16Please check out [bwa/bwakit/README.md][kithelp] for details. 17 18## Background 19 20GRCh38 consists of several components: chromosomal assembly, unlocalized contigs 21(chromosome known but location unknown), unplaced contigs (chromosome unknown) 22and ALT contigs (long clustered variations). The combination of the first three 23components is called the *primary assembly*. It is recommended to use the 24complete primary assembly for all analyses. Using ALT contigs in read mapping is 25tricky. 26 27GRCh38 ALT contigs are totaled 109Mb in length, spanning 60Mbp of the primary 28assembly. However, sequences that are highly diverged from the primary assembly 29only contribute a few million bp. Most subsequences of ALT contigs are nearly 30identical to the primary assembly. If we align sequence reads to GRCh38+ALT 31blindly, we will get many additional reads with zero mapping quality and miss 32variants on them. It is crucial to make mappers aware of ALTs. 33 34BWA-MEM is ALT-aware. It essentially computes mapping quality across the 35non-redundant content of the primary assembly plus the ALT contigs and is free 36of the problem above. 37 38## Methods 39 40### Sequence alignment 41 42As of now, ALT mapping is done in two separate steps: BWA-MEM mapping and 43postprocessing. The `bwa.kit/run-bwamem` script performs the two steps when ALT 44contigs are present. The following picture shows an example about how BWA-MEM 45infers mapping quality and reports alignment after step 2: 46 47![](http://lh3lh3.users.sourceforge.net/images/alt-demo.png) 48 49#### Step 1: BWA-MEM mapping 50 51At this step, BWA-MEM reads the ALT contig names from "*idxbase*.alt", ignoring 52the ALT-to-ref alignment, and labels a potential hit as *ALT* or *non-ALT*, 53depending on whether the hit lands on an ALT contig or not. BWA-MEM then reports 54alignments and assigns mapQ following these two rules: 55 561. The mapQ of a non-ALT hit is computed across non-ALT hits only. The mapQ of 57 an ALT hit is computed across all hits. 58 592. If there are no non-ALT hits, the best ALT hit is outputted as the primary 60 alignment. If there are both ALT and non-ALT hits, non-ALT hits will be 61 primary and ALT hits be supplementary (SAM flag 0x800). 62 63In theory, non-ALT alignments from step 1 should be identical to alignments 64against the reference genome with ALT contigs. In practice, the two types of 65alignments may differ in rare cases due to seeding heuristics. When an ALT hit 66is significantly better than non-ALT hits, BWA-MEM may miss seeds on the 67non-ALT hits. 68 69If we don't care about ALT hits, we may skip postprocessing (step 2). 70Nonetheless, postprocessing is recommended as it improves mapQ and gives more 71information about ALT hits. 72 73#### Step 2: Postprocessing 74 75Postprocessing is done with a separate script `bwa-postalt.js`. It reads all 76potential hits reported in the XA tag, lifts ALT hits to the chromosomal 77positions using the ALT-to-ref alignment, groups them based on overlaps between 78their lifted positions, and then re-estimates mapQ across the best scoring hit 79in each group. Being aware of the ALT-to-ref alignment, this script can greatly 80improve mapQ of ALT hits and occasionally improve mapQ of non-ALT hits. It also 81writes each hit overlapping the reported hit into a separate SAM line. This 82enables variant calling on each ALT contig independent of others. 83 84### On the completeness of GRCh38+ALT 85 86While GRCh38 is much more complete than GRCh37, it is still missing some true 87human sequences. To make sure every piece of sequence in the reference assembly 88is correct, the [Genome Reference Consortium][grc] (GRC) require each ALT contig 89to have enough support from multiple sources before considering to add it to the 90reference assembly. This careful and sophisticated procedure has left out some 91sequences, one of which is [this example][novel], a 10kb contig assembled from 92CHM1 short reads and present also in NA12878. You can try [BLAT][blat] or 93[BLAST][blast] to see where it maps. 94 95For a more complete reference genome, we compiled a new set of decoy sequences 96from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples. 97The sequences are included in `hs38DH-extra.fa` from the [BWA binary 98package][res]. 99 100In addition to decoy, we also put multiple alleles of HLA genes in 101`hs38DH-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb], 102version 3.18.0 and are used to collect reads sequenced from these genes. 103 104### HLA typing 105 106HLA genes are known to be associated with many autoimmune diseases, infectious 107diseases and drug responses. They are among the most important genes but are 108rarely studied by WGS projects due to the high sequence divergence between 109HLA genes and the reference genome in these regions. 110 111By including the HLA gene regions in the reference assembly as ALT contigs, we 112are able to effectively identify reads coming from these genes. We also provide 113a pipeline, which is included in the [BWA binary package][res], to type the 114several classic HLA genes. The pipeline is conceptually simple. It de novo 115assembles sequence reads mapped to each gene, aligns exon sequences of each 116allele to the assembled contigs and then finds the pairs of alleles that best 117explain the contigs. In practice, however, the completeness of IMGT/HLA and 118copy-number changes related to these genes are not so straightforward to 119resolve. HLA typing may not always be successful. Users may also consider to use 120other programs for typing such as [Warren et al (2012)][hla4], [Liu et al 121(2013)][hla2], [Bai et al (2014)][hla3] and [Dilthey et al (2014)][hla1], though 122most of them are distributed under restrictive licenses. 123 124## Preliminary Evaluation 125 126To check whether GRCh38 is better than GRCh37, we mapped the CHM1 and NA12878 127unitigs to GRCh37 primary (hs37), GRCh38 primary (hs38) and GRCh38+ALT+decoy 128(hs38DH), and called small variants from the alignment. CHM1 is haploid. 129Ideally, heterozygous calls are false positives (FP). NA12878 is diploid. The 130true positive (TP) heterozygous calls from NA12878 are approximately equal 131to the difference between NA12878 and CHM1 heterozygous calls. A better assembly 132should yield higher TP and lower FP. The following table shows the numbers for 133these assemblies: 134 135|Assembly|hs37 |hs38 |hs38DH|CHM1_1.1| huref| 136|:------:|------:|------:|------:|------:|------:| 137|FP | 255706| 168068| 142516|307172 | 575634| 138|TP |2142260|2163113|2150844|2167235|2137053| 139 140With this measurement, hs38 is clearly better than hs37. Genome hs38DH reduces 141FP by ~25k but also reduces TP by ~12k. We manually inspected variants called 142from hs38 only and found the majority of them are associated with excessive read 143depth, clustered variants or weak alignment. We believe most hs38-only calls are 144problematic. In addition, if we compare two NA12878 replicates from HiSeq X10 145with nearly identical library construction, the difference is ~140k, an order 146of magnitude higher than the difference between hs38 and hs38DH. ALT contigs, 147decoy and HLA genes in hs38DH improve variant calling and enable the analyses of 148ALT contigs and HLA typing at little cost. 149 150## Problems and Future Development 151 152There are some uncertainties about ALT mappings - we are not sure whether they 153help biological discovery and don't know the best way to analyze them. Without 154clear demand from downstream analyses, it is very difficult to design the 155optimal mapping strategy. The current BWA-MEM method is just a start. If it 156turns out to be useful in research, we will probably rewrite bwa-postalt.js in C 157for performance; if not, we may make changes. It is also possible that we might 158make breakthrough on the representation of multiple genomes, in which case, we 159can even get rid of ALT contigs for good. 160 161 162 163[res]: https://sourceforge.net/projects/bio-bwa/files/bwakit 164[sb]: https://github.com/GregoryFaust/samblaster 165[grc]: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/ 166[novel]: https://gist.github.com/lh3/9935148b71f04ba1a8cc 167[blat]: https://genome.ucsc.edu/cgi-bin/hgBlat 168[blast]: http://blast.st-va.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome 169[sgdp]: http://www.simonsfoundation.org/life-sciences/simons-genome-diversity-project/ 170[hladb]: http://www.ebi.ac.uk/ipd/imgt/hla/ 171[grcdef]: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/info/definitions.shtml 172[hla1]: http://biorxiv.org/content/early/2014/07/08/006973 173[hlalink]: http://www.hladiseaseassociations.com 174[hlatools]: https://www.biostars.org/p/93245/ 175[hla2]: http://nar.oxfordjournals.org/content/41/14/e142.full.pdf+html 176[hla3]: http://www.biomedcentral.com/1471-2164/15/325 177[hla4]: http://genomemedicine.com/content/4/12/95 178[kithelp]: https://github.com/lh3/bwa/tree/master/bwakit 179