• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..03-May-2022-

.github/H10-Feb-2021-150121

bin/H07-May-2022-

contrib/H03-May-2022-125,45692,342

examples/H10-Feb-2021-3026

intervaltree/H10-Dec-2019-10,8358,636

paper/H03-May-2022-2,6212,275

python/H10-Feb-2021-727510

scripts/H10-Feb-2021-252196

src/H03-May-2022-115,991111,951

test/H03-May-2022-2,9102,394

ttmath/H10-Feb-2021-24,52212,321

vcflib/H03-May-2022-371,506295,144

vcflib-temp/H03-May-2022-133,840116,949

.gitignoreH A D10-Feb-2021182 2019

.gitmodulesH A D10-Feb-2021490 1615

.travis.ymlH A D10-Feb-2021519 2726

LICENSEH A D10-Feb-20211 KiB2016

README.mdH A D10-Feb-202124.4 KiB497352

RELEASE-NOTES.mdH A D10-Feb-20212.4 KiB6045

guix.scmH A D10-Feb-20212.9 KiB9070

meson.buildH A D10-Feb-20217.2 KiB226203

README.md

1# *freebayes*, a haplotype-based variant detector
2## user manual and guide
3
4
5[![Github-CI](https://github.com/freebayes/freebayes/workflows/CI/badge.svg)](https://github.com/freebayes/freebayes/actions) [![Travis-CI](https://travis-ci.com/freebayes/freebayes.svg?branch=master)](https://travis-ci.com/freebayes/freebayes) [![AnacondaBadge](https://anaconda.org/bioconda/freebayes/badges/installer/conda.svg)](https://anaconda.org/bioconda/freebayes) [![DL](https://anaconda.org/bioconda/freebayes/badges/downloads.svg)](https://anaconda.org/bioconda/freebayes) [![BrewBadge](https://img.shields.io/badge/%F0%9F%8D%BAbrew-freebayes-brightgreen.svg)](https://github.com/brewsci/homebrew-bio) [![GuixBadge](https://img.shields.io/badge/gnuguix-freebayes-brightgreen.svg)](https://www.gnu.org/software/guix/packages/F/) [![DebianBadge](https://badges.debian.net/badges/debian/testing/freebayes/version.svg)](https://packages.debian.org/testing/freebayes)
6
7--------
8
9## Overview
10
11[*freebayes*](http://arxiv.org/abs/1207.3907) is a
12[Bayesian](http://en.wikipedia.org/wiki/Bayesian_inference) genetic variant
13detector designed to find small polymorphisms, specifically SNPs
14(single-nucleotide polymorphisms), indels (insertions and deletions), MNPs
15(multi-nucleotide polymorphisms), and complex events (composite insertion and
16substitution events) smaller than the length of a short-read sequencing
17alignment.
18
19*freebayes* is haplotype-based, in the sense that it calls variants based on
20the literal sequences of reads aligned to a particular target, not their
21precise alignment.  This model is a straightforward generalization of previous
22ones (e.g. PolyBayes, samtools, GATK) which detect or report variants based on
23alignments.  This method avoids one of the core problems with alignment-based
24variant detection--- that identical sequences may have multiple possible
25alignments:
26
27<img src="https://github.com/freebayes/freebayes/raw/v1.3.0/paper/haplotype_calling.png" width=500/>
28
29*freebayes* uses short-read alignments
30([BAM](http://samtools.sourceforge.net/SAMv1.pdf) files with
31[Phred+33](http://en.wikipedia.org/wiki/Phred_quality_score) encoded quality
32scores, now standard) for any number of individuals from a population and a
33[reference genome](http://en.wikipedia.org/wiki/Reference_genome) (in
34[FASTA](http://en.wikipedia.org/wiki/FASTA_format) format)
35to determine the most-likely combination of genotypes for the population at
36each position in the reference.  It reports positions which it finds putatively
37polymorphic in variant call file ([VCF](http://www.1000genomes.org/node/101))
38format.  It can also use an input set of variants (VCF) as a source of prior
39information, and a copy number variant map (BED) to define non-uniform ploidy
40variation across the samples under analysis.
41
42## Citing freebayes
43
44A preprint [Haplotype-based variant detection from short-read sequencing](http://arxiv.org/abs/1207.3907) provides an overview of the
45statistical models used in freebayes.
46We ask that you cite this paper if you use freebayes in work that leads to publication.
47This preprint is used for documentation and citation.
48freebayes was never submitted for review, but has been used in over 1000 publications.
49
50Please use this citation format:
51
52Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. *arXiv preprint arXiv:1207.3907 [q-bio.GN]* 2012
53
54If possible, please also refer to the version number provided by freebayes when it is run without arguments or with the `--help` option.
55
56## Install
57
58freebayes is provided as a pre-built 64-bit static Linux binary as part of [releases](https://github.com/freebayes/freebayes/releases).
59
60Debian and Conda packages should work too, see the badges at the top
61of this page.
62
63To build freebayes from source check the
64[development](#Development) section below. It is important to get the full recursive
65git checkout and dependencies.
66
67## Support
68
69Please report any issues or questions to the [freebayes mailing list](https://groups.google.com/forum/#!forum/freebayes). Report bugs on the [freebayes issue tracker](https://github.com/freebayes/freebayes/issues)
70
71## Usage
72
73In its simplest operation, freebayes requires only two inputs: a FASTA reference sequence, and a BAM-format alignment file sorted by reference position.
74For instance:
75
76    freebayes -f ref.fa aln.bam >var.vcf
77
78... will produce a VCF file describing all SNPs, INDELs, and haplotype variants between the reference and aln.bam. The CRAM version is
79
80    freebayes -f ref.fa aln.cram >var.vcf
81
82Multiple BAM files may be given for joint calling.
83
84Typically, we might consider two additional parameters.
85GVCF output allows us to have coverage information about non-called sites, and we can enable it with `--gcvcf`.
86For performance reasons we may want to skip regions of extremely high coverage in the reference using the `--skip-limit` parameter `-g`.
87These can greatly increase runtime but do not produce meaningful results.
88For instance, if we wanted to exclude regions of 1000X coverage, we would run:
89
90    freebayes -f ref.fa aln.bam --gvcf -g 1000 >var.vcf
91
92For a description of available command-line options and their defaults, run:
93
94    freebayes --help
95
96## Examples
97
98Call variants assuming a diploid sample:
99
100    freebayes -f ref.fa aln.bam >var.vcf
101
102Call variants on only chrQ:
103
104    freebayes -f ref.fa -r chrQ aln.bam >var.vcf
105
106Call variants on only chrQ, from position 1000 to 2000:
107
108    freebayes -f ref.fa -r chrQ:1000-2000 aln.bam >var.vcf
109
110Require at least 5 supporting observations to consider a variant:
111
112    freebayes -f ref.fa -C 5 aln.bam >var.vcf
113
114Skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than 200:
115
116    freebayes -f ref.fa -g 200 aln.bam >var.vcf
117
118Use a different ploidy:
119
120    freebayes -f ref.fa -p 4 aln.bam >var.vcf
121
122Assume a pooled sample with a known number of genome copies.  Note that this
123means that each sample identified in the BAM file is assumed to have 32 genome
124copies.  When running with highh --ploidy settings, it may be required to set
125`--use-best-n-alleles` to a low number to limit memory usage.
126
127    freebayes -f ref.fa -p 32 --use-best-n-alleles 4 --pooled-discrete aln.bam >var.vcf
128
129Generate frequency-based calls for all variants passing input thresholds. You'd do
130this in the case that you didn't know the number of samples in the pool.
131
132    freebayes -f ref.fa -F 0.01 -C 1 --pooled-continuous aln.bam >var.vcf
133
134Use an input VCF (bgzipped + tabix indexed) to force calls at particular alleles:
135
136    freebayes -f ref.fa -@ in.vcf.gz aln.bam >var.vcf
137
138Generate long haplotype calls over known variants:
139
140    freebayes -f ref.fa --haplotype-basis-alleles in.vcf.gz \
141                        --haplotype-length 50 aln.bam
142
143Naive variant calling: simply annotate observation counts of SNPs and indels:
144
145    freebayes -f ref.fa --haplotype-length 0 --min-alternate-count 1 \
146        --min-alternate-fraction 0 --pooled-continuous --report-monomorphic >var.vcf
147
148Parallel operation (use 36 cores in this case):
149
150    freebayes-parallel <(fasta_generate_regions.py ref.fa.fai 100000) 36 \
151        -f ref.fa aln.bam >var.vcf
152
153Note that any of the above examples can be made parallel by using the
154scripts/freebayes-parallel script.  If you find freebayes to be slow, you
155should probably be running it in parallel using this script to run on a single
156host, or generating a series of scripts, one per region, and run them on a
157cluster. Be aware that the freebayes-parallel script contains calls to other programs  using relative paths from the scripts subdirectory; the easiest way to ensure a successful run is to invoke the freebayes-parallel script from within the scripts subdirectory.
158
159## Calling variants: from fastq to VCF
160
161You've sequenced some samples.  You have a reference genome or assembled set of
162contigs, and you'd like to determine reference-relative variants in your
163samples.  You can use freebayes to detect the variants, following these steps:
164
165* **Align** your reads to a suitable reference (e.g. with
166[bwa](http://bio-bwa.sourceforge.net/) or
167[MOSAIK](https://github.com/wanpinglee/MOSAIK))
168* Ensure your alignments have **read groups** attached so their sample may be
169identified by freebayes.  Aligners allow you to do this, but you can also use
170[bamaddrg](http://github.com/ekg/bamaddrg) to do so post-alignment.
171* **Sort** the alignments (e.g. [sambamba sort](https://github.com/biod/sambamba)).
172* **Mark duplicates**, for instance with [sambamba markdup](https://github.com/biod/sambamba) (if PCR was used in the preparation of your sequencing library)
173* ***Run freebayes*** on all your alignment data simultaneously, generating a
174VCF.  The default settings should work for most use cases, but if your samples
175are not diploid, set the `--ploidy` and adjust the `--min-alternate-fraction`
176suitably.
177* **Filter** the output e.g. using reported QUAL and/or depth (DP) or
178observation count (AO).
179* **Interpret** your results.
180* (possibly, **Iterate** the variant detection process in response to insight
181gained from your interpretation)
182
183freebayes emits a standard VCF 4.1 output stream.  This format is designed for the
184probabilistic description of allelic variants within a population of samples,
185but it is equally suited to describing the probability of variation in a single
186sample.
187
188Of primary interest to most users is the QUAL field, which estimates the
189probability that there is a polymorphism at the loci described by the record.
190In freebayes, this value can be understood as 1 - P(locus is homozygous given
191the data).  It is recommended that users use this value to filter their
192results, rather than accepting anything output by freebayes as ground truth.
193
194By default, records are output even if they have very low probability of
195variation, in expectation that the VCF will be filtered using tools such as
196[vcffilter](http://github.com/ekg/vcflib#vcffilter) in
197[vcflib](http://github.com/ekg/vcflib), which is also included in the
198repository under `vcflib/`.  For instance,
199
200    freebayes -f ref.fa aln.bam | vcffilter -f "QUAL > 20" >results.vcf
201
202removes any sites with estimated probability of not being polymorphic less than
203phred 20 (aka 0.01), or probability of polymorphism &gt; 0.99.
204
205In simulation, the [receiver-operator
206characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
207 (ROC) tends to have a very sharp inflection between Q1 and Q30, depending on
208input data characteristics, and a filter setting in this range should provide
209decent performance.  Users are encouraged to examine their output and both
210variants which are retained and those they filter out.  Most problems tend to
211occur in low-depth areas, and so users may wish to remove these as well, which
212can also be done by filtering on the DP flag.
213
214
215## Calling variants in a population
216
217freebayes is designed to be run on many individuals from the same population
218(e.g. many human individuals) simultaneously.  The algorithm exploits a neutral
219model of allele diffusion to impute most-confident genotypings
220across the entire population.  In practice, the discriminant power of the
221method will improve if you run multiple samples simultaneously.  In other
222words, if your
223study has multiple individuals, you should run freebayes against them at the
224same time.  This also ensures consistent reporting of information about
225evidence for all samples at any locus where any are apparently polymorphic.
226
227To call variants in a population of samples, each alignment must have a read
228group identifier attached to it (RG tag), and the header of the BAM file in
229which it resides must map the RG tags to sample names (SM).  Furthermore, read
230group IDs must be unique across all the files used in the analysis.  One read
231group cannot map to multiple samples.  The reason this is required is that
232freebayes operates on a virtually merged BAM stream provided by the BamTools
233API.  If merging the files in your analysis using bamtools merge would generate
234a file in which multiple samples map to the same RG, the files are not suitable
235for use in population calling, and they must be modified.
236
237Users may add RG tags to BAM files which were generated without this
238information by using (as mentioned in "Calling variants" above)
239[bamaddrg](http://github.com/ekg/bamaddrg).
240If you have many files corresponding to
241many individuals, add a unique read group and sample name to each, and then
242open them all simultaneously with freebayes.  The VCF output will have one
243column per sample in the input.
244
245
246## Performance tuning
247
248If you find freebayes to be slow, or use large amounts of memory, consider the
249following options:
250
251- Set `--use-best-n-alleles 4`: this will reduce the number of alleles that are
252  considered, which will decrease runtime at the cost of sensitivity to
253lower-frequency alleles at multiallelic loci.  Calculating site qualities
254requires O(samples\*genotypes) runtime, and the number of genotypes is
255exponential in ploidy and the number of alleles that are considered, so this is
256very important when working with high ploidy samples (and also
257`--pooled-discrete`). By default, freebayes puts no limit on this.
258
259- Remove `--genotype-qualities`: calculating genotype qualities requires
260  O(samples\*genotypes) memory.
261
262- Set higher input thresholds. Require that N reads in one sample support an
263  allele in order to consider it: `--min-alternate-count N`, or that the allele
264fraction in one sample is M: `--min-alternate-fraction M`. This will filter
265noisy alleles.  The defaults, `--min-alternate-count 2 --min-alternate-fraction
2660.2`, are most-suitable for diploid, moderate-to-high depth samples, and should
267be changed when working with different ploidy samples. Alternatively,
268`--min-alternate-qsum` can be used to set a specific quality sum, which may be
269more flexible than setting a hard count on the number of observations.
270
271
272## Observation filters and qualities
273
274### Input filters
275
276By default, freebayes doesn't
277
278freebayes may be configured to filter its input so as to ignore low-confidence alignments and alleles which are only supported by low-quality sequencing observations (see `--min-mapping-quality` and `--min-base-quality`).
279It also will only evaluate a position if at least one read has mapping quality of `--min-supporting-mapping-quality` and one allele has quality of at least `--min-supporting-base-quality`.
280
281Reads with more than a fixed number of high-quality mismatches can be excluded by specifying `--read-mismatch-limit`.
282This is meant as a workaround when mapping quality estimates are not appropriately calibrated.
283
284Reads marked as duplicates in the BAM file are ignored, but this can be disabled for testing purposes by providing `--use-duplicate-reads`.
285freebayes does not mark duplicates on its own, you must use another process to do this, such as that in [sambamba](https://github.com/biod/sambamba).
286
287### Observation thresholds
288
289As a guard against spurious variation caused by sequencing artifacts, positions are skipped when no more than `--min-alternate-count` or `--min-alternate-fraction` non-clonal observations of an alternate are found in one sample.
290These default to 2 and 0.05 respectively.
291The default setting of `--min-alternate-fraction 0.05` is suitable for diploid samples but may need to be changed for higher ploidy.
292
293### Allele type exclusion
294freebayes provides a few methods to ignore certain classes of allele, e.g.
295`--throw-away-indels-obs` and `--throw-awary-mnps-obs`.  Users are *strongly cautioned against using
296these*, because removing this information is very likely to reduce detection
297power.  To generate a report only including SNPs, use vcffilter post-call as
298such:
299
300    freebayes ... | vcffilter -f "TYPE = snp"
301
302### Normalizing variant representation
303
304If you wish to obtain a VCF that does not contain haplotype calls or complex alleles, first call with default parameters and then decompose the output with tools in vcflib, vt, vcf-tools, bcftools, GATK, or Picard.
305Here we use a tool in vcflib that normalizes the haplotype calls into pointwise SNPs and indels:
306
307    freebayes ... | vcfallelicprimitives -kg >calls.vcf
308
309Note that this is not done by default as it makes it difficult to determine which variant calls freebayes completed.
310The raw output faithfully describes exactly the calls that were made.
311
312### Observation qualities
313
314freebayes estimates observation quality using several simple heuristics based
315on manipulations of the phred-scaled base qualities:
316
317* For single-base observations, *mismatches* and *reference observations*: the
318un-adjusted base quality provided in the BAM alignment record.
319* For *insertions*: the mean quality of the bases inside of the putatively
320inserted sequence.
321* For *deletions*: the mean quality of the bases flanking the putatively
322deleted sequence.
323* For *haplotypes*: the mean quality of allele observations within the
324haplotype.
325
326By default, both base and mapping quality are into the reported site quality (QUAL in the VCF) and genotype quality (GQ, when supplying `--genotype-qualities`).
327This integration is driven by the "Effective Base Depth" metric first developed in [snpTools](http://www.hgsc.bcm.edu/software/snptools), which scales observation quality by mapping quality: *P(Obs|Genotype) ~ P(MappedCorrectly(Obs))P(SequencedCorrectly(Obs))*.
328Set `--standard-gls` to use the model described in the freebayes preprint.
329
330## Stream processing
331
332freebayes can read BAM from standard input `--stdin` instead of directly from
333files.  This allows the application of any number of streaming BAM filters and
334calibrators to its input.
335
336    bam_merger.sh | streaming_filter_or_process.sh | freebayes --stdin ...
337
338This pattern allows the adjustment of alignments without rewriting BAM files,
339which could be expensive depending on context and available storage.  A prime
340example of this would be graph-based realignment of reads to known variants as
341implemented in [glia](http://github.com/ekg/glia).
342
343Using this pattern, you can filter out reads with certain criteria using
344bamtools filter without having to modify the input BAM file.  You can also use
345the bamtools API to write your own custom filters in C++.  An example filter is
346bamfiltertech
347[src/bamfiltertech.cpp](http://github.com/freebayes/freebayes/blob/master/src/bamfilte
348rtech.cpp), which could be used to filter out
349technologies which have characteristic errors which may frustrate certain types
350of variant detection.
351
352## INDELs
353
354In principle, any gapped aligner which is sensitive to indels will
355produce satisfactory input for use by freebayes.  Due to potential ambiguity,
356indels are
357not parsed when they overlap the beginning or end of alignment boundaries.
358
359When calling indels, it is important to homogenize the positional distribution
360of insertions and deletions in the input by using left realignment.  This is
361now done automatically by freebayes, but the behavior can be turned off via
362`--dont-left-align-indels` flag.  You probably don't want to do this.
363
364Left realignment will place all indels in homopolymer and microsatellite
365repeats at the same position, provided that doing so does not introduce
366mismatches between the read and reference other than the indel.  This method
367computationally inexpensive and handles the most common classes of alignment
368inconsistency.
369
370## Haplotype calls
371
372As freebayes is haplotype-based, left-alignment is necessary only for the
373determination of candidate polymorphic loci.  Once such loci are determined,
374haplotype observations are extracted from reads where:
375
3761. putative variants lie within `--haplotype-length` bases of each other
377(default 3bp),
3782. the reference sequence has repeats (e.g. microsatellites or STRs are called
379as one haplotype),
3803. the haplotype which is called has Shannon entropy less than
381`--min-repeat-entropy`, which is off by default but can be set to ~1 for
382optimal genotyping of indels in lower-complexity sequence.
383
384After a haplotype window is determined by greedily expanding the window across
385overlapping haplotype observations, all reads overlapping the window are used
386to establish data likelihoods, *P(Observations|Genotype)*, for all haplotypes
387which have sufficient support to pass the input filters.
388
389Partial observations are considered to support those haplotypes which they
390could match exactly.  For expedience, only haplotypes which are contiguously
391observed by the reads are considered as putative alleles in this process.  This
392differs from other haplotype-based methods, such as
393[Platypus](http://www.well.ox.ac.uk/platypus), which consider all possible
394haplotypes composed of observed component alleles (SNPs, indels) in a given
395region when generating likelihoods.
396
397The primary adantages of this approach are conceptual simplicity and
398performance, and it is primarily limited in the case of short reads, an issue
399that is mitigated by increasing read lengths.  Also, a hybrid approach must be
400used to call haplotypes from high-error rate long reads.
401
402### Re-genotyping known variants and calling long haplotypes
403
404For longer reads with higher error rates, it is possible to generate long
405haplotypes in two passes over the data.  For instance, if we had very long
406reads (e.g. >10kb) at moderate depth and high error rate (>5%) such as might be
407produced by PacBio, we could do something like:
408
409    freebayes -f ref.fa aln.bam | vcffilter -f "QUAL > 20" >vars.vcf
410
411... thus generating candidate variants of suitable quality using the default
412detection window.  We can then use these as "basis alleles" for the observation
413of haplotypes, considering all other putative variants supported by the
414alignment to be sequencing errors:
415
416    freebayes -f ref.fa --haplotype-length 500 \
417        --haplotype-basis-alleles vars.vcf aln.bam >haps.vcf
418
419These steps should allow us to read long haplotypes directly from input data
420with high error rates.
421
422The high error rate means that beyond a small window each read will contain a
423completely different literal haplotype.  To a point, this property improves our
424signal to noise ratio and can effectively filter out sequencing errors at the
425point of the input filters, but it also decreases the effective observation
426depth will prevent the generation of any calls if a long `--haplotype-length`
427is combined with high a sequencing error rate.
428
429
430## Best practices and design philosophy
431
432freebayes follows the patterns suggested by the [Unix philosophy](https://en.wikipedia.org/wiki/Unix_philosophy), which promotes the development of simple, modular systems that perform a single function, and can be combined into more complex systems using stream processing of common interchange formats.
433
434freebayes incorporates a number of features in order to reduce the complexity of variant detection for researchers and developers:
435
436* **Indel realignment is accomplished internally** using a read-independent method, and issues resulting from discordant alignments are dramatically reducedy through the direct detection of haplotypes.
437* The need for **base quality recalibration is avoided** through the direct detection of haplotypes. Sequencing platform errors tend to cluster (e.g. at the ends of reads), and generate unique, non-repeating haplotypes at a given locus.
438* **Variant quality recalibration is avoided** by incorporating a number of metrics, such as read placement bias and allele balance, directly into the Bayesian model.  (Our upcoming publication will discuss this in more detail.)
439
440A minimal pre-processing pipeline similar to that described in "Calling variants" should be sufficient for most uses.
441For more information, please refer to a post by Brad Chapman [on minimal BAM preprocessing methods](http://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/).
442
443## Development
444
445To download freebayes, please use git to download the most recent development tree:
446
447    git clone --recursive https://github.com/freebayes/freebayes.git
448
449On Debian you'll need a gcc compiler and want packages:
450
451- bc
452- samtools
453- parallel
454- meson
455- ninja-build
456- libvcflib-tools
457- vcftools
458
459Build dependencies are listed in [guix.scm](./guix.scm) and
460[travis](.travis.yml). Builds have been tested with gcc 7 and clang 9.
461
462## Compilation
463
464Make sure to have dependencies installed and checkout the tree
465with `--recursive`.
466
467Freebayes can target AMD64 and ARM64 (with Neon extensions).
468
469Recently we added the meson build system which can be run with
470
471    meson build/ --buildtype debug
472
473or to setup with clang instead
474
475    env CXX=clang++ CC=clang CC_LD=lld meson build --buildtype debug
476
477Next compile and test in the ~build~ directory
478
479    cd build
480    ninja
481    ninja test
482
483Tests on ARM may be slow. If you get a TIMEOUT use a multiplier,
484e.g.
485
486    meson test -t 4 -C build/
487
488
489### Compile in a Guix container
490
491After checking out the repo with git recursive create a Guix
492container with all the build tools with
493
494    guix environment -C -l guix.scm
495
496See also [guix.scm](./guix.scm).
497