1###############
2General usage
3###############
4
5=======================
6Supported file formats
7=======================
8
9----------------------
10BED format
11----------------------
As described on the UCSC Genome Browser website (see link below), the browser extensible data (BED) format is a concise and
flexible way to represent genomic features and annotations. The BED format description supports up to
12 columns, but only the first 3 are required by the UCSC browser, the Galaxy browser, and
bedtools. bedtools accepts the "BED12" format (that is, all 12 fields listed below).
However, only intersectBed, coverageBed, genomeCoverageBed, and bamToBed obey the BED12
"blocks" when computing overlaps, etc., and only when the **"-split"** option is given. All other tools ignore the last six columns
when comparing features and instead use the entire span (start to end)
of the BED12 entry. The last six columns are still reported
in the output of all comparisons.
21
22The file description below is modified from: http://genome.ucsc.edu/FAQ/FAQformat#format1.
23
241. **chrom** - The name of the chromosome on which the genome feature exists.
25
26  - *Any string can be used*. For example, "chr1", "III", "myChrom", "contig1112.23".
27  - *This column is required*.
28
292. **start** - The zero-based starting position of the feature in the chromosome.
30
31 - *The first base in a chromosome is numbered 0*.
 - *The start position in each BED feature is therefore interpreted to be 1 greater than the start position listed in the feature. For example, start=9, end=20 is interpreted to span bases 10 through 20, inclusive*.
33 - *This column is required*.
34
353. **end** - The one-based ending position of the feature in the chromosome.
36
37 - *The end position in each BED feature is one-based. See example above*.
38 - *This column is required*.
39
404. **name** - Defines the name of the BED feature.
41
42 - *Any string can be used*. For example, "LINE", "Exon3", "HWIEAS_0001:3:1:0:266#0/1", or "my_Feature".
43 - *This column is optional*.
44
455. **score** - The UCSC definition requires that a BED score range from 0 to 1000, inclusive. However, bedtools allows any string to be stored in this field in order to allow greater flexibility in annotation features. For example, strings allow scientific notation for p-values, mean enrichment values, etc. It should be noted that this flexibility could prevent such annotations from being correctly displayed on the UCSC browser.
46
47 - *Any string can be used*. For example, 7.31E-05 (p-value), 0.33456 (mean enrichment value), "up", "down", etc.
48 - *This column is optional*.
49
506. **strand** - Defines the strand - either '+' or '-'.
51
52 - *This column is optional*.
53
547. **thickStart** - The starting position at which the feature is drawn thickly.
55
56 - *Allowed yet ignored by bedtools*.
57
588. **thickEnd** - The ending position at which the feature is drawn thickly.
59
60 - *Allowed yet ignored by bedtools*.
61
629. **itemRgb** - An RGB value of the form R,G,B (e.g. 255,0,0).
63
64 - *Allowed yet ignored by bedtools*.
65
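10. **blockCount** - The number of blocks (exons) in the BED line.

 - *Allowed, but ignored by bedtools unless a tool is run with the "-split" option (see above)*.

11. **blockSizes** - A comma-separated list of the block sizes.

 - *Allowed, but ignored by bedtools unless a tool is run with the "-split" option (see above)*.

12. **blockStarts** - A comma-separated list of block starts.

 - *Allowed, but ignored by bedtools unless a tool is run with the "-split" option (see above)*.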
76
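The coordinate convention described for **start** and **end** can be checked with a short sketch. The Python below is illustrative only and is not part of bedtools; the names ``parse_bed3`` and ``one_based_span`` are hypothetical. It splits a tab-delimited BED line and reports the one-based, inclusive span, along with the feature length, which under this convention is simply ``end - start``:

.. code-block:: python

  def parse_bed3(line):
      """Return (chrom, start, end) from one tab-delimited BED line.

      Only the three required columns are interpreted; any further
      columns are ignored by this sketch.
      """
      fields = line.rstrip("\n").split("\t")
      if len(fields) < 3:
          raise ValueError("a BED line needs at least chrom, start and end")
      chrom, start, end = fields[0], int(fields[1]), int(fields[2])
      if start < 0 or end < start:
          raise ValueError("expected 0 <= start <= end")
      return chrom, start, end


  def one_based_span(start, end):
      """Return the first and last base of a BED feature, numbered from 1."""
      return start + 1, end


  chrom, start, end = parse_bed3("chr1\t9\t20")
  first, last = one_based_span(start, end)
  print(chrom, first, last, end - start)  # prints: chr1 10 20 11
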
77
78bedtools requires that all BED input files (and input received from stdin) are **tab-delimited**. The following types of BED files are supported by bedtools:
79
80
811.  **BED3**: A BED file where each feature is described by **chrom**, **start**, and **end**.
82
  For example: ``chr1  11873  14409``
84
852.  **BED4**: A BED file where each feature is described by **chrom**, **start**, **end**, and **name**.
86
87  For example: ``chr1  11873  14409  uc001aaa.3``
88
893.  **BED5**: A BED file where each feature is described by **chrom**, **start**, **end**, **name**, and **score**.
90
91  For example: ``chr1 11873 14409 uc001aaa.3 0``
92
934.  **BED6**: A BED file where each feature is described by **chrom**, **start**, **end**, **name**, **score**, and **strand**.
94
95  For example: ``chr1 11873 14409 uc001aaa.3 0 +``
96
975.  **BED12**: A BED file where each feature is described by all twelve columns listed above.
98
99    For example: ``chr1 11873 14409 uc001aaa.3 0 + 11873 11873 0 3 354,109,1189, 0,739,1347,``
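
To make the BED12 "blocks" concrete, the sketch below expands the **blockSizes** and **blockStarts** columns of the BED12 example above into absolute block intervals. It assumes the UCSC convention that each block start is an offset relative to **start**. The helper name ``bed12_blocks`` is hypothetical, and this is not a description of how bedtools is implemented:

.. code-block:: python

  def bed12_blocks(line):
      """Return absolute (start, end) pairs for each block of a BED12 line."""
      fields = line.rstrip("\n").split("\t")
      if len(fields) < 12:
          raise ValueError("expected 12 tab-delimited columns")
      chrom_start = int(fields[1])
      block_count = int(fields[9])
      # The lists may end with a trailing comma, so skip empty items.
      sizes = [int(x) for x in fields[10].split(",") if x]
      offsets = [int(x) for x in fields[11].split(",") if x]
      if len(sizes) != block_count or len(offsets) != block_count:
          raise ValueError("blockCount disagrees with blockSizes/blockStarts")
      # Assumption: each block start is an offset from the feature start.
      return [(chrom_start + off, chrom_start + off + size)
              for off, size in zip(offsets, sizes)]


  # The BED12 example from above, joined with tabs.
  bed12 = "\t".join([
      "chr1", "11873", "14409", "uc001aaa.3", "0", "+",
      "11873", "11873", "0", "3", "354,109,1189,", "0,739,1347,",
  ])
  for block_start, block_end in bed12_blocks(bed12):
      print(block_start, block_end)
  # prints:
  # 11873 12227
  # 12612 12721
  # 13220 14409

Note that the last block ends at 14409, the **end** of the whole feature.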
100
101----------------------
102BEDPE format
103----------------------
104We have defined a new file format, the browser extensible data paired-end (BEDPE) format, in order to concisely describe disjoint genome features,
105such as structural variations or paired-end sequence alignments. We chose to define a new format
106because the existing "blocked" BED format (a.k.a. BED12) does not allow inter-chromosomal feature
107definitions. In addition, BED12 only has one strand field, which is insufficient for paired-end sequence
108alignments, especially when studying structural variation.
109
110The BEDPE format is described below. The description is modified from: http://genome.ucsc.edu/FAQ/FAQformat#format1.
111
1121. **chrom1** - The name of the chromosome on which the **first** end of the feature exists.
113
114 - *Any string can be used*. For example, "chr1", "III", "myChrom", "contig1112.23".
115 - *This column is required*.
116 - *Use "." for unknown*.
117
1182. **start1** - The zero-based starting position of the **first** end of the feature on **chrom1**.
119
120 - *The first base in a chromosome is numbered 0*.
 - *As with BED format, the start position in each BEDPE feature is therefore interpreted to be 1 greater than the start position listed in the feature*.
 - *This column is required*.
122 - *Use -1 for unknown*.
123
3. **end1** - The one-based ending position of the **first** end of the feature on **chrom1**.
125
126 - *The end position in each BEDPE feature is one-based*.
127 - *This column is required*.
128 - *Use -1 for unknown*.
129
1304. **chrom2** - The name of the chromosome on which the **second** end of the feature exists.
131
132 - *Any string can be used*. For example, "chr1", "III", "myChrom", "contig1112.23".
133 - *This column is required*.
134 - *Use "." for unknown*.
135
1365. **start2** - The zero-based starting position of the **second** end of the feature on **chrom2**.
137
138 - *The first base in a chromosome is numbered 0*.
 - *As with BED format, the start position in each BEDPE feature is therefore interpreted to be 1 greater than the start position listed in the feature*.
 - *This column is required*.
140 - *Use -1 for unknown*.
141
1426. **end2** - The one-based ending position of the **second** end of the feature on **chrom2**.
143
144 - *The end position in each BEDPE feature is one-based*.
145 - *This column is required*.
146 - *Use -1 for unknown*.
147
1487. **name** - Defines the name of the BEDPE feature.
149
150 - *Any string can be used*. For example, "LINE", "Exon3", "HWIEAS_0001:3:1:0:266#0/1", or "my_Feature".
151 - *This column is optional*.
152
1538. **score** - The UCSC definition requires that a BED score range from 0 to 1000, inclusive. *However, bedtools allows any string to be stored in this field in order to allow greater flexibility in annotation features*. For example, strings allow scientific notation for p-values, mean enrichment values, etc. It should be noted that this flexibility could prevent such annotations from being correctly displayed on the UCSC browser.
154
155 - *Any string can be used*. For example, 7.31E-05 (p-value), 0.33456 (mean enrichment value), "up", "down", etc.
156 - *This column is optional*.
157
1589. **strand1** - Defines the strand for the first end of the feature. Either '+' or '-'.
159
160 - *This column is optional*.
161 - *Use "." for unknown*.
162
16310. **strand2** - Defines the strand for the second end of the feature. Either '+' or '-'.
164
165 - *This column is optional*.
166 - *Use "." for unknown*.
167
16811. **Any number of additional, user-defined fields** - bedtools allows one to add as many additional fields to the normal, 10-column BEDPE format as necessary. These columns are merely "passed through" **pairToBed** and **pairToPair** and are not part of any analysis. One would use these additional columns to add extra information (e.g., edit distance for each end of an alignment, or "deletion", "inversion", etc.) to each BEDPE feature.
169
170 - *These additional columns are optional*.
171
172
Entries from a typical BEDPE file:
174::
175
176  chr1  100   200   chr5  5000  5100  bedpe_example1  30   +  -
177  chr9  1000  5000  chr9  3000  3800  bedpe_example2  100  +  -
178
179
180Entries from a BEDPE file with two custom fields added to each record:
181::
182
183  chr1  10    20    chr5  50    60    a1     30       +    -  0  1
184  chr9  30    40    chr9  80    90    a2     100      +    -  2  1
185
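The BEDPE layout can be sketched as a small record type. The Python below is only an illustration: the names ``Bedpe``, ``parse_bedpe`` and ``is_interchromosomal`` are hypothetical, and filling omitted optional columns with "." is an assumption of this sketch rather than documented bedtools behavior. It maps the "." and -1 placeholders to ``None``, keeps the score as text, and carries any user-defined columns through untouched:

.. code-block:: python

  from dataclasses import dataclass, field
  from typing import List, Optional


  @dataclass
  class Bedpe:
      chrom1: Optional[str]
      start1: Optional[int]
      end1: Optional[int]
      chrom2: Optional[str]
      start2: Optional[int]
      end2: Optional[int]
      name: str = "."
      score: str = "."      # kept as text, since any string is allowed
      strand1: str = "."
      strand2: str = "."
      extra: List[str] = field(default_factory=list)  # user-defined columns


  def _chrom(value):
      """Map the "." placeholder to None."""
      return None if value == "." else value


  def _pos(value):
      """Map the -1 placeholder to None."""
      number = int(value)
      return None if number == -1 else number


  def parse_bedpe(line):
      """Parse one tab-delimited BEDPE line into a Bedpe record."""
      f = line.rstrip("\n").split("\t")
      if len(f) < 6:
          raise ValueError("BEDPE needs at least six tab-delimited columns")
      # Assumption of this sketch: omitted optional columns default to ".".
      name, score, strand1, strand2 = f[6:10] + ["."] * (10 - len(f))
      return Bedpe(_chrom(f[0]), _pos(f[1]), _pos(f[2]),
                   _chrom(f[3]), _pos(f[4]), _pos(f[5]),
                   name, score, strand1, strand2, f[10:])


  def is_interchromosomal(record):
      """True when both ends are known and lie on different chromosomes."""
      return (record.chrom1 is not None and record.chrom2 is not None
              and record.chrom1 != record.chrom2)


  first = parse_bedpe("chr1\t100\t200\tchr5\t5000\t5100\tbedpe_example1\t30\t+\t-")
  custom = parse_bedpe("chr9\t30\t40\tchr9\t80\t90\ta2\t100\t+\t-\t2\t1")
  print(is_interchromosomal(first), is_interchromosomal(custom), custom.extra)
  # prints: True False ['2', '1']
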
186
187
188----------------------
189GFF format
190----------------------
191The GFF format is described on the Sanger Institute's website (http://www.sanger.ac.uk/resources/software/gff/spec.html). The GFF description below is modified from the definition at this URL. All nine columns in the GFF format description are required by bedtools.
192
1931. **seqname** - The name of the sequence (e.g. chromosome) on which the feature exists.
194
195 - *Any string can be used*. For example, "chr1", "III", "myChrom", "contig1112.23".
196 - *This column is required*.
197
1982. **source** - The source of this feature. This field will normally be used to indicate the program making the prediction, or if it comes from public database annotation, or is experimentally verified, etc.
199
200 - *This column is required*.
201
2023. **feature** - The feature type name. Equivalent to BED's **name** field.
203
204 - *Any string can be used*. For example, "exon", etc.
205 - *This column is required*.
206
2074. **start** - The one-based starting position of feature on **seqname**.
208
209 - *This column is required*.
 - *bedtools accounts for the fact that GFF uses a one-based start position whereas BED uses a zero-based start position*.
211
2125. **end** - The one-based ending position of feature on **seqname**.
213
214 - *This column is required*.
215
2166. **score** - A score assigned to the GFF feature. Like BED format, bedtools allows any string to be stored in this field in order to allow greater flexibility in annotation features. We note that this differs from the GFF definition in the interest of flexibility.
217
218 - *This column is required*.
219
2207. **strand** - Defines the strand. Use '+', '-' or '.'
221
222 - *This column is required*.
223
2248. **frame** -  The frame of the coding sequence. Use '0', '1', '2', or '.'.
225
226 - *This column is required*.
227
2289. **attribute** - Taken from http://www.sanger.ac.uk/resources/software/gff/spec.html: From version 2 onwards, the attribute field must have an tag value structure following the syntax used within objects in a .ace file, flattened onto one line by semicolon separators. Free text values must be quoted with double quotes. *Note: all non-printing characters in such free text value strings (e.g. newlines, tabs, control characters, etc) must be explicitly represented by their C (UNIX) style backslash-escaped representation (e.g. newlines as '\n', tabs as '\t')*. As in ACEDB, multiple values can follow a specific tag. The aim is to establish consistent use of particular tags, corresponding to an underlying implied ACEDB model if you want to think that way (but acedb is not required).
229
230 - *This column is required*.
231
Entries from an example GFF file:

::

  seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003
  dJ102G20 GD_mRNA coding_exon 7105 7201 . - 2 Sequence "dJ102G20.C1.1"
239
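The difference between the GFF and BED coordinate systems comes down to one subtraction. The sketch below converts the first GFF entry shown above into a six-column BED line, using **feature** as the BED **name**. The function name ``gff_to_bed6`` is hypothetical, and the code only illustrates the arithmetic; it is not taken from bedtools:

.. code-block:: python

  def gff_to_bed6(line):
      """Convert one nine-column GFF line to a six-column BED line."""
      fields = line.rstrip("\n").split("\t")
      if len(fields) != 9:
          raise ValueError("expected nine tab-delimited GFF columns")
      seqname, _source, feature, start, end, score, strand, _frame, _attr = fields
      # GFF start is one-based; BED start is zero-based. The end is unchanged.
      bed_start = int(start) - 1
      return "\t".join([seqname, str(bed_start), str(int(end)),
                        feature, score, strand])


  gff = "\t".join([
      "seq1", "BLASTX", "similarity", "101", "235", "87.1", "+", "0",
      'Target "HBA_HUMAN" 11 55 ; E_value 0.0003',
  ])
  print(gff_to_bed6(gff).split("\t"))
  # prints: ['seq1', '100', '235', 'similarity', '87.1', '+']
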
240
241
242------------------------
243*Genome* file format
244------------------------
Some of the bedtools (e.g., genomeCoverageBed, complementBed, slopBed) need to know the size of
the chromosomes of the organism on which your BED files are based. When using the UCSC Genome
Browser, Ensembl, or Galaxy, you typically indicate which species/genome build you are
working with. For bedtools, you do this by creating a "genome" file, which simply lists the names of
the chromosomes (or scaffolds, etc.) and their sizes (in base pairs).
250
251
252Genome files must be **tab-delimited** and are structured as follows (this is an example for *C. elegans*):
253
254::
255
256  chrI  15072421
257  chrII 15279323
258  ...
259  chrX  17718854
260  chrM  13794
261
262bedtools includes pre-defined genome files for human and mouse in the **/genomes** directory included
263in the bedtools distribution.
264
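As a rough illustration of why chromosome sizes matter, the sketch below reads genome-file lines into a dictionary and uses it to keep an extended interval within the bounds of its chromosome. The names ``read_genome`` and ``clamp_to_chrom`` are hypothetical, and the clamping rule is an assumption chosen for the example rather than a description of any particular bedtools command:

.. code-block:: python

  def read_genome(lines):
      """Build a {chromosome: size} mapping from genome-file lines."""
      sizes = {}
      for line in lines:
          line = line.rstrip("\n")
          if not line:
              continue  # skip blank lines
          chrom, size = line.split("\t")[:2]
          sizes[chrom] = int(size)
      return sizes


  def clamp_to_chrom(chrom, start, end, sizes):
      """Trim an interval so it stays within [0, chromosome size]."""
      return max(0, start), min(end, sizes[chrom])


  genome = read_genome(["chrI\t15072421\n", "chrM\t13794\n"])
  # Extending chrM:100-13700 by 200 bp on each side overruns both ends.
  print(clamp_to_chrom("chrM", 100 - 200, 13700 + 200, genome))
  # prints: (0, 13794)
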
265
266----------------------
267SAM/BAM format
268----------------------
The SAM/BAM format is a powerful and widely used format for storing sequence alignment data (see
http://samtools.sourceforge.net/ for more details). It has quickly become the standard format to which
most DNA sequence alignment programs write their output. Currently, the following bedtools
support input in BAM format: ``intersect``, ``window``, ``coverage``, ``genomecov``,
``pairtobed``, ``bamtobed``. BAM support in bedtools allows one to, among other things,
compare sequence alignments to annotations, refine alignment datasets, screen for potential mutations,
and compute aligned sequence coverage.
276
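Comparing alignments to annotations requires expressing each alignment as an interval. The sketch below shows one way to do that for a single line of SAM text (BAM is the binary equivalent and needs a dedicated library to read). It relies on two facts from the SAM specification: POS is one-based, and the CIGAR operations M, D, N, = and X consume reference bases. The function name ``sam_to_bed_interval`` is hypothetical, and this is not the implementation used by ``bamtobed``:

.. code-block:: python

  import re

  CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
  CONSUMES_REFERENCE = set("MDN=X")


  def sam_to_bed_interval(sam_line):
      """Return (chrom, start, end) in BED coordinates, or None if unmapped."""
      fields = sam_line.rstrip("\n").split("\t")
      rname, pos, cigar = fields[2], int(fields[3]), fields[5]
      if rname == "*" or pos == 0 or cigar == "*":
          return None
      # Sum the lengths of the operations that advance along the reference.
      ref_length = sum(int(n) for n, op in CIGAR_OP.findall(cigar)
                       if op in CONSUMES_REFERENCE)
      start = pos - 1  # one-based SAM POS to zero-based BED start
      return rname, start, start + ref_length


  sam = "\t".join([
      "read1", "0", "chr1", "101", "60", "10M2D5M3S", "*", "0", "0",
      "ACGTACGTACGTACGTAC", "I" * 18,
  ])
  print(sam_to_bed_interval(sam))  # prints: ('chr1', 100, 117)
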
277
278
279----------------------
280VCF format
281----------------------
282The Variant Call Format (VCF) was conceived as part of the 1000 Genomes Project as a standardized
283means to report genetic variation calls from SNP, INDEL and structural variant detection programs
284(see http://www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcf4.0 for details).
bedtools now supports the latest version of this format (i.e., Version 4.0). As a result, bedtools can
286be used to compare genetic variation calls with other genomic features.
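
To compare a variant call with interval-based features, the call first has to be placed in the same coordinate system. The sketch below is one possible convention, stated as an assumption: the interval starts at the one-based VCF POS converted to a zero-based start and spans the length of the REF allele. Structural variants that record their extent in the INFO column are not handled. The function name ``vcf_to_bed_interval`` is hypothetical, and the code does not claim to mirror what bedtools does internally:

.. code-block:: python

  def vcf_to_bed_interval(vcf_line):
      """Return (chrom, start, end) for a VCF record, or None for headers."""
      if vcf_line.startswith("#"):
          return None  # header or meta-information line
      fields = vcf_line.rstrip("\n").split("\t")
      chrom, pos, ref = fields[0], int(fields[1]), fields[3]
      start = pos - 1  # one-based VCF POS to zero-based start
      # Assumption: the interval covers the REF allele.
      return chrom, start, start + len(ref)


  snp = "chr1\t1000\trs1\tA\tG\t50\tPASS\t."
  deletion = "chr1\t2000\t.\tACG\tA\t50\tPASS\t."
  print(vcf_to_bed_interval(snp), vcf_to_bed_interval(deletion))
  # prints: ('chr1', 999, 1000) ('chr1', 1999, 2002)
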
287