[![PyPI version](https://badge.fury.io/py/pyBigWig.svg)](https://badge.fury.io/py/pyBigWig) [![Travis-CI status](https://travis-ci.org/deeptools/pyBigWig.svg?branch=master)](https://travis-ci.org/dpryan79/pyBigWig.svg?branch=master) [![bioconda-badge](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg)](http://bioconda.github.io) [![DOI](https://zenodo.org/badge/doi/10.5281/zenodo.45238.svg)](http://dx.doi.org/10.5281/zenodo.45238)

# pyBigWig
A python extension, written in C, for quick access to bigBed files and access to and creation of bigWig files. This extension uses [libBigWig](https://github.com/dpryan79/libBigWig) for local and remote file access.

Table of Contents
=================

  * [Installation](#installation)
    * [Requirements](#requirements)
  * [Usage](#usage)
    * [Load the extension](#load-the-extension)
    * [Open a bigWig or bigBed file](#open-a-bigwig-or-bigbed-file)
    * [Determining the file type](#determining-the-file-type)
    * [Access the list of chromosomes and their lengths](#access-the-list-of-chromosomes-and-their-lengths)
    * [Print the header](#print-the-header)
    * [Compute summary information on a range](#compute-summary-information-on-a-range)
      * [A note on statistics and zoom levels](#a-note-on-statistics-and-zoom-levels)
    * [Retrieve values for individual bases in a range](#retrieve-values-for-individual-bases-in-a-range)
    * [Retrieve all intervals in a range](#retrieve-all-intervals-in-a-range)
    * [Retrieving bigBed entries](#retrieving-bigbed-entries)
    * [Add a header to a bigWig file](#add-a-header-to-a-bigwig-file)
    * [Adding entries to a bigWig file](#adding-entries-to-a-bigwig-file)
    * [Close a bigWig or bigBed file](#close-a-bigwig-or-bigbed-file)
  * [Numpy](#numpy)
  * [Remote file access](#remote-file-access)
  * [Empty files](#empty-files)
  * [A note on coordinates](#a-note-on-coordinates)
  * [Galaxy](#galaxy)

# Installation
You can install this extension from PyPI with pip:

    pip install pyBigWig

or with conda:

    conda install pybigwig -c conda-forge -c bioconda

## Requirements

The following non-python requirements must be installed:

 - libcurl (and the `curl-config` program)
 - zlib

The headers and libraries for these are required.

# Usage
Basic usage is as follows:

## Load the extension

    >>> import pyBigWig

## Open a bigWig or bigBed file

This will work if your working directory is the pyBigWig source code directory.

    >>> bw = pyBigWig.open("test/test.bw")

Note that if the file doesn't exist you'll see an error message and `None` will be returned. By default, all files are opened for reading and not writing. You can alter this by passing a mode containing `w`:

    >>> bw = pyBigWig.open("test/output.bw", "w")

Note that a file opened for writing can't be queried for its intervals or statistics; it can *only* be written to. If you open a file for writing, you will next need to add a header (see the section on this below).

Local and remote bigBed read access is also supported:

    >>> bb = pyBigWig.open("https://www.encodeproject.org/files/ENCFF001JBR/@@download/ENCFF001JBR.bigBed")

While you can specify a mode for bigBed files, it is ignored. The object returned by `pyBigWig.open()` is the same regardless of whether you're opening a bigWig or bigBed file.

## Determining the file type

Since bigWig and bigBed files can both be opened, it may be necessary to determine whether a given `bigWigFile` object points to a bigWig or bigBed file. To that end, one can use the `isBigWig()` and `isBigBed()` functions:

    >>> bw = pyBigWig.open("test/test.bw")
    >>> bw.isBigWig()
    True
    >>> bw.isBigBed()
    False

## Access the list of chromosomes and their lengths

`bigWigFile` objects contain a dictionary holding the chromosome lengths, which can be accessed with the `chroms()` accessor.

    >>> bw.chroms()
    dict_proxy({'1': 195471971L, '10': 130694993L})

You can also directly query a particular chromosome.

    >>> bw.chroms("1")
    195471971L

The lengths are stored as the "long" integer type, which is why there's an `L` suffix. If you specify a nonexistent chromosome then nothing is output.

    >>> bw.chroms("c")
    >>>

## Print the header

It's sometimes useful to print a bigWig's header. This is presented here as a python dictionary containing: the version (typically `4`), the number of zoom levels (`nLevels`), the number of bases described (`nBasesCovered`), the minimum value (`minVal`), the maximum value (`maxVal`), the sum of all values (`sumData`), and the sum of all squared values (`sumSquared`). The last two of these are needed for determining the mean and standard deviation.

    >>> bw.header()
    {'maxVal': 2L, 'sumData': 272L, 'minVal': 0L, 'version': 4L, 'sumSquared': 500L, 'nLevels': 1L, 'nBasesCovered': 154L}

Note that this is also possible for bigBed files and the same dictionary keys will be present. Entries such as `maxVal`, `sumData`, `minVal`, and `sumSquared` are then largely not meaningful.
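
The `sumData`, `sumSquared`, and `nBasesCovered` fields are enough to recover a file-wide mean and standard deviation without reading any intervals. The following is a minimal sketch, assuming the header is a plain dictionary with the keys listed above and that a sample (n - 1) variance is wanted. `header_mean_std` is a hypothetical helper and not part of pyBigWig.

```python
import math


def header_mean_std(header):
    """Return (mean, std) derived from a bigWig header dictionary."""
    n = header["nBasesCovered"]
    if n == 0:
        # An empty file has no covered bases, so nothing can be computed.
        return (float("nan"), float("nan"))
    mean = header["sumData"] / n
    if n < 2:
        return (mean, float("nan"))
    # Assumption: sample variance with an (n - 1) denominator.
    variance = (header["sumSquared"] - header["sumData"] ** 2 / n) / (n - 1)
    # Guard against tiny negative values caused by rounding.
    return (mean, math.sqrt(max(variance, 0.0)))


# Header values copied from the example output above.
header = {"maxVal": 2, "sumData": 272, "minVal": 0, "version": 4,
          "sumSquared": 500, "nLevels": 1, "nBasesCovered": 154}
mean, std = header_mean_std(header)
print(mean, std)
```

The result is only as precise as the header fields themselves, which are printed as integers in the example above.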

## Compute summary information on a range

bigWig files are used to store values associated with positions and ranges of them. Typically we want to quickly access the average value over a range, which is very simple:

    >>> bw.stats("1", 0, 3)
    [0.2000000054637591]

Suppose that, instead of the mean value, we wanted the maximum value:

    >>> bw.stats("1", 0, 3, type="max")
    [0.30000001192092896]

Other options are "min" (the minimum value), "coverage" (the fraction of bases covered), and "std" (the standard deviation of the values).

It's often the case that we would instead like to compute values for some number of evenly spaced bins in a given interval, which is also simple:

    >>> bw.stats("1", 99, 200, type="max", nBins=2)
    [1.399999976158142, 1.5]

`nBins` defaults to 1, just as `type` defaults to `mean`.

If the start and end positions are omitted then the entire chromosome is used:

    >>> bw.stats("1")
    [1.3351851569281683]

### A note on statistics and zoom levels

> A note to the lay reader: This section is rather technical and included only for the sake of completeness. In short, if you need exact mean/max/etc. summary values for an interval or intervals and a small trade-off in speed is acceptable, you should use the `exact=True` option in the `stats()` function.

By default, there are some unintuitive aspects to computing statistics on ranges in a bigWig file. The bigWig format was originally created in the context of genome browsers. There, computing exact summary statistics for a given interval is less important than quickly being able to compute an approximate statistic (after all, browsers need to be able to quickly display a number of contiguous intervals and support scrolling/zooming). Because of this, bigWig files contain not only interval-value associations, but also `sum of values`/`sum of squared values`/`minimum value`/`maximum value`/`number of bases covered` for equally sized bins of various sizes. These different sizes are referred to as "zoom levels". The smallest zoom level has bins that are 16 times the mean interval size in the file and each subsequent zoom level has bins 4 times larger than the previous. This methodology is used in Kent's tools and, therefore, likely used in almost every currently existing bigWig file.

When a bigWig file is queried for a summary statistic, the size of the interval is used to determine whether to use a zoom level and, if so, which one. The optimal zoom level is that which has the largest bins no more than half the width of the desired interval. If no such zoom level exists, the original intervals are instead used for the calculation.
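
The selection rule described above can be sketched in a few lines. This only illustrates the stated rule and is not the libBigWig implementation. The function names and the mean interval size of 10 are hypothetical.

```python
def zoom_bin_sizes(mean_interval_size, n_levels):
    """Bin sizes per zoom level: 16x the mean interval size, then 4x per level."""
    return [16 * mean_interval_size * 4 ** level for level in range(n_levels)]


def pick_zoom_bin_size(bin_sizes, start, end):
    """Return the largest bin size that is at most half the interval width.

    Returns None when no zoom level qualifies, meaning the original
    intervals would be used instead.
    """
    half_width = (end - start) / 2
    usable = [size for size in bin_sizes if size <= half_width]
    return max(usable) if usable else None


# Hypothetical file with a mean interval size of 10 bases and 4 zoom levels.
sizes = zoom_bin_sizes(mean_interval_size=10, n_levels=4)
print(sizes)                                    # [160, 640, 2560, 10240]
print(pick_zoom_bin_size(sizes, 89294, 91629))  # width 2335, so 640 is chosen
print(pick_zoom_bin_size(sizes, 0, 100))        # width 100, so None
```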

For the sake of consistency with other tools, pyBigWig adopts this same methodology. However, since this is (A) unintuitive and (B) undesirable in some applications, pyBigWig enables computation of exact summary statistics regardless of the interval size (i.e., it allows ignoring the zoom levels). This was originally proposed [here](https://github.com/dpryan79/pyBigWig/issues/12) and an example is below:

    >>> import pyBigWig
    >>> from numpy import mean
    >>> bw = pyBigWig.open("http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign75mer.bigWig")
    >>> bw.stats('chr1', 89294, 91629)
    [0.20120902053804418]
    >>> mean(bw.values('chr1', 89294, 91629))
    0.22213841940688142
    >>> bw.stats('chr1', 89294, 91629, exact=True)
    [0.22213841940688142]

## Retrieve values for individual bases in a range

While the `stats()` method **can** be used to retrieve the original values for each base (e.g., by setting `nBins` to the number of bases), it's preferable to instead use the `values()` accessor.

    >>> bw.values("1", 0, 3)
    [0.10000000149011612, 0.20000000298023224, 0.30000001192092896]

The list produced will always contain one value for every base in the range specified. If a particular base has no associated value in the bigWig file then the returned value will be `nan`.

    >>> bw.values("1", 0, 4)
    [0.10000000149011612, 0.20000000298023224, 0.30000001192092896, nan]
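
Because uncovered bases come back as `nan`, a plain `sum()` over the list also yields `nan`. A minimal sketch of summarizing such a list with only the standard library follows. The helper names are hypothetical and the input is the list shown above.

```python
import math


def mean_ignoring_nan(values):
    """Mean over covered bases only; nan if no base is covered."""
    covered = [v for v in values if not math.isnan(v)]
    return sum(covered) / len(covered) if covered else float("nan")


def fraction_covered(values):
    """Fraction of bases in the list that have a value."""
    if not values:
        return 0.0
    return sum(1 for v in values if not math.isnan(v)) / len(values)


# The list returned by bw.values("1", 0, 4) in the example above.
values = [0.10000000149011612, 0.20000000298023224,
          0.30000001192092896, float("nan")]
print(mean_ignoring_nan(values))
print(fraction_covered(values))
```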

## Retrieve all intervals in a range

Sometimes it's convenient to retrieve all entries overlapping some range. This can be done with the `intervals()` function:

    >>> bw.intervals("1", 0, 3)
    ((0, 1, 0.10000000149011612), (1, 2, 0.20000000298023224), (2, 3, 0.30000001192092896))

What's returned is a tuple of tuples, each containing the start position, the end position, and the value. Thus, the example above has values of `0.1`, `0.2`, and `0.3` at positions `0`, `1`, and `2`, respectively.

If the start and end position are omitted then all intervals on the chromosome specified are returned:

    >>> bw.intervals("1")
    ((0, 1, 0.10000000149011612), (1, 2, 0.20000000298023224), (2, 3, 0.30000001192092896), (100, 150, 1.399999976158142), (150, 151, 1.5))
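
The relationship between `intervals()` and `values()` can be made concrete by expanding interval tuples into one value per base. This is a sketch in plain Python, with `intervals_to_values` as a hypothetical helper. It assumes 0-based half-open coordinates and uses `nan` for uncovered bases.

```python
import math


def intervals_to_values(intervals, start, end):
    """Expand (start, end, value) tuples into per-base values for [start, end)."""
    values = [float("nan")] * (end - start)
    # `intervals or ()` also tolerates None being passed in.
    for ivl_start, ivl_end, value in intervals or ():
        # Clip each interval to the requested range before filling.
        for pos in range(max(ivl_start, start), min(ivl_end, end)):
            values[pos - start] = value
    return values


# The tuples returned by bw.intervals("1") in the example above.
intervals = ((0, 1, 0.10000000149011612), (1, 2, 0.20000000298023224),
             (2, 3, 0.30000001192092896), (100, 150, 1.399999976158142),
             (150, 151, 1.5))
print(intervals_to_values(intervals, 0, 4))
print(intervals_to_values(intervals, 148, 152))
```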

## Retrieving bigBed entries

As opposed to bigWig files, bigBed files hold entries, which are intervals with an associated string. You can access these entries using the `entries()` function:

    >>> bb = pyBigWig.open("https://www.encodeproject.org/files/ENCFF001JBR/@@download/ENCFF001JBR.bigBed")
    >>> bb.entries('chr1', 10000000, 10020000)
    [(10009333, 10009640, '61035\t130\t-\t0.026\t0.42\t404'), (10014007, 10014289, '61047\t136\t-\t0.029\t0.42\t404'), (10014373, 10024307, '61048\t630\t-\t5.420\t0.00\t2672399')]

The output is a list of entry tuples. The tuple elements are the `start` and `end` position of each entry, followed by its associated `string`. The string is returned exactly as it's held in the bigBed file, so parsing it is left to you. To determine what the various fields are in these strings, consult the SQL string:

    >>> bb.SQL()
    table RnaElements
    "BED6 + 3 scores for RNA Elements data"
        (
        string chrom;      "Reference sequence chromosome or scaffold"
        uint   chromStart; "Start position in chromosome"
        uint   chromEnd;   "End position in chromosome"
        string name;       "Name of item"
        uint   score;      "Normalized score from 0-1000"
        char[1] strand;    "+ or - or . for unknown"
        float level;       "Expression level such as RPKM or FPKM. Set to -1 for no data."
        float signif;      "Statistical significance such as IDR. Set to -1 for no data."
        uint score2;       "Additional measurement/count e.g. number of reads. Set to 0 for no data."
        )

Note that the first three fields in the SQL string (`chrom`, `chromStart`, and `chromEnd`) are not part of the returned string.
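
Since parsing the string is left to the caller, one possible approach for this particular file is to split on tabs and name the fields after the SQL definition. The sketch below is specific to the `RnaElements` layout shown above. `RnaElement` and `parse_entry` are hypothetical names.

```python
from collections import namedtuple

# Field names follow the SQL string above, minus chrom/chromStart/chromEnd,
# plus the start and end positions that entries() returns separately.
RnaElement = namedtuple(
    "RnaElement", "start end name score strand level signif score2")


def parse_entry(entry):
    """Turn one (start, end, string) tuple into a typed RnaElement."""
    start, end, rest = entry
    name, score, strand, level, signif, score2 = rest.split("\t")
    return RnaElement(start, end, name, int(score), strand,
                      float(level), float(signif), int(score2))


# First entry from the example output above.
entry = (10009333, 10009640, "61035\t130\t-\t0.026\t0.42\t404")
parsed = parse_entry(entry)
print(parsed.name, parsed.score, parsed.strand, parsed.level)
```

Other bigBed files define different extra fields, so the field list would need to be adapted to whatever `SQL()` reports.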

If you only need to know where entries are and not their associated values, you can save memory by additionally specifying `withString=False` in `entries()`:

    >>> bb.entries('chr1', 10000000, 10020000, withString=False)
    [(10009333, 10009640), (10014007, 10014289), (10014373, 10024307)]

## Add a header to a bigWig file

If you've opened a file for writing then you'll need to give it a header before you can add any entries. The header contains all of the chromosomes, **in order**, and their sizes. If your genome has two chromosomes, chr1 and chr2, of lengths 1 and 1.5 million bases, then the following would add an appropriate header:

    >>> bw.addHeader([("chr1", 1000000), ("chr2", 1500000)])

bigWig headers are case-sensitive, so `chr1` and `Chr1` are different. Likewise, `1` and `chr1` are not the same, so you can't mix Ensembl and UCSC chromosome names. After adding a header, you can then add entries.

By default, up to 10 "zoom levels" are constructed for bigWig files. You can change this default number with the `maxZooms` optional argument. A common use of this is to create a bigWig file that simply holds intervals and no zoom levels:

    >>> bw.addHeader([("chr1", 1000000), ("chr2", 1500000)], maxZooms=0)

If you set `maxZooms=0`, please note that IGV and many other tools WILL NOT WORK as they assume that at least one zoom level will be present. You are advised to use the default unless you do not expect the bigWig files to be used by other packages.
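
The list passed to `addHeader()` is often derived from a table of chromosome sizes. The following is a minimal sketch, assuming a two-column, whitespace-separated text of names and lengths that is already in the desired order. Both the input format and the helper name `header_from_chrom_sizes` are assumptions, not part of pyBigWig.

```python
def header_from_chrom_sizes(text):
    """Build an ordered list of (name, length) tuples from chrom-sizes text."""
    header = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, length = line.split()[:2]
        # Names are kept verbatim because headers are case-sensitive.
        header.append((name, int(length)))
    return header


sizes_text = "chr1\t1000000\nchr2\t1500000\n"
header = header_from_chrom_sizes(sizes_text)
print(header)  # [('chr1', 1000000), ('chr2', 1500000)]
```

The resulting list has the same shape as the argument used in the `addHeader()` calls above.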

## Adding entries to a bigWig file

Assuming you've opened a file for writing and added a header, you can then add entries. Note that the entries **must** be added in order, as bigWig files always contain ordered intervals. There are three formats that bigWig files can use internally to store entries. The most commonly observed format is identical to a [bedGraph](https://genome.ucsc.edu/goldenpath/help/bedgraph.html) file:

    chr1	0	100	0.0
    chr1	100	120	1.0
    chr1	125	126	200.0

These entries would be added as follows:

    >>> bw.addEntries(["chr1", "chr1", "chr1"], [0, 100, 125], ends=[100, 120, 126], values=[0.0, 1.0, 200.0])

Each entry occupies 12 bytes before compression.

The second format uses a fixed span, but a variable step size between entries. These can be represented in a [wiggle](http://genome.ucsc.edu/goldenpath/help/wiggle.html) file as:

    variableStep chrom=chr1 span=20
    500	-2.0
    600	150.0
    635	25.0

The above entries describe (1-based) positions 501-520, 601-620 and 636-655. These would be added as follows:

    >>> bw.addEntries("chr1", [500, 600, 635], values=[-2.0, 150.0, 25.0], span=20)

Each entry of this type occupies 8 bytes before compression.

The final format uses a fixed step and span for each entry, corresponding to the fixedStep [wiggle format](http://genome.ucsc.edu/goldenpath/help/wiggle.html):

    fixedStep chrom=chr1 step=30 span=20
    -5.0
    -20.0
    25.0

The above entries describe (1-based) bases 901-920, 931-950 and 961-980 and would be added as follows:

    >>> bw.addEntries("chr1", 900, values=[-5.0, -20.0, 25.0], span=20, step=30)

Each entry of this type occupies 4 bytes before compression.
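
The span and step arithmetic behind the last two formats can be checked with a short sketch that expands the arguments into explicit intervals. The helper names are hypothetical and the code only reproduces the position ranges quoted above.

```python
def variable_step_intervals(starts, span):
    """0-based half-open intervals for starts that share a fixed span."""
    return [(s, s + span) for s in starts]


def fixed_step_intervals(start, count, span, step):
    """0-based half-open intervals for `count` entries spaced `step` apart."""
    return [(start + i * step, start + i * step + span) for i in range(count)]


def to_one_based(interval):
    """Convert a 0-based half-open interval to 1-based inclusive positions."""
    start, end = interval
    return (start + 1, end)


variable = variable_step_intervals([500, 600, 635], span=20)
fixed = fixed_step_intervals(900, count=3, span=20, step=30)
print([to_one_based(i) for i in variable])  # [(501, 520), (601, 620), (636, 655)]
print([to_one_based(i) for i in fixed])     # [(901, 920), (931, 950), (961, 980)]
```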

Note that pyBigWig will try to prevent you from adding entries in an incorrect order. This, however, requires additional overhead. Should that not be acceptable, you can simply specify `validate=False` when adding entries:

    >>> bw.addEntries(["chr1", "chr1", "chr1"], [100, 0, 125], ends=[120, 5, 126], values=[0.0, 1.0, 200.0], validate=False)

You're obviously then responsible for ensuring that you **do not** add entries out of order. The resulting files would otherwise largely not be usable.
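
If validation is switched off, the ordering can still be checked once up front in plain Python. The sketch below is not pyBigWig's internal check. It assumes entries must follow the header's chromosome order and must not overlap, and `entries_in_order` is a hypothetical helper.

```python
def entries_in_order(chroms, starts, ends, chrom_order):
    """Return True if entries follow the header order without overlapping."""
    rank = {name: i for i, name in enumerate(chrom_order)}
    prev_rank, prev_end = -1, 0
    for chrom, start, end in zip(chroms, starts, ends):
        # Reject unknown chromosomes and empty or negative intervals.
        if chrom not in rank or start < 0 or end <= start:
            return False
        current = rank[chrom]
        # Reject a step back to an earlier chromosome or position.
        if current < prev_rank or (current == prev_rank and start < prev_end):
            return False
        prev_rank, prev_end = current, end
    return True


order = ["chr1", "chr2"]
print(entries_in_order(["chr1"] * 3, [0, 100, 125], [100, 120, 126], order))  # True
print(entries_in_order(["chr1"] * 3, [100, 0, 125], [120, 5, 126], order))    # False
```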

## Close a bigWig or bigBed file

A file can be closed with a simple `bw.close()`, as is commonly done with other file types. For files opened for writing, closing a file writes any buffered entries to disk, constructs and writes the file index, and constructs zoom levels. Consequently, this can take a bit of time.
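
To make sure `close()` runs even when an error occurs midway, the standard library's `contextlib.closing` can wrap any object that has a `close()` method. The following is a sketch under that assumption. `chromosome_means` and its `opener` parameter are hypothetical, with `pyBigWig.open` being the intended opener in practice.

```python
from contextlib import closing


def chromosome_means(opener, path):
    """Open `path`, compute the mean of every chromosome, and always close."""
    bw = opener(path)
    if bw is None:
        # As noted earlier, a file that can't be opened yields None.
        raise IOError("could not open %s" % path)
    with closing(bw):
        # stats() with only a chromosome name covers the whole chromosome.
        return {chrom: bw.stats(chrom)[0] for chrom in bw.chroms()}


# Intended use (requires pyBigWig and the test file):
# chromosome_means(pyBigWig.open, "test/test.bw")
```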

# Numpy

As of version 0.3.0, pyBigWig supports input of coordinates using numpy integers and vectors in some functions **if numpy was installed prior to installing pyBigWig**. You can determine whether pyBigWig was installed with numpy support by checking the `numpy` accessor:

    >>> import pyBigWig
    >>> pyBigWig.numpy
    1

If `pyBigWig.numpy` is `1`, then pyBigWig was compiled with numpy support. This means that `addEntries()` can accept numpy coordinates:

    >>> import pyBigWig
    >>> import numpy as np
    >>> bw = pyBigWig.open("/tmp/delete.bw", "w")
    >>> bw.addHeader([("1", 1000)], maxZooms=0)
    >>> chroms = np.array(["1"] * 10)
    >>> starts = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=np.int64)
    >>> ends = np.array([5, 15, 25, 35, 45, 55, 65, 75, 85, 95], dtype=np.int64)
    >>> values0 = np.array(np.random.random_sample(10), dtype=np.float64)
    >>> bw.addEntries(chroms, starts, ends=ends, values=values0)
    >>> bw.close()

Additionally, `values()` can directly output a numpy vector:

    >>> bw = pyBigWig.open("/tmp/delete.bw")
    >>> bw.values('1', 0, 10, numpy=True)
    [ 0.74336642  0.74336642  0.74336642  0.74336642  0.74336642         nan
         nan         nan         nan         nan]
    >>> type(bw.values('1', 0, 10, numpy=True))
    <type 'numpy.ndarray'>

# Remote file access

If you do not have curl installed, pyBigWig will be installed without the ability to access remote files. You can determine if you will be able to access remote files with `pyBigWig.remote`. If that returns 1, then you can access remote files. If it returns 0 then you can't.
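
Both build-time flags, `pyBigWig.numpy` and `pyBigWig.remote`, can be inspected together before relying on either feature. The following is a small sketch in which `build_features` is a hypothetical helper. The import is guarded so the snippet also runs where pyBigWig is not installed.

```python
try:
    import pyBigWig
except ImportError:
    pyBigWig = None  # pyBigWig is not installed in this environment


def build_features(module):
    """Report the optional features a pyBigWig-like module was built with."""
    return {
        # A missing attribute is treated the same as a value of 0.
        "numpy": bool(getattr(module, "numpy", 0)),
        "remote": bool(getattr(module, "remote", 0)),
    }


if pyBigWig is not None:
    print(build_features(pyBigWig))
```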

# Empty files

As of version 0.3.5, pyBigWig is able to read and write bigWig files lacking entries. Please note that such files are generally not compatible with other programs, since there's no definition of how a bigWig file with no entries should look. For such a file, the `intervals()` accessor will return `None`, the `stats()` function will return a list of `None` of the desired length, and `values()` will return `[]` (an empty list). This should generally allow programs utilizing pyBigWig to continue without issue.

For those wishing to mimic the functionality of pyBigWig/libBigWig in this regard, please note that it looks at the number of bases covered (as reported in the file header) to check for "empty" files.

# A note on coordinates

Wiggle, bigWig, and bigBed files use 0-based half-open coordinates, which are also used by this extension. So to access the value for the first base on `chr1`, one would specify the starting position as `0` and the end position as `1`. Similarly, bases 100 to 115 would have a start of `99` and an end of `115`. This is simply for the sake of consistency with the underlying bigWig file and may change in the future.
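
The conversion from 1-based inclusive positions, as commonly used when talking about "base 100", to the 0-based half-open form expected here is a one-liner. A sketch with a hypothetical helper name:

```python
def one_based_to_half_open(first, last):
    """Convert 1-based inclusive positions to a 0-based half-open (start, end)."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    # Only the start shifts; the inclusive last base equals the exclusive end.
    return (first - 1, last)


print(one_based_to_half_open(1, 1))      # (0, 1), the first base
print(one_based_to_half_open(100, 115))  # (99, 115)
```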

# Galaxy

pyBigWig is also available as a package in [Galaxy](http://www.usegalaxy.org). You can find it in the toolshed and the [IUC](https://wiki.galaxyproject.org/IUC) is currently hosting the XML definition of this on [github](https://github.com/galaxyproject/tools-iuc/tree/master/packages/package_python_2_7_10_pybigwig_0_2_8).