1|Travis| |PyPI| |Coverage| |Depsy|
2
3Description
4-----------
5
6Samtools provides a function "faidx" (FAsta InDeX), which creates a
7small flat index file ".fai" allowing for fast random access to any
8subsequence in the indexed FASTA file, while loading a minimal amount of the
file into memory. This Python module implements pure Python classes for
10indexing, retrieval, and in-place modification of FASTA files using a samtools
11compatible index. The pyfaidx module is API compatible with the `pygr`_ seqdb module.
12A command-line script "`faidx`_" is installed alongside the pyfaidx module, and
13facilitates complex manipulation of FASTA files without any programming knowledge.
14
15.. _`pygr`: https://github.com/cjlee112/pygr
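
To illustrate why such an index makes random access cheap, here is a minimal sketch of how the five ``.fai`` columns (name, sequence length, byte offset of the first base, bases per line, bytes per line) can be computed in a single pass over a line-wrapped FASTA file. It is for illustration only and is not pyfaidx's implementation; ``build_fai`` and the demo records are hypothetical.

.. code:: python

    import os
    import tempfile

    def build_fai(path):
        """Return {name: (rlen, offset, lenc, lenb)} for a line-wrapped FASTA file."""
        index = {}
        name = None
        pos = 0  # byte position of the start of the current line
        with open(path, "rb") as handle:
            for line in handle:
                if line.startswith(b">"):
                    # Names are truncated at the first whitespace.
                    name = line[1:].split()[0].decode()
                    # The first base sits right after the header line.
                    index[name] = [0, pos + len(line), 0, 0]
                elif name is not None:
                    bases = len(line.rstrip(b"\r\n"))
                    rec = index[name]
                    if rec[2] == 0:  # first sequence line defines the wrapping
                        rec[2], rec[3] = bases, len(line)
                    rec[0] += bases
                pos += len(line)
        return {key: tuple(value) for key, value in index.items()}

    # Tiny made-up FASTA file, written to a temporary location.
    with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as tmp:
        tmp.write(b">seq1 demo\nACGTACGT\nACG\n>seq2\nTTTT\n")
    index = build_fai(tmp.name)
    os.remove(tmp.name)
    print(index)  # {'seq1': (11, 11, 8, 9), 'seq2': (4, 30, 4, 5)}

With these four numbers per record, the byte position of any base can be computed without reading the rest of the file.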
16
17If you use pyfaidx in your publication, please cite:
18
19`Shirley MD`_, `Ma Z`_, `Pedersen B`_, `Wheelan S`_. `Efficient "pythonic" access to FASTA files using pyfaidx <https://dx.doi.org/10.7287/peerj.preprints.970v1>`_. PeerJ PrePrints 3:e1196. 2015.
20
21.. _`Shirley MD`: http://github.com/mdshw5
22.. _`Ma Z`: http://github.com/azalea
23.. _`Pedersen B`: http://github.com/brentp
24.. _`Wheelan S`: http://github.com/swheelan
25
26Installation
27------------
28
This package is tested under Linux, macOS, and Windows using Python 3.2-3.4, 2.7, 2.6, and PyPy, and is available from PyPI:
30
31::
32
33 pip install pyfaidx # add --user if you don't have root
34
35or download a `release <https://github.com/mdshw5/pyfaidx/releases>`_ and:
36
37::
38
39 python setup.py install
40
If you install with ``pip install --user``, add ``/home/$(whoami)/.local/bin`` to your ``$PATH`` so that you can run the ``faidx`` script.
42
43Usage
44-----
45
46.. code:: python
47
48 >>> from pyfaidx import Fasta
49 >>> genes = Fasta('tests/data/genes.fasta')
50 >>> genes
51 Fasta("tests/data/genes.fasta") # set strict_bounds=True for bounds checking
52
Acts like a dictionary:
54
55.. code:: python
56
57 >>> genes.keys()
58 ('AB821309.1', 'KF435150.1', 'KF435149.1', 'NR_104216.1', 'NR_104215.1', 'NR_104212.1', 'NM_001282545.1', 'NM_001282543.1', 'NM_000465.3', 'NM_001282549.1', 'NM_001282548.1', 'XM_005249645.1', 'XM_005249644.1', 'XM_005249643.1', 'XM_005249642.1', 'XM_005265508.1', 'XM_005265507.1', 'XR_241081.1', 'XR_241080.1', 'XR_241079.1')
59
60 >>> genes['NM_001282543.1'][200:230]
61 >NM_001282543.1:201-230
62 CTCGTTCCGCGCCCGCCATGGAACCGGATG
63
64 >>> genes['NM_001282543.1'][200:230].seq
65 'CTCGTTCCGCGCCCGCCATGGAACCGGATG'
66
67 >>> genes['NM_001282543.1'][200:230].name
68 'NM_001282543.1'
69
70 # Start attributes are 1-based
71 >>> genes['NM_001282543.1'][200:230].start
72 201
73
74 # End attributes are 0-based
75 >>> genes['NM_001282543.1'][200:230].end
76 230
77
78 >>> genes['NM_001282543.1'][200:230].fancy_name
79 'NM_001282543.1:201-230'
80
81 >>> len(genes['NM_001282543.1'])
82 5466
83
84Note that start and end coordinates of Sequence objects are [1, 0]. This can be changed to [0, 0] by passing ``one_based_attributes=False`` to ``Fasta`` or ``Faidx``. This argument only affects the ``Sequence .start/.end`` attributes, and has no effect on slicing coordinates.
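
The relationship between the two conventions can be sketched with a pair of small helpers. These functions are hypothetical and only illustrate the arithmetic; they are not part of pyfaidx.

.. code:: python

    def slice_to_one_based(start, stop):
        """Map a 0-based, half-open Python slice to 1-based, inclusive coordinates."""
        return start + 1, stop

    def one_based_to_slice(start, end):
        """Map 1-based, inclusive coordinates back to a 0-based, half-open slice."""
        return start - 1, end

    print(slice_to_one_based(200, 230))  # (201, 230)
    print(one_based_to_slice(201, 230))  # (200, 230)

Only the start changes between the two systems, which is why the slice ``[200:230]`` reports ``start`` 201 and ``end`` 230.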
85
86Indexes like a list:
87
88.. code:: python
89
90 >>> genes[0][:50]
91 >AB821309.1:1-50
92 ATGGTCAGCTGGGGTCGTTTCATCTGCCTGGTCGTGGTCACCATGGCAAC
93
94Slices just like a string:
95
96.. code:: python
97
98 >>> genes['NM_001282543.1'][200:230][:10]
99 >NM_001282543.1:201-210
100 CTCGTTCCGC
101
102 >>> genes['NM_001282543.1'][200:230][::-1]
103 >NM_001282543.1:230-201
104 GTAGGCCAAGGTACCGCCCGCGCCTTGCTC
105
106 >>> genes['NM_001282543.1'][200:230][::3]
107 >NM_001282543.1:201-230
108 CGCCCCTACA
109
110 >>> genes['NM_001282543.1'][:]
111 >NM_001282543.1:1-5466
112 CCCCGCCCCT........
113
114- Slicing start and end coordinates are 0-based, just like Python sequences.
115
Complements and reverse complements just like DNA:
117
118.. code:: python
119
120 >>> genes['NM_001282543.1'][200:230].complement
121 >NM_001282543.1 (complement):201-230
122 GAGCAAGGCGCGGGCGGTACCTTGGCCTAC
123
124 >>> genes['NM_001282543.1'][200:230].reverse
125 >NM_001282543.1:230-201
126 GTAGGCCAAGGTACCGCCCGCGCCTTGCTC
127
128 >>> -genes['NM_001282543.1'][200:230]
129 >NM_001282543.1 (complement):230-201
130 CATCCGGTTCCATGGCGGGCGCGGAACGAG
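
For readers who want to check these results on plain strings, the same three operations can be sketched with the standard library alone. This is an illustrative stand-in, not pyfaidx's code, and it only translates A, C, G and T (other characters pass through unchanged).

.. code:: python

    COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

    def complement(seq):
        """Swap each base for its Watson-Crick partner."""
        return seq.translate(COMPLEMENT)

    def reverse_complement(seq):
        """Complement the sequence and read it in the opposite direction."""
        return complement(seq)[::-1]

    seq = "CTCGTTCCGCGCCCGCCATGGAACCGGATG"
    print(complement(seq))          # GAGCAAGGCGCGGGCGGTACCTTGGCCTAC
    print(seq[::-1])                # GTAGGCCAAGGTACCGCCCGCGCCTTGCTC
    print(reverse_complement(seq))  # CATCCGGTTCCATGGCGGGCGCGGAACGAG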
131
132``Fasta`` objects can also be accessed using method calls:
133
134.. code:: python
135
136 >>> genes.get_seq('NM_001282543.1', 201, 210)
137 >NM_001282543.1:201-210
138 CTCGTTCCGC
139
140 >>> genes.get_seq('NM_001282543.1', 201, 210, rc=True)
141 >NM_001282543.1 (complement):210-201
142 GCGGAACGAG
143
144Spliced sequences can be retrieved from a list of [start, end] coordinates:
145**TODO** update this section
146
147.. code:: python
148
149 # new in v0.5.1
    >>> segments = [[1, 10], [50, 70]]
151 >>> genes.get_spliced_seq('NM_001282543.1', segments)
152 >gi|543583786|ref|NM_001282543.1|:1-70
153 CCCCGCCCCTGGTTTCGAGTCGCTGGCCTGC
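
The output above is consistent with each segment being read as 1-based, inclusive coordinates and the pieces being joined in order. A minimal sketch of that interpretation on a plain string follows; ``splice`` is a hypothetical helper, and the input is the first 70 bases of ``NM_001282543.1`` as printed elsewhere in this document.

.. code:: python

    def splice(sequence, segments):
        """Join 1-based, inclusive [start, end] segments of a string."""
        return "".join(sequence[start - 1:end] for start, end in segments)

    first_70 = ("CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAG"
                "AGCTTGGCCGGTTTCGAGTCGCTGGCCTGC")
    spliced = splice(first_70, [[1, 10], [50, 70]])
    print(spliced)  # CCCCGCCCCTGGTTTCGAGTCGCTGGCCTGC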
154
155.. _keyfn:
156
157Custom key functions provide cleaner access:
158
159.. code:: python
160
161 >>> from pyfaidx import Fasta
162 >>> genes = Fasta('tests/data/genes.fasta', key_function = lambda x: x.split('.')[0])
163 >>> genes.keys()
164 dict_keys(['NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])
165 >>> genes['NR_104212'][:10]
166 >NR_104212:1-10
167 CCCCGCCCCT
168
169You can specify a character to split names on, which will generate additional entries:
170
171.. code:: python
172
173 >>> from pyfaidx import Fasta
174 >>> genes = Fasta('tests/data/genes.fasta', split_char='.', duplicate_action="first") # default duplicate_action="stop"
175 >>> genes.keys()
176 dict_keys(['.1', 'NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])
177
If your ``key_function`` or ``split_char`` generates duplicate entries, you can choose what action to take:
179
180.. code:: python
181
182 # new in v0.4.9
183 >>> genes = Fasta('tests/data/genes.fasta', split_char="|", duplicate_action="longest")
184 >>> genes.keys()
185 dict_keys(['gi', '563317589', 'dbj', 'AB821309.1', '', '557361099', 'gb', 'KF435150.1', '557361097', 'KF435149.1', '543583796', 'ref', 'NR_104216.1', '543583795', 'NR_104215.1', '543583794', 'NR_104212.1', '543583788', 'NM_001282545.1', '543583786', 'NM_001282543.1', '543583785', 'NM_000465.3', '543583740', 'NM_001282549.1', '543583738', 'NM_001282548.1', '530384540', 'XM_005249645.1', '530384538', 'XM_005249644.1', '530384536', 'XM_005249643.1', '530384534', 'XM_005249642.1', '530373237','XM_005265508.1', '530373235', 'XM_005265507.1', '530364726', 'XR_241081.1', '530364725', 'XR_241080.1', '530364724', 'XR_241079.1'])
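
How the different actions might choose between colliding records can be sketched as follows. The semantics here are assumed from the option names (``stop``, ``first``, ``last``, ``longest``, ``shortest``) and may differ from pyfaidx's behaviour in detail; ``resolve_duplicates`` is a hypothetical helper.

.. code:: python

    def resolve_duplicates(records, duplicate_action="stop"):
        """Pick one (name, length) per key from (key, name, length) tuples in file order."""
        chosen = {}
        for key, name, length in records:
            if key not in chosen:
                chosen[key] = (name, length)
            elif duplicate_action == "stop":
                raise ValueError("duplicate key: %s" % key)
            elif duplicate_action == "last":
                chosen[key] = (name, length)
            elif duplicate_action == "longest" and length > chosen[key][1]:
                chosen[key] = (name, length)
            elif duplicate_action == "shortest" and length < chosen[key][1]:
                chosen[key] = (name, length)
            # "first" (and any other value, in this sketch) keeps the existing entry
        return chosen

    # Two records that both produce the key "gb" when split on "|".
    records = [("gb", "KF435150.1", 481), ("gb", "KF435149.1", 642)]
    for action in ("first", "last", "longest", "shortest"):
        print(action, resolve_duplicates(records, action)["gb"][0])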
186
187Filter functions (returning True) limit the index:
188
189.. code:: python
190
191 # new in v0.3.8
192 >>> from pyfaidx import Fasta
193 >>> genes = Fasta('tests/data/genes.fasta', filt_function = lambda x: x[0] == 'N')
194 >>> genes.keys()
195 dict_keys(['NR_104212', 'NM_001282543', 'NR_104216', 'NR_104215', 'NM_001282549', 'NM_000465', 'NM_001282545', 'NM_001282548'])
196 >>> genes['XM_005249644']
197 KeyError: XM_005249644 not in tests/data/genes.fasta.
198
199Or just get a Python string:
200
201.. code:: python
202
203 >>> from pyfaidx import Fasta
204 >>> genes = Fasta('tests/data/genes.fasta', as_raw=True)
205 >>> genes
206 Fasta("tests/data/genes.fasta", as_raw=True)
207
208 >>> genes['NM_001282543.1'][200:230]
209 CTCGTTCCGCGCCCGCCATGGAACCGGATG
210
You can make sure that you always receive an uppercase sequence, even if your FASTA file contains lowercase characters:
212
213.. code:: python
214
215 >>> from pyfaidx import Fasta
216 >>> reference = Fasta('tests/data/genes.fasta.lower', sequence_always_upper=True)
217 >>> reference['gi|557361099|gb|KF435150.1|'][1:70]
218
219 >gi|557361099|gb|KF435150.1|:2-70
220 TGACATCATTTTCCACCTCTGCTCAGTGTTCAACATCTGACAGTGCTTGCAGGATCTCTCCTGGACAAA
221
222
223You can also perform line-based iteration, receiving the sequence lines as they appear in the FASTA file:
224
225.. code:: python
226
227 >>> from pyfaidx import Fasta
228 >>> genes = Fasta('tests/data/genes.fasta')
229 >>> for line in genes['NM_001282543.1']:
230 ... print(line)
231 CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC
232 AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
233 CGATGCCGGATAATCGGCAGCCGAGGAACCGGCAGCCGAGGATCCGCTCCGGGAACGAGCCTCGTTCCGC
234 ...
235
236Sequence names are truncated on any whitespace. This is a limitation of the indexing strategy. However, full names can be recovered:
237
238.. code:: python
239
240 # new in v0.3.7
241 >>> from pyfaidx import Fasta
242 >>> genes = Fasta('tests/data/genes.fasta')
243 >>> for record in genes:
244 ... print(record.name)
245 ... print(record.long_name)
246 ...
247 gi|563317589|dbj|AB821309.1|
248 gi|563317589|dbj|AB821309.1| Homo sapiens FGFR2-AHCYL1 mRNA for FGFR2-AHCYL1 fusion kinase protein, complete cds
249 gi|557361099|gb|KF435150.1|
250 gi|557361099|gb|KF435150.1| Homo sapiens MDM4 protein variant Y (MDM4) mRNA, complete cds, alternatively spliced
251 gi|557361097|gb|KF435149.1|
252 gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds
253 ...
254
255 # new in v0.4.9
256 >>> from pyfaidx import Fasta
257 >>> genes = Fasta('tests/data/genes.fasta', read_long_names=True)
258 >>> for record in genes:
259 ... print(record.name)
260 ...
261 gi|563317589|dbj|AB821309.1| Homo sapiens FGFR2-AHCYL1 mRNA for FGFR2-AHCYL1 fusion kinase protein, complete cds
262 gi|557361099|gb|KF435150.1| Homo sapiens MDM4 protein variant Y (MDM4) mRNA, complete cds, alternatively spliced
263 gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds
264
265Records can be accessed efficiently as numpy arrays:
266
267.. code:: python
268
269 # new in v0.5.4
270 >>> from pyfaidx import Fasta
271 >>> import numpy as np
272 >>> genes = Fasta('tests/data/genes.fasta')
273 >>> np.asarray(genes['NM_001282543.1'])
274 array(['C', 'C', 'C', ..., 'A', 'A', 'A'], dtype='|S1')
275
Sequences can be buffered in memory using a read-ahead buffer
for fast sequential access:
278
279.. code:: python
280
281 >>> from timeit import timeit
282 >>> fetch = "genes['NM_001282543.1'][200:230]"
283 >>> read_ahead = "import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta', read_ahead=10000)"
284 >>> no_read_ahead = "import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta')"
285 >>> string_slicing = "genes = {}; genes['NM_001282543.1'] = 'N'*10000"
286
287 >>> timeit(fetch, no_read_ahead, number=10000)
288 0.2204863309962093
289 >>> timeit(fetch, read_ahead, number=10000)
290 0.1121859749982832
291 >>> timeit(fetch, string_slicing, number=10000)
292 0.0033553699977346696
293
In the timings above, read-ahead buffering roughly halves the runtime for sequential accesses to buffered regions.
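
The general idea of a read-ahead buffer can be sketched without any file I/O. The class below is a simplified illustration, not pyfaidx's buffer: ``ReadAheadSequence`` and its ``misses`` counter are hypothetical, and the "disk" is just a string.

.. code:: python

    class ReadAheadSequence:
        """Serve slices from an in-memory window, refilling it on a miss."""

        def __init__(self, read_range, read_ahead):
            self.read_range = read_range  # callable(start, end) -> str, e.g. a disk read
            self.read_ahead = read_ahead
            self.buf_start = 0
            self.buf = ""
            self.misses = 0

        def fetch(self, start, end):
            """Return the 0-based, half-open slice [start, end)."""
            buf_end = self.buf_start + len(self.buf)
            if not (self.buf_start <= start and end <= buf_end):
                # Miss: read the request plus read_ahead extra bases in one go.
                self.misses += 1
                self.buf_start = start
                self.buf = self.read_range(start, end + self.read_ahead)
            offset = start - self.buf_start
            return self.buf[offset:offset + (end - start)]

    backing = "ACGT" * 2500  # stand-in for a sequence stored on disk
    seq = ReadAheadSequence(lambda s, e: backing[s:e], read_ahead=1000)
    windows = [seq.fetch(i, i + 30) for i in range(0, 3000, 30)]
    print(len(windows), seq.misses)  # 100 3

In this sketch, 100 sequential requests trigger only three reads of the backing store. Random access patterns would benefit far less.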
295
296.. role:: red
297
If you want to modify the contents of your FASTA file in-place, you can use the ``mutable`` argument.
Any portion of the ``FastaRecord`` can be replaced with a string of equal length.
300:red:`Warning`: *This will change the contents of your file immediately and permanently:*
301
302.. code:: python
303
304 >>> genes = Fasta('tests/data/genes.fasta', mutable=True)
305 >>> type(genes['NM_001282543.1'])
306 <class 'pyfaidx.MutableFastaRecord'>
307
308 >>> genes['NM_001282543.1'][:10]
309 >NM_001282543.1:1-10
310 CCCCGCCCCT
311 >>> genes['NM_001282543.1'][:10] = 'NNNNNNNNNN'
312 >>> genes['NM_001282543.1'][:15]
313 >NM_001282543.1:1-15
314 NNNNNNNNNNCTGGC
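
The equal-length requirement makes sense when you consider what an in-place edit looks like at the file level: bytes are overwritten where they are, so the file size and every byte offset in the index stay valid. A minimal sketch on a throwaway temporary file, under the assumption that the edit stays within one sequence line (``overwrite_bases`` is a hypothetical helper, not pyfaidx's code):

.. code:: python

    import os
    import tempfile

    def overwrite_bases(path, byte_offset, new_bases):
        """Replace len(new_bases) bytes at byte_offset without changing the file size."""
        with open(path, "r+b") as handle:
            handle.seek(byte_offset)
            old = handle.read(len(new_bases))
            if len(old) != len(new_bases) or b"\n" in old:
                raise ValueError("sketch only supports edits within one sequence line")
            handle.seek(byte_offset)
            handle.write(new_bases)

    with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as tmp:
        tmp.write(b">demo\nCCCCGCCCCTCTGGC\n")
    overwrite_bases(tmp.name, len(b">demo\n"), b"N" * 10)
    with open(tmp.name, "rb") as handle:
        edited = handle.read()
    os.remove(tmp.name)
    print(edited)  # b'>demo\nNNNNNNNNNNCTGGC\n'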
315
316The FastaVariant class provides a way to integrate single nucleotide variant calls to generate a consensus sequence.
317
318.. code:: python
319
320 # new in v0.4.0
321 >>> consensus = FastaVariant('tests/data/chr22.fasta', 'tests/data/chr22.vcf.gz', het=True, hom=True)
322 RuntimeWarning: Using sample NA06984 genotypes.
323
324 >>> consensus['22'].variant_sites
325 (16042793, 21833121, 29153196, 29187373, 29187448, 29194610, 29821295, 29821332, 29993842, 32330460, 32352284)
326
327 >>> consensus['22'][16042790:16042800]
328 >22:16042791-16042800
329 TCGTAGGACA
330
331 >>> Fasta('tests/data/chr22.fasta')['22'][16042790:16042800]
332 >22:16042791-16042800
333 TCATAGGACA
334
335 >>> consensus = FastaVariant('tests/data/chr22.fasta', 'tests/data/chr22.vcf.gz', sample='NA06984', het=True, hom=True, call_filter='GT == "0/1"')
336 >>> consensus['22'].variant_sites
337 (16042793, 29187373, 29187448, 29194610, 29821332)
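
Conceptually, building the consensus amounts to substituting alternate alleles at the variant positions. The sketch below shows that substitution on the ten bases printed above, using the single variant site that falls in the window; ``apply_snvs`` is a hypothetical helper and does no VCF parsing or genotype filtering.

.. code:: python

    def apply_snvs(reference, region_start, variants):
        """Substitute alternate bases into a reference string.

        region_start is the 1-based position of reference[0];
        variants maps 1-based positions to single alternate bases.
        """
        bases = list(reference)
        for pos, alt in variants.items():
            i = pos - region_start
            if 0 <= i < len(bases):  # ignore sites outside the window
                bases[i] = alt
        return "".join(bases)

    consensus_window = apply_snvs("TCATAGGACA", 16042791, {16042793: "G"})
    print(consensus_window)  # TCGTAGGACA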
338
339.. _faidx:
340
pyfaidx also provides a command-line script:
342
CLI script: faidx
344~~~~~~~~~~~~~~~~~
345
346.. code:: bash
347
348 Fetch sequences from FASTA. If no regions are specified, all entries in the
349 input file are returned. Input FASTA file must be consistently line-wrapped,
350 and line wrapping of output is based on input line lengths.
351
352 positional arguments:
353 fasta FASTA file
354 regions space separated regions of sequence to fetch e.g.
355 chr1:1-1000
356
357 optional arguments:
358 -h, --help show this help message and exit
359 -b BED, --bed BED bed file of regions
360 -o OUT, --out OUT output file name (default: stdout)
361 -i {bed,chromsizes,nucleotide,transposed}, --transform {bed,chromsizes,nucleotide,transposed} transform the requested regions into another format. default: None
362 -c, --complement complement the sequence. default: False
363 -r, --reverse reverse the sequence. default: False
364 -a SIZE_RANGE, --size-range SIZE_RANGE
365 selected sequences are in the size range [low, high]. example: 1,1000 default: None
366 -n, --no-names omit sequence names from output. default: False
367 -f, --full-names output full names including description. default: False
368 -x, --split-files write each region to a separate file (names are derived from regions)
369 -l, --lazy fill in --default-seq for missing ranges. default: False
370 -s DEFAULT_SEQ, --default-seq DEFAULT_SEQ
371 default base for missing positions and masking. default: None
372 -d DELIMITER, --delimiter DELIMITER
373 delimiter for splitting names to multiple values (duplicate names will be discarded). default: None
374 -e HEADER_FUNCTION, --header-function HEADER_FUNCTION
375 python function to modify header lines e.g: "lambda x: x.split("|")[0]". default: lambda x: x.split()[0]
376 -u {stop,first,last,longest,shortest}, --duplicates-action {stop,first,last,longest,shortest}
377 entry to take when duplicate sequence names are encountered. default: stop
378 -g REGEX, --regex REGEX
379 selected sequences are those matching regular expression. default: .*
380 -v, --invert-match selected sequences are those not matching 'regions' argument. default: False
381 -m, --mask-with-default-seq
382 mask the FASTA file using --default-seq default: False
383 -M, --mask-by-case mask the FASTA file by changing to lowercase. default: False
384 -e HEADER_FUNCTION, --header-function HEADER_FUNCTION
385 python function to modify header lines e.g: "lambda x: x.split("|")[0]". default: None
386 --no-rebuild do not rebuild the .fai index even if it is out of date. default: False
387 --version print pyfaidx version number
388
389Examples:
390
391.. code:: bash
392
393 $ faidx tests/data/genes.fasta NM_001282543.1:201-210 NM_001282543.1:300-320
394 >NM_001282543.1:201-210
395 CTCGTTCCGC
396 >NM_001282543.1:300-320
397 GTAATTGTGTAAGTGACTGCA
398
399 $ faidx --full-names tests/data/genes.fasta NM_001282543.1:201-210
400 >NM_001282543.1| Homo sapiens BRCA1 associated RING domain 1 (BARD1), transcript variant 2, mRNA
401 CTCGTTCCGC
402
403 $ faidx --no-names tests/data/genes.fasta NM_001282543.1:201-210 NM_001282543.1:300-320
404 CTCGTTCCGC
405 GTAATTGTGTAAGTGACTGCA
406
407 $ faidx --complement tests/data/genes.fasta NM_001282543.1:201-210
408 >NM_001282543.1:201-210 (complement)
409 GAGCAAGGCG
410
411 $ faidx --reverse tests/data/genes.fasta NM_001282543.1:201-210
412 >NM_001282543.1:210-201
413 CGCCTTGCTC
414
415 $ faidx --reverse --complement tests/data/genes.fasta NM_001282543.1:201-210
416 >NM_001282543.1:210-201 (complement)
417 GCGGAACGAG
418
419 $ faidx tests/data/genes.fasta NM_001282543.1
420 >NM_001282543.1:1-5466
421 CCCCGCCCCT........
422 ..................
423 ..................
424 ..................
425
    $ faidx --regex "^NM_00128254[35]" tests/data/genes.fasta
427 >NM_001282543.1
428 ..................
429 ..................
430 ..................
431 >NM_001282545.1
432 ..................
433 ..................
434 ..................
435
436 $ faidx --lazy tests/data/genes.fasta NM_001282543.1:5460-5480
437 >NM_001282543.1:5460-5480
438 AAAAAAANNNNNNNNNNNNNN
439
440 $ faidx --lazy --default-seq='Q' tests/data/genes.fasta NM_001282543.1:5460-5480
441 >NM_001282543.1:5460-5480
442 AAAAAAAQQQQQQQQQQQQQQ
443
444 $ faidx tests/data/genes.fasta --bed regions.bed
445 ...
446
447 $ faidx --transform chromsizes tests/data/genes.fasta
448 AB821309.1 3510
449 KF435150.1 481
450 KF435149.1 642
451 NR_104216.1 4573
452 NR_104215.1 5317
453 NR_104212.1 5374
454 ...
455
456 $ faidx --transform bed tests/data/genes.fasta
457 AB821309.1 1 3510
458 KF435150.1 1 481
459 KF435149.1 1 642
460 NR_104216.1 1 4573
461 NR_104215.1 1 5317
462 NR_104212.1 1 5374
463 ...
464
465 $ faidx --transform nucleotide tests/data/genes.fasta
466 name start end A T C G N
467 AB821309.1 1 3510 955 774 837 944 0
468 KF435150.1 1 481 149 120 103 109 0
469 KF435149.1 1 642 201 163 129 149 0
470 NR_104216.1 1 4573 1294 1552 828 899 0
471 NR_104215.1 1 5317 1567 1738 968 1044 0
472 NR_104212.1 1 5374 1581 1756 977 1060 0
473 ...
474
    $ faidx --transform transposed tests/data/genes.fasta
476 AB821309.1 1 3510 ATGGTCAGCTGGGGTCGTTTCATC...
477 KF435150.1 1 481 ATGACATCATTTTCCACCTCTGCT...
478 KF435149.1 1 642 ATGACATCATTTTCCACCTCTGCT...
479 NR_104216.1 1 4573 CCCCGCCCCTCTGGCGGCCCGCCG...
480 NR_104215.1 1 5317 CCCCGCCCCTCTGGCGGCCCGCCG...
481 NR_104212.1 1 5374 CCCCGCCCCTCTGGCGGCCCGCCG...
482 ...
483
484 $ faidx --split-files tests/data/genes.fasta
485 $ ls
486 AB821309.1.fasta NM_001282549.1.fasta XM_005249645.1.fasta
487 KF435149.1.fasta NR_104212.1.fasta XM_005265507.1.fasta
488 KF435150.1.fasta NR_104215.1.fasta XM_005265508.1.fasta
489 NM_000465.3.fasta NR_104216.1.fasta XR_241079.1.fasta
490 NM_001282543.1.fasta XM_005249642.1.fasta XR_241080.1.fasta
491 NM_001282545.1.fasta XM_005249643.1.fasta XR_241081.1.fasta
492 NM_001282548.1.fasta XM_005249644.1.fasta
493
494 $ faidx --delimiter='_' tests/data/genes.fasta 000465.3
495 >000465.3
496 CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC
497 AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
498 .......
499
500 $ faidx --size-range 5500,6000 -i chromsizes tests/data/genes.fasta
501 NM_000465.3 5523
502
503 $ faidx -m --bed regions.bed tests/data/genes.fasta
504 ### Modifies tests/data/genes.fasta by masking regions using --default-seq character ###
505
506 $ faidx -M --bed regions.bed tests/data/genes.fasta
507 ### Modifies tests/data/genes.fasta by masking regions using lowercase characters ###
508
509 $ faidx -e "lambda x: x.split('.')[0]" tests/data/genes.fasta -i bed
510 AB821309 1 3510
511 KF435150 1 481
512 KF435149 1 642
513 NR_104216 1 4573
514 NR_104215 1 5317
515 .......
516
517
The syntax is similar to ``samtools faidx``.
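
The region strings and the ``--lazy`` padding used in the examples above can be mimicked in a few lines of Python. Both helpers are hypothetical sketches rather than the script's actual parsing code, and the default pad character ``N`` is taken from the example output.

.. code:: python

    def parse_region(region):
        """Split 'name:start-end' into (name, start, end); start/end are None if absent."""
        name, sep, span = region.rpartition(":")
        if not sep:
            return region, None, None
        start, _, end = span.partition("-")
        return name, int(start), int(end)

    def lazy_fetch(sequence, start, end, default_seq="N"):
        """Return 1-based inclusive start..end, padding past the end with default_seq."""
        available = sequence[start - 1:end]
        return available + default_seq * (end - start + 1 - len(available))

    name, start, end = parse_region("NM_001282543.1:5460-5480")
    stand_in = "C" * 5459 + "A" * 7  # made-up 5466-base sequence ending in seven A's
    print(name, start, end)
    print(lazy_fetch(stand_in, start, end))                   # AAAAAAANNNNNNNNNNNNNN
    print(lazy_fetch(stand_in, start, end, default_seq="Q"))  # AAAAAAAQQQQQQQQQQQQQQ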
519
520
521A lower-level Faidx class is also available:
522
523.. code:: python
524
525 >>> from pyfaidx import Faidx
526 >>> fa = Faidx('genes.fa') # can return str with as_raw=True
527 >>> fa.index
528 OrderedDict([('AB821309.1', IndexRecord(rlen=3510, offset=12, lenc=70, lenb=71)), ('KF435150.1', IndexRecord(rlen=481, offset=3585, lenc=70, lenb=71)),... ])
529
530 >>> fa.index['AB821309.1'].rlen
531 3510
532
    >>> fa.fetch('AB821309.1', 1, 10) # these are 1-based genomic coordinates
534 >AB821309.1:1-10
535 ATGGTCAGCT
536
537
- If the FASTA file is not indexed, the ``build_index`` method runs
  automatically when ``Faidx`` is initialized, and ``write_fai()`` writes
  the index to "filename.fa.fai", where "filename.fa" is the original
  FASTA file.
542- Start and end coordinates are 1-based.
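
To show how the four fields of an index record are enough to locate any base, here is a minimal sketch of the offset arithmetic on an in-memory file. The ``IndexRecord`` below is a local stand-in that reuses the field names shown above, and ``fetch`` is a simplified illustration rather than the library's method.

.. code:: python

    import io
    from collections import namedtuple

    IndexRecord = namedtuple("IndexRecord", "rlen offset lenc lenb")

    def fetch(handle, rec, start, end):
        """Return bases start..end (1-based, inclusive) using only index arithmetic."""
        begin = start - 1  # convert to a 0-based position
        newline_bytes = rec.lenb - rec.lenc
        # Skip whole lines (lenb bytes each), then the remainder within the line.
        handle.seek(rec.offset + (begin // rec.lenc) * rec.lenb + begin % rec.lenc)
        n_bases = end - begin
        # The read must also cover the line endings interleaved with the bases.
        n_line_ends = (begin % rec.lenc + n_bases) // rec.lenc
        raw = handle.read(n_bases + n_line_ends * newline_bytes)
        return raw.replace(b"\n", b"").replace(b"\r", b"")[:n_bases].decode()

    fasta = io.BytesIO(b">seq1\nACGTACGT\nACGTACGT\nACG\n")
    rec = IndexRecord(rlen=19, offset=6, lenc=8, lenb=9)
    print(fetch(fasta, rec, 7, 10))  # GTAC

The request spans a line break in the file, yet only five bytes are read.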
543
544Support for compressed FASTA
545----------------------------
546
547``pyfaidx`` can create and read ``.fai`` indices for FASTA files that have
548been compressed using the `bgzip <https://www.htslib.org/doc/bgzip.html>`_
549tool from `samtools <http://www.htslib.org/>`_. ``bgzip`` writes compressed
550data in a ``BGZF`` format. ``BGZF`` is ``gzip`` compatible, consisting of
551multiple concatenated ``gzip`` blocks, each with an additional ``gzip``
552header making it possible to build an index for rapid random access. I.e.,
553files compressed with ``bgzip`` are valid ``gzip`` and so can be read by
554``gunzip``. See `this description
555<http://pydoc.net/Python/biopython/1.66/Bio.bgzf/>`_ for more details on
556``bgzip``.
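
The property that concatenated ``gzip`` members form a valid ``gzip`` stream can be checked with the standard library. Note that this sketch uses plain ``gzip`` members, not real ``BGZF`` blocks (which carry an extra header field), so it only demonstrates the compatibility described above.

.. code:: python

    import gzip

    chunks = [b">seq1\nACGTACGT\n", b">seq2\nTTTTGGGG\n"]

    # Compress each chunk on its own, then concatenate the gzip members.
    members = [gzip.compress(chunk) for chunk in chunks]
    stream = b"".join(members)

    # The concatenation decompresses to the joined data ...
    whole = gzip.decompress(stream)
    # ... and any single member can be decompressed without the others.
    second_only = gzip.decompress(members[1])

    print(whole == b"".join(chunks), second_only == chunks[1])  # True True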
557
558Changelog
559---------
560
561Please see the `releases <https://github.com/mdshw5/pyfaidx/releases>`_ for a
562comprehensive list of version changes.
563
564Known issues
565------------
566
567I try to fix as many bugs as possible, but most of this work is supported by a single developer. Please check the `known issues <https://github.com/mdshw5/pyfaidx/issues?utf8=✓&q=is%3Aissue+is%3Aopen+label%3Aknown>`_ for bugs relevant to your work. Pull requests are welcome.
568
569
570Contributing
571------------
572
Create a new Pull Request with one feature. If you add a new feature, please
also add a relevant test.
575
To get the tests running on your machine:
 - Create a new virtualenv and install the requirements in ``dev-requirements.txt``.
 - Download the test data by running:
579
580 python tests/data/download_gene_fasta.py
581
 - Run the tests with:
583
584 nosetests --with-coverage --cover-package=pyfaidx
585
586Acknowledgements
587----------------
588
589This project is freely licensed by the author, `Matthew
590Shirley <http://mattshirley.com>`_, and was completed under the
591mentorship and financial support of Drs. `Sarah
592Wheelan <http://sjwheelan.som.jhmi.edu>`_ and `Vasan
593Yegnasubramanian <http://yegnalab.onc.jhmi.edu>`_ at the Sidney Kimmel
594Comprehensive Cancer Center in the Department of Oncology.
595
596.. |Travis| image:: https://travis-ci.org/mdshw5/pyfaidx.svg?branch=master
597 :target: https://travis-ci.org/mdshw5/pyfaidx
598
599.. |PyPI| image:: https://img.shields.io/pypi/v/pyfaidx.svg?branch=master
600 :target: https://pypi.python.org/pypi/pyfaidx
601
602.. |Landscape| image:: https://landscape.io/github/mdshw5/pyfaidx/master/landscape.svg
603 :target: https://landscape.io/github/mdshw5/pyfaidx/master
604 :alt: Code Health
605
606.. |Coverage| image:: https://codecov.io/gh/mdshw5/pyfaidx/branch/master/graph/badge.svg
607 :target: https://codecov.io/gh/mdshw5/pyfaidx
608
609.. |Depsy| image:: http://depsy.org/api/package/pypi/pyfaidx/badge.svg
610 :target: http://depsy.org/package/python/pyfaidx
611
612.. |Appveyor| image:: https://ci.appveyor.com/api/projects/status/80ihlw30a003596w?svg=true
613 :target: https://ci.appveyor.com/project/mdshw5/pyfaidx
614