|Travis| |PyPI| |Coverage| |Depsy|

Description
-----------

Samtools provides a function "faidx" (FAsta InDeX), which creates a
small flat index file ".fai" allowing for fast random access to any
subsequence in the indexed FASTA file, while loading a minimal amount of the
file into memory. This Python module implements pure Python classes for
indexing, retrieval, and in-place modification of FASTA files using a samtools
compatible index. The pyfaidx module is API compatible with the `pygr`_ seqdb module.
A command-line script "`faidx`_" is installed alongside the pyfaidx module, and
facilitates complex manipulation of FASTA files without any programming knowledge.

.. _`pygr`: https://github.com/cjlee112/pygr

If you use pyfaidx in your publication, please cite:

`Shirley MD`_, `Ma Z`_, `Pedersen B`_, `Wheelan S`_. `Efficient "pythonic" access to FASTA files using pyfaidx <https://dx.doi.org/10.7287/peerj.preprints.970v1>`_. PeerJ PrePrints 3:e1196. 2015.

.. _`Shirley MD`: http://github.com/mdshw5
.. _`Ma Z`: http://github.com/azalea
.. _`Pedersen B`: http://github.com/brentp
.. _`Wheelan S`: http://github.com/swheelan

Installation
------------

This package is tested under Linux, macOS, and Windows using Python 3.2-3.4, 2.7, 2.6, and pypy, and is available from PyPI:

::

    pip install pyfaidx  # add --user if you don't have root

or download a `release <https://github.com/mdshw5/pyfaidx/releases>`_ and:

::

    python setup.py install

If using ``pip install --user``, make sure to add ``/home/$(whoami)/.local/bin`` to your ``$PATH`` if you want to run the ``faidx`` script.

Usage
-----

.. code:: python

    >>> from pyfaidx import Fasta
    >>> genes = Fasta('tests/data/genes.fasta')
    >>> genes
    Fasta("tests/data/genes.fasta")  # set strict_bounds=True for bounds checking

Acts like a dictionary:

.. code:: python

    >>> genes.keys()
    ('AB821309.1', 'KF435150.1', 'KF435149.1', 'NR_104216.1', 'NR_104215.1', 'NR_104212.1', 'NM_001282545.1', 'NM_001282543.1', 'NM_000465.3', 'NM_001282549.1', 'NM_001282548.1', 'XM_005249645.1', 'XM_005249644.1', 'XM_005249643.1', 'XM_005249642.1', 'XM_005265508.1', 'XM_005265507.1', 'XR_241081.1', 'XR_241080.1', 'XR_241079.1')

    >>> genes['NM_001282543.1'][200:230]
    >NM_001282543.1:201-230
    CTCGTTCCGCGCCCGCCATGGAACCGGATG

    >>> genes['NM_001282543.1'][200:230].seq
    'CTCGTTCCGCGCCCGCCATGGAACCGGATG'

    >>> genes['NM_001282543.1'][200:230].name
    'NM_001282543.1'

    # Start attributes are 1-based
    >>> genes['NM_001282543.1'][200:230].start
    201

    # End attributes are 0-based
    >>> genes['NM_001282543.1'][200:230].end
    230

    >>> genes['NM_001282543.1'][200:230].fancy_name
    'NM_001282543.1:201-230'

    >>> len(genes['NM_001282543.1'])
    5466

Note that the start and end coordinates of Sequence objects are [1, 0]: ``start`` is 1-based and ``end`` is 0-based. This can be changed to [0, 0] by passing ``one_based_attributes=False`` to ``Fasta`` or ``Faidx``. This argument only affects the ``.start`` and ``.end`` attributes of ``Sequence`` objects, and has no effect on slicing coordinates.

Indexes like a list:

.. code:: python

    >>> genes[0][:50]
    >AB821309.1:1-50
    ATGGTCAGCTGGGGTCGTTTCATCTGCCTGGTCGTGGTCACCATGGCAAC

Slices just like a string:

.. code:: python

    >>> genes['NM_001282543.1'][200:230][:10]
    >NM_001282543.1:201-210
    CTCGTTCCGC

    >>> genes['NM_001282543.1'][200:230][::-1]
    >NM_001282543.1:230-201
    GTAGGCCAAGGTACCGCCCGCGCCTTGCTC

    >>> genes['NM_001282543.1'][200:230][::3]
    >NM_001282543.1:201-230
    CGCCCCTACA

    >>> genes['NM_001282543.1'][:]
    >NM_001282543.1:1-5466
    CCCCGCCCCT........

- Slicing start and end coordinates are 0-based, just like Python sequences.

Complements and reverse complements just like DNA:

.. code:: python

    >>> genes['NM_001282543.1'][200:230].complement
    >NM_001282543.1 (complement):201-230
    GAGCAAGGCGCGGGCGGTACCTTGGCCTAC

    >>> genes['NM_001282543.1'][200:230].reverse
    >NM_001282543.1:230-201
    GTAGGCCAAGGTACCGCCCGCGCCTTGCTC

    >>> -genes['NM_001282543.1'][200:230]
    >NM_001282543.1 (complement):230-201
    CATCCGGTTCCATGGCGGGCGCGGAACGAG

``Fasta`` objects can also be accessed using method calls:

.. code:: python

    >>> genes.get_seq('NM_001282543.1', 201, 210)
    >NM_001282543.1:201-210
    CTCGTTCCGC

    >>> genes.get_seq('NM_001282543.1', 201, 210, rc=True)
    >NM_001282543.1 (complement):210-201
    GCGGAACGAG

Spliced sequences can be retrieved from a list of [start, end] coordinates:

**TODO** update this section

.. code:: python

    # new in v0.5.1
    >>> segments = [[1, 10], [50, 70]]
    >>> genes.get_spliced_seq('NM_001282543.1', segments)
    >gi|543583786|ref|NM_001282543.1|:1-70
    CCCCGCCCCTGGTTTCGAGTCGCTGGCCTGC

.. _keyfn:

Custom key functions provide cleaner access:

.. code:: python

    >>> from pyfaidx import Fasta
    >>> genes = Fasta('tests/data/genes.fasta', key_function = lambda x: x.split('.')[0])
    >>> genes.keys()
    dict_keys(['NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])
    >>> genes['NR_104212'][:10]
    >NR_104212:1-10
    CCCCGCCCCT

You can specify a character to split names on, which will generate additional entries:

.. code:: python

    >>> from pyfaidx import Fasta
    >>> genes = Fasta('tests/data/genes.fasta', split_char='.', duplicate_action="first")  # default duplicate_action="stop"
    >>> genes.keys()
    dict_keys(['.1', 'NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])

If your ``key_function`` or ``split_char`` generates duplicate entries, you can choose what action to take:

.. code:: python

    # new in v0.4.9
    >>> genes = Fasta('tests/data/genes.fasta', split_char="|", duplicate_action="longest")
    >>> genes.keys()
    dict_keys(['gi', '563317589', 'dbj', 'AB821309.1', '', '557361099', 'gb', 'KF435150.1', '557361097', 'KF435149.1', '543583796', 'ref', 'NR_104216.1', '543583795', 'NR_104215.1', '543583794', 'NR_104212.1', '543583788', 'NM_001282545.1', '543583786', 'NM_001282543.1', '543583785', 'NM_000465.3', '543583740', 'NM_001282549.1', '543583738', 'NM_001282548.1', '530384540', 'XM_005249645.1', '530384538', 'XM_005249644.1', '530384536', 'XM_005249643.1', '530384534', 'XM_005249642.1', '530373237', 'XM_005265508.1', '530373235', 'XM_005265507.1', '530364726', 'XR_241081.1', '530364725', 'XR_241080.1', '530364724', 'XR_241079.1'])

Filter functions (returning True) limit the index:

.. code:: python

    # new in v0.3.8
    >>> from pyfaidx import Fasta
    >>> genes = Fasta('tests/data/genes.fasta', filt_function = lambda x: x[0] == 'N')
    >>> genes.keys()
    dict_keys(['NR_104212', 'NM_001282543', 'NR_104216', 'NR_104215', 'NM_001282549', 'NM_000465', 'NM_001282545', 'NM_001282548'])
    >>> genes['XM_005249644']
    KeyError: XM_005249644 not in tests/data/genes.fasta.

Or just get a Python string:

.. code:: python

    >>> from pyfaidx import Fasta
    >>> genes = Fasta('tests/data/genes.fasta', as_raw=True)
    >>> genes
    Fasta("tests/data/genes.fasta", as_raw=True)

    >>> genes['NM_001282543.1'][200:230]
    CTCGTTCCGCGCCCGCCATGGAACCGGATG

You can make sure that you always receive an uppercase sequence, even if your FASTA file contains lowercase characters:

.. code:: python

    >>> from pyfaidx import Fasta
    >>> reference = Fasta('tests/data/genes.fasta.lower', sequence_always_upper=True)
    >>> reference['gi|557361099|gb|KF435150.1|'][1:70]
    >gi|557361099|gb|KF435150.1|:2-70
    TGACATCATTTTCCACCTCTGCTCAGTGTTCAACATCTGACAGTGCTTGCAGGATCTCTCCTGGACAAA

You can also perform line-based iteration, receiving the sequence lines as they appear in the FASTA file:

.. code:: python

    >>> from pyfaidx import Fasta
    >>> genes = Fasta('tests/data/genes.fasta')
    >>> for line in genes['NM_001282543.1']:
    ...     print(line)
    CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC
    AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
    CGATGCCGGATAATCGGCAGCCGAGGAACCGGCAGCCGAGGATCCGCTCCGGGAACGAGCCTCGTTCCGC
    ...

Sequence names are truncated on any whitespace. This is a limitation of the indexing strategy. However, full names can be recovered:

.. code:: python

    # new in v0.3.7
    >>> from pyfaidx import Fasta
    >>> genes = Fasta('tests/data/genes.fasta')
    >>> for record in genes:
    ...     print(record.name)
    ...     print(record.long_name)
    ...
    gi|563317589|dbj|AB821309.1|
    gi|563317589|dbj|AB821309.1| Homo sapiens FGFR2-AHCYL1 mRNA for FGFR2-AHCYL1 fusion kinase protein, complete cds
    gi|557361099|gb|KF435150.1|
    gi|557361099|gb|KF435150.1| Homo sapiens MDM4 protein variant Y (MDM4) mRNA, complete cds, alternatively spliced
    gi|557361097|gb|KF435149.1|
    gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds
    ...

    # new in v0.4.9
    >>> from pyfaidx import Fasta
    >>> genes = Fasta('tests/data/genes.fasta', read_long_names=True)
    >>> for record in genes:
    ...     print(record.name)
    ...
    gi|563317589|dbj|AB821309.1| Homo sapiens FGFR2-AHCYL1 mRNA for FGFR2-AHCYL1 fusion kinase protein, complete cds
    gi|557361099|gb|KF435150.1| Homo sapiens MDM4 protein variant Y (MDM4) mRNA, complete cds, alternatively spliced
    gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds

Records can be accessed efficiently as numpy arrays:

.. code:: python

    # new in v0.5.4
    >>> from pyfaidx import Fasta
    >>> import numpy as np
    >>> genes = Fasta('tests/data/genes.fasta')
    >>> np.asarray(genes['NM_001282543.1'])
    array(['C', 'C', 'C', ..., 'A', 'A', 'A'], dtype='|S1')

Sequences can be buffered in memory using a read-ahead buffer
for fast sequential access:

.. code:: python

    >>> from timeit import timeit
    >>> fetch = "genes['NM_001282543.1'][200:230]"
    >>> read_ahead = "import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta', read_ahead=10000)"
    >>> no_read_ahead = "import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta')"
    >>> string_slicing = "genes = {}; genes['NM_001282543.1'] = 'N'*10000"

    >>> timeit(fetch, no_read_ahead, number=10000)
    0.2204863309962093
    >>> timeit(fetch, read_ahead, number=10000)
    0.1121859749982832
    >>> timeit(fetch, string_slicing, number=10000)
    0.0033553699977346696

In this benchmark, read-ahead buffering roughly halves the runtime for sequential accesses to buffered regions.

.. role:: red

If you want to modify the contents of your FASTA file in-place, you can use the ``mutable`` argument.
Any portion of the FastaRecord can be replaced with an equivalent-length string.
:red:`Warning`: *This will change the contents of your file immediately and permanently:*

.. code:: python

    >>> genes = Fasta('tests/data/genes.fasta', mutable=True)
    >>> type(genes['NM_001282543.1'])
    <class 'pyfaidx.MutableFastaRecord'>

    >>> genes['NM_001282543.1'][:10]
    >NM_001282543.1:1-10
    CCCCGCCCCT
    >>> genes['NM_001282543.1'][:10] = 'NNNNNNNNNN'
    >>> genes['NM_001282543.1'][:15]
    >NM_001282543.1:1-15
    NNNNNNNNNNCTGGC

The FastaVariant class provides a way to integrate single nucleotide variant calls to generate a consensus sequence.

.. code:: python

    # new in v0.4.0
    >>> from pyfaidx import Fasta, FastaVariant
    >>> consensus = FastaVariant('tests/data/chr22.fasta', 'tests/data/chr22.vcf.gz', het=True, hom=True)
    RuntimeWarning: Using sample NA06984 genotypes.

    >>> consensus['22'].variant_sites
    (16042793, 21833121, 29153196, 29187373, 29187448, 29194610, 29821295, 29821332, 29993842, 32330460, 32352284)

    >>> consensus['22'][16042790:16042800]
    >22:16042791-16042800
    TCGTAGGACA

    >>> Fasta('tests/data/chr22.fasta')['22'][16042790:16042800]
    >22:16042791-16042800
    TCATAGGACA

    >>> consensus = FastaVariant('tests/data/chr22.fasta', 'tests/data/chr22.vcf.gz', sample='NA06984', het=True, hom=True, call_filter='GT == "0/1"')
    >>> consensus['22'].variant_sites
    (16042793, 29187373, 29187448, 29194610, 29821332)

.. _faidx:

It also provides a command-line script:

cli script: faidx
~~~~~~~~~~~~~~~~~

.. code:: bash

    Fetch sequences from FASTA. If no regions are specified, all entries in the
    input file are returned. Input FASTA file must be consistently line-wrapped,
    and line wrapping of output is based on input line lengths.

    positional arguments:
      fasta                 FASTA file
      regions               space separated regions of sequence to fetch e.g.
                            chr1:1-1000

    optional arguments:
      -h, --help            show this help message and exit
      -b BED, --bed BED     bed file of regions
      -o OUT, --out OUT     output file name (default: stdout)
      -i {bed,chromsizes,nucleotide,transposed}, --transform {bed,chromsizes,nucleotide,transposed}
                            transform the requested regions into another format. default: None
      -c, --complement      complement the sequence. default: False
      -r, --reverse         reverse the sequence. default: False
      -a SIZE_RANGE, --size-range SIZE_RANGE
                            selected sequences are in the size range [low, high]. example: 1,1000 default: None
      -n, --no-names        omit sequence names from output. default: False
      -f, --full-names      output full names including description. default: False
      -x, --split-files     write each region to a separate file (names are derived from regions)
      -l, --lazy            fill in --default-seq for missing ranges. default: False
      -s DEFAULT_SEQ, --default-seq DEFAULT_SEQ
                            default base for missing positions and masking. default: None
      -d DELIMITER, --delimiter DELIMITER
                            delimiter for splitting names to multiple values (duplicate names will be discarded). default: None
      -e HEADER_FUNCTION, --header-function HEADER_FUNCTION
                            python function to modify header lines e.g: "lambda x: x.split("|")[0]". default: lambda x: x.split()[0]
      -u {stop,first,last,longest,shortest}, --duplicates-action {stop,first,last,longest,shortest}
                            entry to take when duplicate sequence names are encountered. default: stop
      -g REGEX, --regex REGEX
                            selected sequences are those matching regular expression. default: .*
      -v, --invert-match    selected sequences are those not matching 'regions' argument. default: False
      -m, --mask-with-default-seq
                            mask the FASTA file using --default-seq default: False
      -M, --mask-by-case    mask the FASTA file by changing to lowercase. default: False
      --no-rebuild          do not rebuild the .fai index even if it is out of date. default: False
      --version             print pyfaidx version number

Examples:

.. code:: bash

    $ faidx tests/data/genes.fasta NM_001282543.1:201-210 NM_001282543.1:300-320
    >NM_001282543.1:201-210
    CTCGTTCCGC
    >NM_001282543.1:300-320
    GTAATTGTGTAAGTGACTGCA

    $ faidx --full-names tests/data/genes.fasta NM_001282543.1:201-210
    >NM_001282543.1| Homo sapiens BRCA1 associated RING domain 1 (BARD1), transcript variant 2, mRNA
    CTCGTTCCGC

    $ faidx --no-names tests/data/genes.fasta NM_001282543.1:201-210 NM_001282543.1:300-320
    CTCGTTCCGC
    GTAATTGTGTAAGTGACTGCA

    $ faidx --complement tests/data/genes.fasta NM_001282543.1:201-210
    >NM_001282543.1:201-210 (complement)
    GAGCAAGGCG

    $ faidx --reverse tests/data/genes.fasta NM_001282543.1:201-210
    >NM_001282543.1:210-201
    CGCCTTGCTC

    $ faidx --reverse --complement tests/data/genes.fasta NM_001282543.1:201-210
    >NM_001282543.1:210-201 (complement)
    GCGGAACGAG

    $ faidx tests/data/genes.fasta NM_001282543.1
    >NM_001282543.1:1-5466
    CCCCGCCCCT........
    ..................
    ..................
    ..................

    $ faidx --regex "^NM_00128254[35]" genes.fasta
    >NM_001282543.1
    ..................
    ..................
    ..................
    >NM_001282545.1
    ..................
    ..................
    ..................

    $ faidx --lazy tests/data/genes.fasta NM_001282543.1:5460-5480
    >NM_001282543.1:5460-5480
    AAAAAAANNNNNNNNNNNNNN

    $ faidx --lazy --default-seq='Q' tests/data/genes.fasta NM_001282543.1:5460-5480
    >NM_001282543.1:5460-5480
    AAAAAAAQQQQQQQQQQQQQQ

    $ faidx tests/data/genes.fasta --bed regions.bed
    ...

    $ faidx --transform chromsizes tests/data/genes.fasta
    AB821309.1      3510
    KF435150.1      481
    KF435149.1      642
    NR_104216.1     4573
    NR_104215.1     5317
    NR_104212.1     5374
    ...

    $ faidx --transform bed tests/data/genes.fasta
    AB821309.1      1       3510
    KF435150.1      1       481
    KF435149.1      1       642
    NR_104216.1     1       4573
    NR_104215.1     1       5317
    NR_104212.1     1       5374
    ...

    $ faidx --transform nucleotide tests/data/genes.fasta
    name            start   end     A       T       C       G       N
    AB821309.1      1       3510    955     774     837     944     0
    KF435150.1      1       481     149     120     103     109     0
    KF435149.1      1       642     201     163     129     149     0
    NR_104216.1     1       4573    1294    1552    828     899     0
    NR_104215.1     1       5317    1567    1738    968     1044    0
    NR_104212.1     1       5374    1581    1756    977     1060    0
    ...

    $ faidx --transform transposed tests/data/genes.fasta
    AB821309.1      1       3510    ATGGTCAGCTGGGGTCGTTTCATC...
    KF435150.1      1       481     ATGACATCATTTTCCACCTCTGCT...
    KF435149.1      1       642     ATGACATCATTTTCCACCTCTGCT...
    NR_104216.1     1       4573    CCCCGCCCCTCTGGCGGCCCGCCG...
    NR_104215.1     1       5317    CCCCGCCCCTCTGGCGGCCCGCCG...
    NR_104212.1     1       5374    CCCCGCCCCTCTGGCGGCCCGCCG...
    ...

    $ faidx --split-files tests/data/genes.fasta
    $ ls
    AB821309.1.fasta        NM_001282549.1.fasta    XM_005249645.1.fasta
    KF435149.1.fasta        NR_104212.1.fasta       XM_005265507.1.fasta
    KF435150.1.fasta        NR_104215.1.fasta       XM_005265508.1.fasta
    NM_000465.3.fasta       NR_104216.1.fasta       XR_241079.1.fasta
    NM_001282543.1.fasta    XM_005249642.1.fasta    XR_241080.1.fasta
    NM_001282545.1.fasta    XM_005249643.1.fasta    XR_241081.1.fasta
    NM_001282548.1.fasta    XM_005249644.1.fasta

    $ faidx --delimiter='_' tests/data/genes.fasta 000465.3
    >000465.3
    CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC
    AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
    .......

    $ faidx --size-range 5500,6000 -i chromsizes tests/data/genes.fasta
    NM_000465.3     5523

    $ faidx -m --bed regions.bed tests/data/genes.fasta
    ### Modifies tests/data/genes.fasta by masking regions using --default-seq character ###

    $ faidx -M --bed regions.bed tests/data/genes.fasta
    ### Modifies tests/data/genes.fasta by masking regions using lowercase characters ###

    $ faidx -e "lambda x: x.split('.')[0]" tests/data/genes.fasta -i bed
    AB821309        1       3510
    KF435150        1       481
    KF435149        1       642
    NR_104216       1       4573
    NR_104215       1       5317
    .......


The syntax is similar to that of ``samtools faidx``.


A lower-level Faidx class is also available:

.. code:: python

    >>> from pyfaidx import Faidx
    >>> fa = Faidx('genes.fa')  # can return str with as_raw=True
    >>> fa.index
    OrderedDict([('AB821309.1', IndexRecord(rlen=3510, offset=12, lenc=70, lenb=71)), ('KF435150.1', IndexRecord(rlen=481, offset=3585, lenc=70, lenb=71)),... ])

    >>> fa.index['AB821309.1'].rlen
    3510

    >>> fa.fetch('AB821309.1', 1, 10)  # these are 1-based genomic coordinates
    >AB821309.1:1-10
    ATGGTCAGCT


- If the FASTA file is not indexed, when ``Faidx`` is initialized the
  ``build_index`` method will automatically run, and
  the index will be written to "filename.fa.fai" with ``write_fai()``,
  where "filename.fa" is the original FASTA file.
- Start and end coordinates are 1-based.

Support for compressed FASTA
----------------------------

``pyfaidx`` can create and read ``.fai`` indices for FASTA files that have
been compressed using the `bgzip <https://www.htslib.org/doc/bgzip.html>`_
tool from `samtools <http://www.htslib.org/>`_. ``bgzip`` writes compressed
data in a ``BGZF`` format. ``BGZF`` is ``gzip`` compatible, consisting of
multiple concatenated ``gzip`` blocks, each with an additional ``gzip``
header making it possible to build an index for rapid random access. That is,
files compressed with ``bgzip`` are valid ``gzip`` and so can be read by
``gunzip``. See `this description
<http://pydoc.net/Python/biopython/1.66/Bio.bgzf/>`_ for more details on
``bgzip``.

Changelog
---------

Please see the `releases <https://github.com/mdshw5/pyfaidx/releases>`_ for a
comprehensive list of version changes.

Known issues
------------

I try to fix as many bugs as possible, but most of this work is supported by a single developer. Please check the `known issues <https://github.com/mdshw5/pyfaidx/issues?utf8=✓&q=is%3Aissue+is%3Aopen+label%3Aknown>`_ for bugs relevant to your work. Pull requests are welcome.


Contributing
------------

Create a new Pull Request with one feature. If you add a new feature, please
also add the relevant test.

To get the tests running on your machine:

- Create a new virtualenv and install the ``dev-requirements.txt``.
- Download the test data by running::

      python tests/data/download_gene_fasta.py

- Run the tests with::

      nosetests --with-coverage --cover-package=pyfaidx

Acknowledgements
----------------

This project is freely licensed by the author, `Matthew
Shirley <http://mattshirley.com>`_, and was completed under the
mentorship and financial support of Drs. `Sarah
Wheelan <http://sjwheelan.som.jhmi.edu>`_ and `Vasan
Yegnasubramanian <http://yegnalab.onc.jhmi.edu>`_ at the Sidney Kimmel
Comprehensive Cancer Center in the Department of Oncology.

.. |Travis| image:: https://travis-ci.org/mdshw5/pyfaidx.svg?branch=master
    :target: https://travis-ci.org/mdshw5/pyfaidx

.. |PyPI| image:: https://img.shields.io/pypi/v/pyfaidx.svg?branch=master
    :target: https://pypi.python.org/pypi/pyfaidx

.. |Landscape| image:: https://landscape.io/github/mdshw5/pyfaidx/master/landscape.svg
   :target: https://landscape.io/github/mdshw5/pyfaidx/master
   :alt: Code Health

.. |Coverage| image:: https://codecov.io/gh/mdshw5/pyfaidx/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/mdshw5/pyfaidx

.. |Depsy| image:: http://depsy.org/api/package/pypi/pyfaidx/badge.svg
   :target: http://depsy.org/package/python/pyfaidx

.. |Appveyor| image:: https://ci.appveyor.com/api/projects/status/80ihlw30a003596w?svg=true
   :target: https://ci.appveyor.com/project/mdshw5/pyfaidx
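
Appendix: how a ``.fai`` index enables random access
----------------------------------------------------

The ``IndexRecord`` fields shown in the ``Faidx`` example (``rlen``, ``offset``,
``lenc``, ``lenb``) are enough to turn a sequence coordinate into a byte position
in a consistently line-wrapped FASTA file. The sketch below illustrates that
arithmetic using only the standard library. It is not pyfaidx's implementation:
the standalone ``build_index`` and ``fetch`` functions and the tiny test file are
hypothetical and exist only for illustration. It assumes ``rlen`` is the sequence
length, ``offset`` the byte offset of the first base, ``lenc`` the number of bases
per line, and ``lenb`` the number of bytes per line including the newline.

.. code:: python

    import os
    import tempfile
    from collections import namedtuple

    # Field names mirror the IndexRecord shown in the Faidx example.
    IndexRecord = namedtuple("IndexRecord", ["rlen", "offset", "lenc", "lenb"])


    def build_index(path):
        """Scan a consistently line-wrapped FASTA file; return {name: IndexRecord}."""
        index = {}
        name = None
        rlen = offset = lenc = lenb = 0
        position = 0  # byte offset of the start of the current line
        with open(path, "rb") as handle:
            for line in handle:
                if line.startswith(b">"):
                    if name is not None:
                        index[name] = IndexRecord(rlen, offset, lenc, lenb)
                    # Names are truncated at the first whitespace.
                    name = line[1:].split()[0].decode()
                    rlen = lenc = lenb = 0
                    offset = position + len(line)  # first base follows the header
                else:
                    bases = len(line.rstrip(b"\r\n"))
                    if lenc == 0:
                        lenc, lenb = bases, len(line)
                    rlen += bases
                position += len(line)
        if name is not None:
            index[name] = IndexRecord(rlen, offset, lenc, lenb)
        return index


    def byte_position(record, pos):
        """Map a 0-based sequence position to a byte offset in the file."""
        return record.offset + (pos // record.lenc) * record.lenb + pos % record.lenc


    def fetch(path, record, start, end):
        """Return bases start..end (1-based, inclusive), reading only that span."""
        first = byte_position(record, start - 1)
        last = byte_position(record, end - 1)
        with open(path, "rb") as handle:
            handle.seek(first)
            chunk = handle.read(last - first + 1)
        return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


    if __name__ == "__main__":
        # Binary mode keeps the byte offsets identical on every platform.
        fd, path = tempfile.mkstemp(suffix=".fa")
        with os.fdopen(fd, "wb") as handle:
            handle.write(b">seq1 description\nACGTACGTAC\nGGGGCCCCAA\nTT\n>seq2\nAAAA\n")
        try:
            index = build_index(path)
            print(index["seq1"])  # IndexRecord(rlen=22, offset=18, lenc=10, lenb=11)
            print(fetch(path, index["seq1"], 9, 12))  # ACGG
        finally:
            os.remove(path)

Because only the requested byte range is read, the cost of a fetch depends on the
length of the region rather than the size of the file, which is why the input
must be consistently line-wrapped.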