Name                 Date         Size      #Lines  LOC
..                   03-May-2022  -
tests/               04-Sep-2019  -         141     127
.gitignore           04-Sep-2019  14        3       2
.travis.yml          04-Sep-2019  195       14      12
Fasta.cpp            04-Sep-2019  13.5 KiB  352     268
Fasta.h              04-Sep-2019  2.8 KiB   86      69
FastaHack.cpp        04-Sep-2019  5.6 KiB   180     141
LICENSE              04-Sep-2019  1.1 KiB   22      17
LargeFileSupport.h   03-May-2022  438       17      15
Makefile             04-Sep-2019  855       43      27
README               04-Sep-2019  2.9 KiB   71      50
Region.h             04-Sep-2019  1.6 KiB   52      43
disorder.c           04-Sep-2019  4.4 KiB   193     117
disorder.h           04-Sep-2019  2.4 KiB   63      9
libdisorder.LICENSE  04-Sep-2019  17.9 KiB  340     281
split.cpp            04-Sep-2019  956       34      29
split.h              04-Sep-2019  781       21      11
README

fastahack --- *fast* FASTA file indexing, subsequence and sequence extraction

Author: Erik Garrison <erik.garrison@bc.edu>, Marth Lab, Boston College
Date:   May 7, 2010


Overview:

fastahack is a small application for indexing and extracting sequences and
subsequences from FASTA files.  The included Fasta.cpp library provides a FASTA
reader and indexer that can be embedded into applications which would benefit
from directly reading subsequences from FASTA files.  The library automatically
handles index file generation and use.


Features:

 - FASTA index (.fai) generation for FASTA files
 - Sequence extraction
 - Subsequence extraction
 - Sequence statistics (TODO: currently only entropy is provided)
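
The entropy statistic listed above can be illustrated with a minimal sketch.
This is not fastahack's own implementation (which builds on the bundled
disorder.c); the function name shannon_entropy is hypothetical, and the sketch
assumes entropy is measured in bits per symbol over the characters of the
sequence.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>

// Shannon entropy, in bits per symbol, of the character distribution in seq.
// Hypothetical helper for illustration only.
double shannon_entropy(const std::string& seq) {
    if (seq.empty()) return 0.0;

    // Count how often each byte value occurs.
    std::array<std::size_t, 256> counts{};
    for (unsigned char c : seq) ++counts[c];

    // H = -sum(p * log2(p)) over the symbols that actually occur.
    const double n = static_cast<double>(seq.size());
    double h = 0.0;
    for (std::size_t k : counts) {
        if (k == 0) continue;
        const double p = static_cast<double>(k) / n;
        h -= p * std::log2(p);
    }
    return h;
}
```

A homopolymer such as "AAAA" scores 0 bits, while a sequence that uses all
four bases equally often, such as "ACGT", scores 2 bits per base.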

Sequence and subsequence extraction use fseek64 to jump directly to the
requested bytes, which gives the fastest possible extraction without
RAM-intensive file loading operations.  This makes fastahack a useful tool for
bioinformaticists who need to quickly extract many subsequences from a
reference FASTA sequence.
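
As a rough sketch of why seeking is enough: given a samtools-style index
record, the byte position of any base follows from simple arithmetic.  The
struct and function names below are hypothetical, and std::istream::seekg
stands in for fseek64 so the example can run on an in-memory stream; it also
assumes every line except the last holds the same number of bases.

```cpp
#include <cassert>
#include <cstdint>
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// One index record (field names are illustrative, not taken from Fasta.h).
struct FaiEntry {
    std::string name;
    std::int64_t length;      // number of bases in the sequence
    std::int64_t offset;      // byte offset of the first base
    std::int64_t line_bases;  // bases per full line
    std::int64_t line_width;  // bytes per full line, newline included
};

// Byte offset of the base at 0-based position pos.
std::int64_t byte_offset(const FaiEntry& e, std::int64_t pos) {
    assert(e.line_bases > 0 && pos >= 0);
    return e.offset + (pos / e.line_bases) * e.line_width + pos % e.line_bases;
}

// Read len bases starting at 0-based pos: seek once, then skip newlines.
std::string fetch(std::istream& in, const FaiEntry& e,
                  std::int64_t pos, std::int64_t len) {
    if (pos < 0 || pos >= e.length || len <= 0) return "";
    if (pos + len > e.length) len = e.length - pos;  // clamp to sequence end

    in.clear();
    in.seekg(static_cast<std::streamoff>(byte_offset(e, pos)), std::ios::beg);

    std::string out;
    char c;
    while (static_cast<std::int64_t>(out.size()) < len && in.get(c)) {
        if (c != '\n' && c != '\r') out.push_back(c);
    }
    return out;
}
```

Only the requested bytes (plus the newlines between them) are read, so the
cost depends on the length of the subsequence rather than on the file size.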


Notes:

The index files generated by this system should be numerically equivalent to
those generated by samtools (http://samtools.sourceforge.net/).  However, while
samtools truncates sequence names in the index file, fastahack provides them
completely.

To simplify use, sequences can be addressed by their first
whitespace-separated field; e.g. "8 SN(Homo sapiens) GA(HG18) URI(NC_000008.9)"
can be addressed simply as "8", provided "8" is a unique first-field name in
the FASTA file.  Thus, to extract 20bp starting at position 323202 in
chromosome 8 from the human reference:

  % fastahack -r 8:323202..20 h.sapiens.fasta
  ACATTGTAATAGATCTCAGA
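
The first-field addressing described above can be sketched as a small helper.
The name short_name is hypothetical and this is not the code used in
Fasta.cpp; it assumes fields are separated by spaces or tabs and that a
leading '>' may or may not be present.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return the first whitespace-separated field of a FASTA header line.
// A leading '>' is dropped if present.  Hypothetical helper for illustration.
std::string short_name(const std::string& header) {
    const std::string ws = " \t\r\n";

    // Skip the '>' marker, then any whitespace before the name.
    std::size_t begin = (!header.empty() && header[0] == '>') ? 1 : 0;
    begin = header.find_first_not_of(ws, begin);
    if (begin == std::string::npos) return "";

    // The name ends at the next whitespace character or at end of line.
    const std::size_t end = header.find_first_of(ws, begin);
    if (end == std::string::npos) return header.substr(begin);
    return header.substr(begin, end - begin);
}
```

A lookup table keyed on this short name would need to reject or flag
duplicates, since the shortcut only works when the first field is unique.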

Usage information is provided by running fastahack with no arguments:

  % usage: fastahack [options] <fasta reference>

  options:
      -i, --index          generate fasta index <fasta reference>.fai
      -r, --region REGION  print the specified region
      -c, --stdin          read a stream of line-delimited region specifiers on stdin
                           and print the corresponding sequence for each on stdout
      -e, --entropy        print the shannon entropy of the specified region

  REGION is of the form <seq>, <seq>:<start>..<end>, <seq1>:<start>..<seq2>:<end>
  where start and end are 1-based, and the region includes the end position.
  Specifying a sequence name alone will return the entire sequence, specifying
  range will return that range, and specifying a single coordinate pair, e.g.
  <seq>:<start> will return just that base.
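
To make the REGION forms concrete, here is a minimal parsing sketch.  It is
not the parser in Region.h: the names ParsedRegion, to_int and parse_region
are hypothetical, only the <seq>, <seq>:<start> and <seq>:<start>..<end> forms
are handled, and sequence names are assumed not to contain ':'.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Parsed form of a region specifier (illustrative type).
struct ParsedRegion {
    std::string seq;
    std::int64_t start = -1;  // 1-based; -1 means "the entire sequence"
    std::int64_t end = -1;    // 1-based, inclusive
};

// Convert a whole string to an integer, rejecting trailing garbage.
std::int64_t to_int(const std::string& s) {
    std::size_t used = 0;
    const long long v = std::stoll(s, &used);  // throws if no digits found
    if (used != s.size()) throw std::invalid_argument("not a number: " + s);
    return v;
}

ParsedRegion parse_region(const std::string& spec) {
    ParsedRegion r;
    const std::size_t colon = spec.find(':');
    if (colon == std::string::npos) {  // "<seq>": whole sequence
        r.seq = spec;
        return r;
    }
    r.seq = spec.substr(0, colon);
    const std::string coords = spec.substr(colon + 1);
    const std::size_t dots = coords.find("..");
    if (dots == std::string::npos) {   // "<seq>:<start>": a single base
        r.start = r.end = to_int(coords);
    } else {                           // "<seq>:<start>..<end>"
        r.start = to_int(coords.substr(0, dots));
        r.end = to_int(coords.substr(dots + 2));
    }
    if (r.start < 1 || r.end < r.start) {
        throw std::invalid_argument("bad region: " + spec);
    }
    return r;
}
```

Because coordinates are 1-based and inclusive, a caller working with 0-based
offsets would read end - start + 1 bases beginning at offset start - 1.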


Limitations:

fastahack will only generate indexes for FASTA files in which the sequences
have self-consistent line lengths.  Trailing whitespace is allowed at the end
of sequences, but not embedded within the sequence.  These limitations are
necessitated by the complexity of indexing sequences whose lines change in
length: such inconsistencies frustrate the use of indexes, because each change
in line length would require a new entry in the index file.
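
The line-length requirement can be seen in a minimal indexing sketch.  This is
not the indexer in Fasta.cpp; FaiEntry and build_index are hypothetical names,
and the sketch assumes Unix ("\n") line endings and a newline after every
line.  Each sequence gets one record, so only the final line of a sequence may
be shorter than the others.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One index record (field names are illustrative).
struct FaiEntry {
    std::string name;         // full header text after '>'
    std::int64_t length;      // total bases
    std::int64_t offset;      // byte offset of the first base
    std::int64_t line_bases;  // bases per full line
    std::int64_t line_width;  // bytes per full line, newline included
};

std::vector<FaiEntry> build_index(std::istream& in) {
    std::vector<FaiEntry> index;
    std::string line;
    std::int64_t pos = 0;     // byte offset of the current line
    bool tail_seen = false;   // a short or blank line must end the sequence
    while (std::getline(in, line)) {
        const std::int64_t bases = static_cast<std::int64_t>(line.size());
        const std::int64_t width = bases + 1;  // assumes a "\n" terminator
        if (bases > 0 && line[0] == '>') {
            // New sequence: its bases begin right after the header line.
            index.push_back({line.substr(1), 0, pos + width, 0, 0});
            tail_seen = false;
        } else if (!index.empty()) {
            FaiEntry& e = index.back();
            if (bases == 0) {
                tail_seen = true;  // blank line: fine only at the end
            } else {
                if (tail_seen || (e.line_bases != 0 && bases > e.line_bases)) {
                    throw std::runtime_error("inconsistent line length in " + e.name);
                }
                if (e.line_bases == 0) {
                    e.line_bases = bases;
                    e.line_width = width;
                } else if (bases < e.line_bases) {
                    tail_seen = true;
                }
                e.length += bases;
            }
        }
        pos += width;
    }
    return index;
}
```

A sequence whose lines vary in length cannot be described by a single
(line_bases, line_width) pair, which is why the sketch rejects it instead of
emitting extra records.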