README.licenses

This software constitutes a joint work and the contributions of individual
authors are subject to different licenses. Contributions and licenses are
listed in the applicable source files, with specific details on each
individual contribution captured in the revision control system.

--
For all code, except as indicated otherwise:

    PUBLIC DOMAIN NOTICE

    This software is "United States Government Work" under the terms of the
    United States Copyright Act. It was written as part of the authors'
    official duties for the United States Government and thus cannot be
    copyrighted. This software is freely available to the public for use
    without a copyright notice. Restrictions cannot be placed on its present or
    future use.

    Although all reasonable efforts have been taken to ensure the accuracy and
    reliability of the software and associated data, the National Human Genome
    Research Institute (NHGRI), National Institutes of Health (NIH) and the
    U.S. Government do not and cannot warrant the performance or results that
    may be obtained by using this software or data. NHGRI, NIH and the
    U.S. Government disclaim all warranties as to performance, merchantability
    or fitness for any particular purpose.

    Please cite the authors in any work or product based on this material.

--
Additional notices can be found in src/utility/README.licenses.

README.md
# seqrequester

This is 'seqrequester', a tool for summarizing, extracting, generating and
modifying DNA sequences.

# Summarizing

The summarize mode generates a table of Nx lengths, a lovely ASCII
plot of the histogram of sequence lengths, the GC content, and di-
and tri-nucleotide frequencies.

It can optionally split sequences at N's before computing sequence lengths.

You can also get a simple histogram of the sequence lengths and the number of sequences
at each length, or just a simple list of all sequence lengths.

It will, of course, read FASTA and FASTQ, uncompressed or compressed with
gzip, bzip2 or xz.

Only one report is generated, regardless of how many sequence files are supplied.


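As a rough illustration of the splitting and histogram steps described above, the Python sketch below splits each sequence at runs of N and counts how many pieces have each length. It is not seqrequester's implementation: the names `split_at_n` and `length_histogram` are hypothetical, and treating lowercase `n` as a gap and dropping empty pieces are assumptions.

```python
import re
from collections import Counter


def split_at_n(seq):
    """Split a sequence at runs of N (assumed: either case), dropping empty pieces."""
    return [piece for piece in re.split(r"[Nn]+", seq) if piece]


def length_histogram(seqs, split_n=False):
    """Return sorted (length, count) pairs, similar in spirit to a
    'length numSequences' table."""
    counts = Counter()
    for seq in seqs:
        pieces = split_at_n(seq) if split_n else [seq]
        counts.update(len(piece) for piece in pieces)
    return sorted(counts.items())


if __name__ == "__main__":
    reads = ["ACGTNNNACG", "ACGT", "NNACGTAC"]
    for length, count in length_histogram(reads, split_n=True):
        print(length, count)
```

With splitting enabled, a sequence such as `ACGTNNNACG` contributes two pieces (lengths 4 and 3) instead of one piece of length 10.
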
```
% seqrequester summarize
usage: seqrequester [mode] [options] [sequence_file ...]

OPTIONS for summarize mode:
  -size          base size to use for N50 statistics
  -1x            limit NG table to 1x coverage

  -split-n       split sequences at N bases before computing length
  -simple        output a simple 'length numSequences' histogram
  -lengths       output a list of the sequence lengths

  -assequences   load data as complete sequences (for testing)
  -asbases       load data as blocks of bases (for testing)
```

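To make the N50/NG statistics behind `-size` more concrete, here is a minimal Python sketch of how such a table is commonly computed: sort lengths from longest to shortest and, for each percentage, report the length at which the running sum first reaches that share of the base size. The function name `ng_table` is hypothetical, the default of using the sum of all lengths as the base size is an assumption, and the index reported here is a 1-based count of sequences, which may not match the tool's own convention.

```python
def ng_table(lengths, genome_size=None, steps=range(10, 101, 10)):
    """Return (percent, length, index, cumulative_sum) rows.

    genome_size: base size for the statistics; assumed to default to the
    sum of all lengths when not given.
    index: 1-based count of sequences needed to reach the target (assumption).
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        return []
    size = genome_size if genome_size is not None else sum(ordered)

    rows = []
    total = 0
    used = 0
    for pct in steps:
        # Integer comparison avoids floating-point rounding at the boundary.
        while used < len(ordered) and total * 100 < size * pct:
            total += ordered[used]
            used += 1
        if total * 100 < size * pct:
            break  # not enough sequence to reach this percentage
        rows.append((pct, ordered[used - 1], used, total))
    return rows


if __name__ == "__main__":
    for row in ng_table([100, 80, 60, 40, 20]):
        print("%05d %8d %6d %8d" % row)
```

When the base size is larger than the total sequence, the table simply stops at the last percentage that can be reached.
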
```
% seqrequester summarize /archive/mothra/FLX/*gz

G=6462464889                       sum of ||              length     num
NG          length     index      lengths ||               range    seqs
----- ------------ --------- ------------ || ------------------- -------
00010          652    801160    646246790 ||        42-112          4768|-
00020          582   1862887   1292493013 ||       113-183         16961|-
00030          555   3002684   1938739802 ||       184-254         89381|--
00040          538   4186751   2584986254 ||       255-325        536862|--------
00050          523   5405461   3231232945 ||       326-396       1463599|--------------------
00060          509   6657839   3877479295 ||       397-467       1960924|---------------------------
00070          488   7952426   4523725460 ||       468-538       4616863|---------------------------------------------------------------
00080          447   9329218   5169971940 ||       539-609       2858982|----------------------------------------
00090          389  10872299   5816218777 ||       610-680        625376|---------
00100           42  12803136   6462464889 ||       681-751        252454|----
001.000x            12803137   6462464889 ||       752-822        134849|--
                                          ||       823-893         78435|--
                                          ||       894-964         47976|-
                                          ||       965-1035        30852|-
                                          ||      1036-1106        21127|-
                                          ||      1107-1177        14817|-
                                          ||      1178-1248        28461|-
                                          ||      1249-1319         4930|-
                                          ||      1320-1390         3655|-
                                          ||      1391-1461         2657|-
                                          ||      1462-1532         2120|-
                                          ||      1533-1603         1597|-
                                          ||      1604-1674         1268|-
                                          ||      1675-1745          953|-
                                          ||      1746-1816          766|-
                                          ||      1817-1887          573|-
                                          ||      1888-1958          443|-
                                          ||      1959-2029          344|-
                                          ||      2030-2100         1022|-
                                          ||      2101-2171           21|-
                                          ||      2172-2242           23|-
                                          ||      2243-2313           20|-
                                          ||      2314-2384           17|-
                                          ||      2385-2455            8|-
                                          ||      2456-2526            9|-
                                          ||      2527-2597            4|-
                                          ||      2598-2668            2|-
                                          ||      2669-2739            6|-
                                          ||      2740-2810            2|-
                                          ||      2811-2881            6|-
                                          ||      2882-2952            1|-
                                          ||      2953-3023            0|
                                          ||      3024-3094            0|
                                          ||      3095-3165            1|-
                                          ||      3166-3236            0|
                                          ||      3237-3307            0|
                                          ||      3308-3378            0|
                                          ||      3379-3449            1|-
                                          ||      3450-3520            0|
                                          ||      3521-3591            1|-

--------------------- --------------------- ----------------------------------------------------------------------------------------------
   mononucleotide         dinucleotide                                              trinucleotide
--------------------- --------------------- ----------------------------------------------------------------------------------------------
1959571306 0.3032 A    665030151 0.1031 AA    237235545 0.0369 AAA    132268487 0.0205 AAC    136675399 0.0212 AAG    158473516 0.0246 AAT
1247489432 0.1930 C    389352138 0.0604 AC    115665542 0.0180 ACA     87346626 0.0136 ACC     70986769 0.0110 ACG    114582435 0.0178 ACT
1345011807 0.2081 G    397219280 0.0616 AG    121659180 0.0189 AGA     65811037 0.0102 AGC    102037062 0.0159 AGG    106854671 0.0166 AGT
1910392344 0.2956 T    507072196 0.0786 AT    152454159 0.0237 ATA     89877335 0.0140 ATC    106195089 0.0165 ATG    158544503 0.0246 ATT
                       380831936 0.0590 CA    132169383 0.0205 CAA     76839888 0.0119 CAC     67197045 0.0104 CAG    104566859 0.0162 CAT
    --GC--  --AT--     281892951 0.0437 CC     86178881 0.0134 CCA     65022575 0.0101 CCC     50576089 0.0079 CCG     79660170 0.0124 CCT
    40.12%  59.88%     208535008 0.0323 CG     60164341 0.0093 CGA     27649662 0.0043 CGC     52322022 0.0081 CGG     67296554 0.0105 CGT
                       374626420 0.0581 CT     95122699 0.0148 CTA     75643338 0.0118 CTC     74304266 0.0115 CTG    129554475 0.0201 CTT
                       383528854 0.0595 GA    128291282 0.0199 GAA     70244915 0.0109 GAC     88104990 0.0137 GAG     96746854 0.0150 GAT
                       218253748 0.0338 GC     72696062 0.0113 GCA     51118632 0.0079 GCC     27797659 0.0043 GCG     66512705 0.0103 GCT
                       361154273 0.0560 GG     87662449 0.0136 GGA     51591104 0.0080 GGC    124820908 0.0194 GGG     89162545 0.0139 GGT
                       371793122 0.0576 GT    112773785 0.0175 GTA     61960408 0.0096 GTC     66540524 0.0103 GTG    130508651 0.0203 GTT
                       526153182 0.0816 TA    165978181 0.0258 TAA    109335141 0.0170 TAC    104450634 0.0162 TAG    146068463 0.0227 TAT
                       355583693 0.0551 TC    105474843 0.0164 TCA     77928750 0.0121 TCC     58773061 0.0091 TCG    113158614 0.0176 TCT
                       375551419 0.0582 TG    113251416 0.0176 TGA     72683975 0.0113 TGC     81429936 0.0127 TGG    107781308 0.0167 TGT
                       653083381 0.1013 TT    164899014 0.0256 TTA    127390884 0.0198 TTC    127703170 0.0198 TTG    233082150 0.0362 TTT
```

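The composition columns in the report above (counts, fractions and the GC percentage) can be approximated with a few lines of Python. This is only a sketch under stated assumptions: the names `kmer_counts`, `frequencies` and `gc_fraction` are hypothetical, only one strand is counted, and windows containing anything other than A, C, G or T are skipped, which may differ from how seqrequester handles ambiguity codes.

```python
from collections import Counter

BASES = set("ACGT")


def kmer_counts(seqs, k):
    """Count overlapping k-mers on the given strand only.
    Windows containing non-ACGT characters are skipped (assumption)."""
    counts = Counter()
    for seq in seqs:
        upper = seq.upper()
        for i in range(len(upper) - k + 1):
            kmer = upper[i:i + k]
            if set(kmer) <= BASES:
                counts[kmer] += 1
    return counts


def frequencies(counts):
    """Convert counts to fractions of the total number of counted k-mers."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def gc_fraction(seqs):
    """Fraction of counted bases that are G or C."""
    mono = kmer_counts(seqs, 1)
    total = sum(mono.values())
    return (mono["G"] + mono["C"]) / total if total else 0.0


if __name__ == "__main__":
    reads = ["ACGTACGT", "GGCCAATT"]
    print("GC fraction:", gc_fraction(reads))
    for kmer, freq in sorted(frequencies(kmer_counts(reads, 2)).items()):
        print(kmer, round(freq, 4))
```

Calling `kmer_counts(reads, 3)` gives the trinucleotide counts in the same way.
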
# Extracting

```
% seqrequester extract
usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]

OPTIONS for extract mode:
  -bases     baselist extract bases as specified in the 'list' from each sequence
  -sequences seqlist  extract ordinal sequences as specified in the 'list'

  -reverse            reverse the bases in the sequence
  -complement         complement the bases in the sequence
  -rc                 alias for -reverse -complement

  -compress           compress homopolymer runs to one base

  -upcase
  -downcase

  -length min-max     print sequence if it is at least 'min' bases and at most 'max' bases long

  a 'baselist' is a set of integers formed from any combination
  of the following, separated by a comma:
       num       a single number
       bgn-end   a range of numbers:  bgn <= end
  bases are space-based; -bases 0-2,4 will print the bases between
  the first two spaces (the first two bases) and the base after the
  fourth space (the fifth base).

  a 'seqlist' is a set of integers formed from any combination
  of the following, separated by a comma:
       num       a single number
       bgn-end   a range of numbers:  bgn <= end
  sequences are 1-based; -sequences 1,3-5 will print the first, third,
  fourth and fifth sequences.
```

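The two coordinate conventions in the usage text above are easy to mix up, so here is a small Python sketch of how they could be interpreted. It is an illustration rather than the tool's code: all function names are hypothetical, a single number `n` in a baselist is read as the one base after space `n`, and returning each requested piece as a separate list element is an assumption about output shape.

```python
from itertools import groupby

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_bases(seq, baselist):
    """Space-based: 'bgn-end' selects seq[bgn:end]; a single 'num' selects
    the one base after space num (assumed reading of the usage text)."""
    pieces = []
    for item in baselist.split(","):
        bgn, dash, end = item.partition("-")
        if dash:
            pieces.append(seq[int(bgn):int(end)])
        else:
            pieces.append(seq[int(bgn):int(bgn) + 1])
    return pieces


def select_sequences(seqs, seqlist):
    """1-based and inclusive: '1,3-5' selects the 1st, 3rd, 4th and 5th."""
    wanted = set()
    for item in seqlist.split(","):
        bgn, dash, end = item.partition("-")
        first = int(bgn)
        last = int(end) if dash else first
        wanted.update(range(first, last + 1))
    return [s for i, s in enumerate(seqs, start=1) if i in wanted]


def reverse_complement(seq):
    """Complement each base, then reverse (like -reverse -complement)."""
    return seq.translate(COMPLEMENT)[::-1]


def compress_homopolymers(seq):
    """Collapse each run of identical bases to a single base."""
    return "".join(base for base, _ in groupby(seq))


if __name__ == "__main__":
    print(extract_bases("ACGTACGT", "0-2,4"))
    print(select_sequences(["s1", "s2", "s3", "s4", "s5", "s6"], "1,3-5"))
    print(reverse_complement("AACG"), compress_homopolymers("AAACCGTTT"))
```

Note that `0-2` yields two bases under the space-based convention, while `3-5` in a seqlist yields three sequences because both ends are included.
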
# Sampling

```
% seqrequester sample
usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]

OPTIONS for sample mode:
  -paired             treat inputs as paired sequences; the first two files form the
                      first pair, and so on.

  -copies C           write C different copies of the sampling (without replacement).
  -output O           write output sequences to file O.  If paired, two files must be supplied.

  -coverage C         output C coverage of sequences, based on genome size G.
  -genomesize G

  -bases B            output B bases.

  -reads R            output R reads.
  -pairs P            output P pairs (only if -paired).

  -fraction F         output fraction F of the input bases.

```

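The size options above all reduce to a target amount of sequence: coverage times genome size, a number of bases, or a fraction of the input bases. The Python sketch below shows one plausible way to turn those into a random sample without replacement. The names `target_bases` and `sample_reads` are hypothetical, and stopping as soon as the target is reached (so the sample may slightly overshoot) is an assumption, not documented behaviour.

```python
import random


def target_bases(total_bases, coverage=None, genome_size=None,
                 bases=None, fraction=None):
    """Translate coverage/genome size, bases, or fraction into a base count."""
    if coverage is not None and genome_size is not None:
        return int(coverage * genome_size)
    if bases is not None:
        return bases
    if fraction is not None:
        return int(fraction * total_bases)
    return total_bases


def sample_reads(reads, n_bases, seed=None):
    """Pick reads in random order, each at most once, until at least
    n_bases bases are selected; return them in input order."""
    rng = random.Random(seed)
    order = list(range(len(reads)))
    rng.shuffle(order)

    picked = []
    total = 0
    for i in order:
        if total >= n_bases:
            break
        picked.append(i)
        total += len(reads[i])
    return [reads[i] for i in sorted(picked)]


if __name__ == "__main__":
    reads = ["ACGTACGTAC"] * 10                      # 100 bases in total
    want = target_bases(100, coverage=0.5, genome_size=100)
    print(len(sample_reads(reads, want, seed=1)), "reads selected")
```

Running the sampler with different seeds gives different subsets, which is one simple way to picture what several independent copies of a sampling would look like.
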
# Generating

Undocumented.

# Simulating

```
% seqrequester simulate
usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]

OPTIONS for simulate mode:
  -genome G           sample reads from these sequences
  -circular           treat the sequences in G as circular

  -genomesize g       genome size to use for deciding coverage below
  -coverage c         generate approximately c coverage of output
  -nreads n           generate exactly n reads of output
  -nbases n           generate approximately n bases of output

  -distribution F     generate read length by sampling the distribution in file F
                        one column  - each line is the length of a sequence
                        two columns - each line has the 'length' and 'number of sequences'

                      if file F doesn't exist, use a built-in distribution
                        ultra-long-nanopore
                        pacbio
                        pacbio-hifi

  -length min[-max]   (not implemented)
  -output x.fasta     (not implemented)
```

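The `-distribution` file formats and the coverage target above can be pictured with a short Python sketch. It reads either one-column or two-column lines into lengths and weights, then draws read lengths and start positions until roughly the requested coverage is produced. Everything here is an assumption for illustration: the names `parse_distribution` and `simulate_reads` are hypothetical, the genome is a single string whose own length stands in for the genome size, reads longer than the genome are truncated, and strand and sequencing errors are not modelled.

```python
import random


def parse_distribution(lines):
    """One column: each line is a length (weight 1).
    Two columns: each line is 'length numSequences'."""
    lengths, weights = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        lengths.append(int(fields[0]))
        weights.append(int(fields[1]) if len(fields) > 1 else 1)
    return lengths, weights


def simulate_reads(genome, lengths, weights, coverage, circular=False, seed=None):
    """Draw reads from genome until about coverage * len(genome) bases exist."""
    if not genome or not lengths or min(lengths) < 1:
        raise ValueError("need a non-empty genome and positive read lengths")
    rng = random.Random(seed)
    target = coverage * len(genome)

    reads = []
    produced = 0
    while produced < target:
        n = min(rng.choices(lengths, weights=weights)[0], len(genome))
        if circular:
            # Any start is valid; reads may wrap past the end of the genome.
            start = rng.randrange(len(genome))
            read = (genome + genome)[start:start + n]
        else:
            start = rng.randrange(len(genome) - n + 1)
            read = genome[start:start + n]
        reads.append(read)
        produced += len(read)
    return reads


if __name__ == "__main__":
    lengths, weights = parse_distribution(["10", "20 3"])
    reads = simulate_reads("ACGT" * 25, lengths, weights, coverage=2, seed=0)
    print(len(reads), "reads,", sum(map(len, reads)), "bases")
```

Because the loop stops only after the target is crossed, the output is approximately, not exactly, the requested coverage.
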