# seqrequester

This is 'seqrequester', a tool for summarizing, extracting, generating and
modifying DNA sequences.

# Summarizing

The summarize mode generates a table of Nx lengths and a lovely ASCII
plot of the histogram of sequence lengths, and reports GC content and di-
and tri-nucleotide frequencies.
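
To make the composition report concrete, here is a minimal Python sketch of how GC content and mono-, di- and tri-nucleotide counts can be computed. It is an illustration, not seqrequester's implementation: the names `kmer_counts` and `gc_percent` are made up for this example, and skipping any k-mer that contains a letter other than A, C, G or T is an assumption.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count every k-mer made only of A, C, G, T (assumption: others are skipped)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if all(base in "ACGT" for base in kmer):
                counts[kmer] += 1
    return counts

def gc_percent(sequences):
    """Percentage of counted bases that are G or C."""
    mono = kmer_counts(sequences, 1)
    total = sum(mono.values())
    return 100.0 * (mono["G"] + mono["C"]) / total if total else 0.0

seqs = ["ACGT", "AACC"]
di = kmer_counts(seqs, 2)
for kmer in sorted(di):
    # Frequency is relative to the number of k-mers counted, not the number of bases.
    print(kmer, di[kmer], round(di[kmer] / sum(di.values()), 4))
print("GC%", gc_percent(seqs))
```

Dividing by the number of k-mers rather than the number of bases matters for k > 1, since each sequence of length n contributes only n - k + 1 k-mers.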

It can optionally split sequences at N's (`-split-n`) before computing sequence lengths.
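
As a rough picture of what splitting at N's means for the length statistics, the following Python sketch breaks each sequence at runs of N and measures the pieces. The helper names are hypothetical, and treating lowercase `n` like `N` and dropping empty pieces are assumptions, not documented behavior.

```python
import re

def split_at_n(sequence):
    """Return the non-empty pieces of a sequence between runs of N (or n)."""
    return [piece for piece in re.split(r"[Nn]+", sequence) if piece]

def sequence_lengths(sequences, split_n=False):
    """Lengths of whole sequences, or of their N-free pieces when split_n is set."""
    result = []
    for seq in sequences:
        pieces = split_at_n(seq) if split_n else [seq]
        result.extend(len(piece) for piece in pieces)
    return result

scaffold = "ACGTNNNNACGNT"
print(sequence_lengths([scaffold]))                # one length for the whole scaffold
print(sequence_lengths([scaffold], split_n=True))  # one length per N-free piece
```

For a scaffolded assembly this is the difference between scaffold lengths and contig lengths.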

It can also output a simple histogram of the sequence lengths and the number of sequences
at each length (`-simple`), or just a simple list of all sequence lengths (`-lengths`).

It will, of course, read FASTA and FASTQ files, uncompressed or compressed with
gzip, bzip2 or xz.

Only one report is generated, regardless of how many sequence files are supplied.

```
% seqrequester summarize
usage: seqrequester [mode] [options] [sequence_file ...]

OPTIONS for summarize mode:
  -size          base size to use for N50 statistics
  -1x            limit NG table to 1x coverage

  -split-n       split sequences at N bases before computing length
  -simple        output a simple 'length numSequences' histogram
  -lengths       output a list of the sequence lengths

  -assequences   load data as complete sequences (for testing)
  -asbases       load data as blocks of bases    (for testing)
```

```
% seqrequester summarize /archive/mothra/FLX/*gz

G=6462464889                       sum of  ||               length     num
NG         length     index       lengths  ||                range    seqs
----- ------------ --------- ------------  ||  ------------------- -------
00010          652    801160    646246790  ||         42-112          4768|-
00020          582   1862887   1292493013  ||        113-183         16961|-
00030          555   3002684   1938739802  ||        184-254         89381|--
00040          538   4186751   2584986254  ||        255-325        536862|--------
00050          523   5405461   3231232945  ||        326-396       1463599|--------------------
00060          509   6657839   3877479295  ||        397-467       1960924|---------------------------
00070          488   7952426   4523725460  ||        468-538       4616863|---------------------------------------------------------------
00080          447   9329218   5169971940  ||        539-609       2858982|----------------------------------------
00090          389  10872299   5816218777  ||        610-680        625376|---------
00100           42  12803136   6462464889  ||        681-751        252454|----
001.000x            12803137   6462464889  ||        752-822        134849|--
                                           ||        823-893         78435|--
                                           ||        894-964         47976|-
                                           ||        965-1035        30852|-
                                           ||       1036-1106        21127|-
                                           ||       1107-1177        14817|-
                                           ||       1178-1248        28461|-
                                           ||       1249-1319         4930|-
                                           ||       1320-1390         3655|-
                                           ||       1391-1461         2657|-
                                           ||       1462-1532         2120|-
                                           ||       1533-1603         1597|-
                                           ||       1604-1674         1268|-
                                           ||       1675-1745          953|-
                                           ||       1746-1816          766|-
                                           ||       1817-1887          573|-
                                           ||       1888-1958          443|-
                                           ||       1959-2029          344|-
                                           ||       2030-2100         1022|-
                                           ||       2101-2171           21|-
                                           ||       2172-2242           23|-
                                           ||       2243-2313           20|-
                                           ||       2314-2384           17|-
                                           ||       2385-2455            8|-
                                           ||       2456-2526            9|-
                                           ||       2527-2597            4|-
                                           ||       2598-2668            2|-
                                           ||       2669-2739            6|-
                                           ||       2740-2810            2|-
                                           ||       2811-2881            6|-
                                           ||       2882-2952            1|-
                                           ||       2953-3023            0|
                                           ||       3024-3094            0|
                                           ||       3095-3165            1|-
                                           ||       3166-3236            0|
                                           ||       3237-3307            0|
                                           ||       3308-3378            0|
                                           ||       3379-3449            1|-
                                           ||       3450-3520            0|
                                           ||       3521-3591            1|-

--------------------- --------------------- ----------------------------------------------------------------------------------------------
       mononucleotide          dinucleotide                                                                                  trinucleotide
--------------------- --------------------- ----------------------------------------------------------------------------------------------
  1959571306 0.3032 A   665030151 0.1031 AA   237235545 0.0369 AAA    132268487 0.0205 AAC    136675399 0.0212 AAG    158473516 0.0246 AAT
  1247489432 0.1930 C   389352138 0.0604 AC   115665542 0.0180 ACA     87346626 0.0136 ACC     70986769 0.0110 ACG    114582435 0.0178 ACT
  1345011807 0.2081 G   397219280 0.0616 AG   121659180 0.0189 AGA     65811037 0.0102 AGC    102037062 0.0159 AGG    106854671 0.0166 AGT
  1910392344 0.2956 T   507072196 0.0786 AT   152454159 0.0237 ATA     89877335 0.0140 ATC    106195089 0.0165 ATG    158544503 0.0246 ATT
                        380831936 0.0590 CA   132169383 0.0205 CAA     76839888 0.0119 CAC     67197045 0.0104 CAG    104566859 0.0162 CAT
      --GC--  --AT--    281892951 0.0437 CC    86178881 0.0134 CCA     65022575 0.0101 CCC     50576089 0.0079 CCG     79660170 0.0124 CCT
      40.12%  59.88%    208535008 0.0323 CG    60164341 0.0093 CGA     27649662 0.0043 CGC     52322022 0.0081 CGG     67296554 0.0105 CGT
                        374626420 0.0581 CT    95122699 0.0148 CTA     75643338 0.0118 CTC     74304266 0.0115 CTG    129554475 0.0201 CTT
                        383528854 0.0595 GA   128291282 0.0199 GAA     70244915 0.0109 GAC     88104990 0.0137 GAG     96746854 0.0150 GAT
                        218253748 0.0338 GC    72696062 0.0113 GCA     51118632 0.0079 GCC     27797659 0.0043 GCG     66512705 0.0103 GCT
                        361154273 0.0560 GG    87662449 0.0136 GGA     51591104 0.0080 GGC    124820908 0.0194 GGG     89162545 0.0139 GGT
                        371793122 0.0576 GT   112773785 0.0175 GTA     61960408 0.0096 GTC     66540524 0.0103 GTG    130508651 0.0203 GTT
                        526153182 0.0816 TA   165978181 0.0258 TAA    109335141 0.0170 TAC    104450634 0.0162 TAG    146068463 0.0227 TAT
                        355583693 0.0551 TC   105474843 0.0164 TCA     77928750 0.0121 TCC     58773061 0.0091 TCG    113158614 0.0176 TCT
                        375551419 0.0582 TG   113251416 0.0176 TGA     72683975 0.0113 TGC     81429936 0.0127 TGG    107781308 0.0167 TGT
                        653083381 0.1013 TT   164899014 0.0256 TTA    127390884 0.0198 TTC    127703170 0.0198 TTG    233082150 0.0362 TTT
```
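
The NG columns in the report above can be read with the usual definition of an NG statistic: sort the lengths from longest to shortest and accumulate them until the running sum reaches a given percentage of the genome size G. The Python sketch below follows that definition. It is not taken from the seqrequester source: the name `ng_table` is hypothetical, and reporting a 0-based index and defaulting G to the sum of all lengths (which is what G appears to be in the example, where no `-size` was given) are assumptions.

```python
def ng_table(lengths, genome_size=None, steps=range(10, 101, 10)):
    """Return (NG, length, index, sum of lengths) rows for each NG step."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        return []
    size = genome_size if genome_size is not None else sum(ordered)
    rows = []
    running = 0   # sum of the lengths consumed so far
    used = 0      # number of sequences consumed so far
    for ng in steps:
        # Integer comparison avoids floating-point rounding: running < size * ng / 100.
        while used < len(ordered) and running * 100 < size * ng:
            running += ordered[used]
            used += 1
        if running * 100 < size * ng:
            break  # not enough sequence to reach this NG level
        rows.append((ng, ordered[used - 1], used - 1, running))
    return rows

for ng, length, index, total in ng_table([2, 3, 4, 5, 6]):
    print(f"{ng:05d} {length:12d} {index:9d} {total:12d}")
```

When `genome_size` is larger than the sum of the lengths, the table simply stops at the last NG level the data can reach.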

# Extracting

```
% seqrequester extract
usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]

OPTIONS for extract mode:
  -bases baselist     extract bases as specified in the 'list' from each sequence
  -sequences seqlist  extract ordinal sequences as specified in the 'list'

  -reverse            reverse the bases in the sequence
  -complement         complement the bases in the sequence
  -rc                 alias for -reverse -complement

  -compress           compress homopolymer runs to one base

  -upcase
  -downcase

  -length min-max     print sequence if it is at least 'min' bases and at most 'max' bases long

                      a 'baselist' is a set of integers formed from any combination
                      of the following, separated by a comma:
                           num       a single number
                           bgn-end   a range of numbers:  bgn <= end
                      bases are space-based; -bases 0-2,4 will print the bases between
                      the first two spaces (the first two bases) and the base after the
                      fourth space (the fifth base).

                      a 'seqlist' is a set of integers formed from any combination
                      of the following, separated by a comma:
                           num       a single number
                           bgn-end   a range of numbers:  bgn <= end
                      sequences are 1-based; -sequences 1,3-5 will print the first, third,
                      fourth and fifth sequences.
```
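
The two list conventions in the help text are easy to mix up, so here is a small Python sketch of both, reproducing the two examples given above. All function names are hypothetical, and returning one piece per list item (rather than, say, concatenating them) is an assumption about output that the help text does not specify.

```python
def parse_list(text):
    """Parse comma-separated 'num' and 'bgn-end' items into (bgn, end) pairs.

    A single number is returned with end set to None.
    """
    items = []
    for item in text.split(","):
        bgn, dash, end = item.partition("-")
        bgn = int(bgn)
        end = int(end) if dash else None
        if end is not None and bgn > end:
            raise ValueError(f"range '{item}' must have bgn <= end")
        items.append((bgn, end))
    return items

def extract_bases(sequence, baselist):
    """Space-based: 'bgn-end' is the bases between spaces bgn and end;
    a single 'num' is the one base after space num."""
    pieces = []
    for bgn, end in parse_list(baselist):
        pieces.append(sequence[bgn:(bgn + 1 if end is None else end)])
    return pieces

def select_sequences(sequences, seqlist):
    """1-based, inclusive: '3-5' is the third, fourth and fifth sequences."""
    picked = []
    for bgn, end in parse_list(seqlist):
        picked.extend(sequences[bgn - 1:(bgn if end is None else end)])
    return picked

print(extract_bases("ACGTACGT", "0-2,4"))
print(select_sequences(["s1", "s2", "s3", "s4", "s5", "s6"], "1,3-5"))
```

Space-based coordinates map directly onto Python slices, which is why `0-2` yields two bases while the 1-based `3-5` yields three sequences.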

# Sampling

```
% seqrequester sample
usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]

OPTIONS for sample mode:
  -paired             treat inputs as paired sequences; the first two files form the
                      first pair, and so on.

  -copies C           write C different copies of the sampling (without replacement).
  -output O           write output sequences to file O.  If paired, two files must be supplied.

  -coverage C         output C coverage of sequences, based on genome size G.
  -genomesize G

  -bases B            output B bases.

  -reads R            output R reads.
  -pairs P            output P pairs (only if -paired).

  -fraction F         output fraction F of the input bases.

```
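
To show how `-coverage` and `-genomesize` relate to `-bases`, here is a hedged Python sketch: coverage C of a genome of size G is a target of C x G bases, and reads are then drawn without replacement until that target is met. The function names are invented for the example, and the stopping rule (stop once the target is reached, so the result may overshoot by part of one read) is an assumption rather than the tool's documented behavior.

```python
import random

def target_bases(coverage, genome_size):
    """Number of bases that corresponds to the requested coverage."""
    return int(coverage * genome_size)

def sample_reads(reads, bases_wanted, seed=None):
    """Pick reads at random, without replacement, until bases_wanted is reached."""
    rng = random.Random(seed)
    order = list(range(len(reads)))
    rng.shuffle(order)
    picked, total = [], 0
    for i in order:
        if total >= bases_wanted:
            break
        picked.append(i)
        total += len(reads[i])
    # Report the chosen reads in their original input order.
    return [reads[i] for i in sorted(picked)]

reads = ["A" * 10 for _ in range(10)]
wanted = target_bases(2.5, 10)   # 2.5x coverage of a 10-base "genome"
subset = sample_reads(reads, wanted, seed=1)
print(wanted, len(subset))
```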

# Generating

Undocumented.

# Simulating

```
% seqrequester simulate
usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]

OPTIONS for simulate mode:
  -genome G           sample reads from these sequences
  -circular           treat the sequences in G as circular

  -genomesize g       genome size to use for deciding coverage below
  -coverage c         generate approximately c coverage of output
  -nreads n           generate exactly n reads of output
  -nbases n           generate approximately n bases of output

  -distribution F     generate read length by sampling the distribution in file F
                        one column  - each line is the length of a sequence
                        two columns - each line has the 'length' and 'number of sequences'

                      if file F doesn't exist, use a built-in distribution
                        ultra-long-nanopore
                        pacbio
                        pacbio-hifi

  -length min[-max]   (not implemented)
  -output x.fasta     (not implemented)
```
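
The two `-distribution` file layouts described above can be pictured with the following Python sketch, which parses either layout and then draws reads of the sampled lengths from a genome string. Everything here is illustrative: the function names are hypothetical, and the choices to clip reads to the genome length for linear genomes and to wrap around for circular ones are assumptions about behavior the help text does not spell out.

```python
import random

def parse_distribution(text):
    """Parse one-column (length) or two-column (length, count) lines."""
    lengths, weights = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        lengths.append(int(fields[0]))
        # With one column, every line counts as a single observed sequence.
        weights.append(int(fields[1]) if len(fields) > 1 else 1)
    return lengths, weights

def simulate_reads(genome, lengths, weights, nreads, circular=False, seed=None):
    """Draw nreads reads whose lengths follow the weighted distribution."""
    rng = random.Random(seed)
    reads = []
    for _ in range(nreads):
        length = rng.choices(lengths, weights=weights, k=1)[0]
        if circular:
            start = rng.randrange(len(genome))
            read = "".join(genome[(start + i) % len(genome)] for i in range(length))
        else:
            length = min(length, len(genome))
            start = rng.randrange(len(genome) - length + 1)
            read = genome[start:start + length]
        reads.append(read)
    return reads

lengths, weights = parse_distribution("4 3\n6 1\n")
for read in simulate_reads("ACGTACGTAC", lengths, weights, nreads=5, seed=0):
    print(read)
```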