# seqrequester

This is 'seqrequester', a tool for summarizing, extracting, generating and
modifying DNA sequences.

# Summarizing

The summarize mode generates a table of Nx lengths and a lovely ASCII
plot of the histogram of sequence lengths, and reports GC content and
mono-, di- and tri-nucleotide frequencies.

With -split-n, it splits sequences at N's before computing the length of a sequence.

With -simple, you get a simple histogram of the sequence lengths and the number of
sequences at each length; with -lengths, just a simple list of all sequence lengths.

It will, of course, read FASTA and FASTQ, uncompressed or compressed with
gzip, bzip2 or xz.

Only one report is generated, regardless of how many sequence files are supplied.

```
% seqrequester summarize
usage: seqrequester [mode] [options] [sequence_file ...]

OPTIONS for summarize mode:
  -size          base size to use for N50 statistics
  -1x            limit NG table to 1x coverage

  -split-n       split sequences at N bases before computing length
  -simple        output a simple 'length numSequences' histogram
  -lengths       output a list of the sequence lengths

  -assequences   load data as complete sequences (for testing)
  -asbases       load data as blocks of bases (for testing)
```

```
% seqrequester summarize /archive/mothra/FLX/*gz

G=6462464889                       sum of  ||               length     num
NG          length     index      lengths  ||                range    seqs
----- ------------ --------- ------------  ||  ------------------- -------
00010          652    801160    646246790  ||         42-112          4768|-
00020          582   1862887   1292493013  ||        113-183         16961|-
00030          555   3002684   1938739802  ||        184-254         89381|--
00040          538   4186751   2584986254  ||        255-325        536862|--------
00050          523   5405461   3231232945  ||        326-396       1463599|--------------------
00060          509   6657839   3877479295  ||        397-467       1960924|---------------------------
00070          488   7952426   4523725460  ||        468-538       4616863|---------------------------------------------------------------
00080          447   9329218   5169971940  ||        539-609       2858982|----------------------------------------
00090          389  10872299   5816218777  ||        610-680        625376|---------
00100           42  12803136   6462464889  ||        681-751        252454|----
001.000x            12803137   6462464889  ||        752-822        134849|--
                                           ||        823-893         78435|--
                                           ||        894-964         47976|-
                                           ||        965-1035        30852|-
                                           ||       1036-1106        21127|-
                                           ||       1107-1177        14817|-
                                           ||       1178-1248        28461|-
                                           ||       1249-1319         4930|-
                                           ||       1320-1390         3655|-
                                           ||       1391-1461         2657|-
                                           ||       1462-1532         2120|-
                                           ||       1533-1603         1597|-
                                           ||       1604-1674         1268|-
                                           ||       1675-1745          953|-
                                           ||       1746-1816          766|-
                                           ||       1817-1887          573|-
                                           ||       1888-1958          443|-
                                           ||       1959-2029          344|-
                                           ||       2030-2100         1022|-
                                           ||       2101-2171           21|-
                                           ||       2172-2242           23|-
                                           ||       2243-2313           20|-
                                           ||       2314-2384           17|-
                                           ||       2385-2455            8|-
                                           ||       2456-2526            9|-
                                           ||       2527-2597            4|-
                                           ||       2598-2668            2|-
                                           ||       2669-2739            6|-
                                           ||       2740-2810            2|-
                                           ||       2811-2881            6|-
                                           ||       2882-2952            1|-
                                           ||       2953-3023            0|
                                           ||       3024-3094            0|
                                           ||       3095-3165            1|-
                                           ||       3166-3236            0|
                                           ||       3237-3307            0|
                                           ||       3308-3378            0|
                                           ||       3379-3449            1|-
                                           ||       3450-3520            0|
                                           ||       3521-3591            1|-

--------------------- --------------------- ----------------------------------------------------------------------------------------------
       mononucleotide          dinucleotide                                                                                  trinucleotide
--------------------- --------------------- ----------------------------------------------------------------------------------------------
  1959571306 0.3032 A   665030151 0.1031 AA   237235545 0.0369 AAA    132268487 0.0205 AAC    136675399 0.0212 AAG    158473516 0.0246 AAT
  1247489432 0.1930 C   389352138 0.0604 AC   115665542 0.0180 ACA     87346626 0.0136 ACC     70986769 0.0110 ACG    114582435 0.0178 ACT
  1345011807 0.2081 G   397219280 0.0616 AG   121659180 0.0189 AGA     65811037 0.0102 AGC    102037062 0.0159 AGG    106854671 0.0166 AGT
  1910392344 0.2956 T   507072196 0.0786 AT   152454159 0.0237 ATA     89877335 0.0140 ATC    106195089 0.0165 ATG    158544503 0.0246 ATT
                        380831936 0.0590 CA   132169383 0.0205 CAA     76839888 0.0119 CAC     67197045 0.0104 CAG    104566859 0.0162 CAT
       --GC-- --AT--    281892951 0.0437 CC    86178881 0.0134 CCA     65022575 0.0101 CCC     50576089 0.0079 CCG     79660170 0.0124 CCT
       40.12% 59.88%    208535008 0.0323 CG    60164341 0.0093 CGA     27649662 0.0043 CGC     52322022 0.0081 CGG     67296554 0.0105 CGT
                        374626420 0.0581 CT    95122699 0.0148 CTA     75643338 0.0118 CTC     74304266 0.0115 CTG    129554475 0.0201 CTT
                        383528854 0.0595 GA   128291282 0.0199 GAA     70244915 0.0109 GAC     88104990 0.0137 GAG     96746854 0.0150 GAT
                        218253748 0.0338 GC    72696062 0.0113 GCA     51118632 0.0079 GCC     27797659 0.0043 GCG     66512705 0.0103 GCT
                        361154273 0.0560 GG    87662449 0.0136 GGA     51591104 0.0080 GGC    124820908 0.0194 GGG     89162545 0.0139 GGT
                        371793122 0.0576 GT   112773785 0.0175 GTA     61960408 0.0096 GTC     66540524 0.0103 GTG    130508651 0.0203 GTT
                        526153182 0.0816 TA   165978181 0.0258 TAA    109335141 0.0170 TAC    104450634 0.0162 TAG    146068463 0.0227 TAT
                        355583693 0.0551 TC   105474843 0.0164 TCA     77928750 0.0121 TCC     58773061 0.0091 TCG    113158614 0.0176 TCT
                        375551419 0.0582 TG   113251416 0.0176 TGA     72683975 0.0113 TGC     81429936 0.0127 TGG    107781308 0.0167 TGT
                        653083381 0.1013 TT   164899014 0.0256 TTA    127390884 0.0198 TTC    127703170 0.0198 TTG    233082150 0.0362 TTT
```

# Extracting

```
% seqrequester extract
usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]

OPTIONS for extract mode:
  -bases     baselist extract bases as specified in the 'list' from each sequence
  -sequences seqlist  extract ordinal sequences as specified in the 'list'

  -reverse            reverse the bases in the sequence
  -complement         complement the bases in the sequence
  -rc                 alias for -reverse -complement

  -compress           compress homopolymer runs to one base

  -upcase
  -downcase

  -length min-max     print sequence if it is at least 'min' bases and at most 'max' bases long

  a 'baselist' is a set of integers formed from any combination
  of the following, separated by a comma:
       num       a single number
       bgn-end   a range of numbers:  bgn <= end
  bases are space-based; -bases 0-2,4 will print the bases between
  the first two spaces (the first two bases) and the base after the
  fourth space (the fifth base).

  a 'seqlist' is a set of integers formed from any combination
  of the following, separated by a comma:
       num       a single number
       bgn-end   a range of numbers:  bgn <= end
  sequences are 1-based; -sequences 1,3-5 will print the first, third,
  fourth and fifth sequences.
```

# Sampling

```
% seqrequester sample
usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]

OPTIONS for sample mode:
  -paired             treat inputs as paired sequences; the first two files form the
                      first pair, and so on.

  -copies C           write C different copies of the sampling (without replacement).
  -output O           write output sequences to file O.  If paired, two files must be supplied.

  -coverage C         output C coverage of sequences, based on genome size G.
  -genomesize G

  -bases B            output B bases.

  -reads R            output R reads.
  -pairs P            output P pairs (only if -paired).

  -fraction F         output fraction F of the input bases.

```

# Generating

Undocumented.

# Simulating

```
% seqrequester simulate
usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]

OPTIONS for simulate mode:
  -genome G           sample reads from these sequences
  -circular           treat the sequences in G as circular

  -genomesize g       genome size to use for deciding coverage below
  -coverage c         generate approximately c coverage of output
  -nreads n           generate exactly n reads of output
  -nbases n           generate approximately n bases of output

  -distribution F     generate read length by sampling the distribution in file F
                        one column  - each line is the length of a sequence
                        two columns - each line has the 'length' and 'number of sequences'

                      if file F doesn't exist, use a built-in distribution
                        ultra-long-nanopore
                        pacbio
                        pacbio-hifi

  -length min[-max]   (not implemented)
  -output x.fasta     (not implemented)
```
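
The two -distribution file layouts and the -genomesize/-coverage pair can be illustrated with a short Python sketch. This is not seqrequester's implementation: the function names, the weighted sampling, and the stopping rule (stop once genomesize x coverage bases have been drawn) are assumptions made purely for illustration.

```python
import random


def parse_length_distribution(text):
    """Parse a one- or two-column length distribution (hypothetical helper).

    One column:  each line is the length of one sequence (weight 1).
    Two columns: each line is 'length' and 'number of sequences'.
    """
    lengths, weights = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        length = int(fields[0])
        if length <= 0:
            raise ValueError("read lengths must be positive")
        lengths.append(length)
        weights.append(int(fields[1]) if len(fields) > 1 else 1)
    return lengths, weights


def sample_read_lengths(lengths, weights, genome_size, coverage, seed=0):
    """Draw read lengths until roughly genome_size * coverage bases are reached.

    The stopping rule is an assumption; it only mirrors the
    'approximately c coverage' wording of the usage text.
    """
    rng = random.Random(seed)
    target_bases = genome_size * coverage
    total, reads = 0, []
    while total < target_bases:
        length = rng.choices(lengths, weights=weights, k=1)[0]
        reads.append(length)
        total += length
    return reads


if __name__ == "__main__":
    # Two-column input: 3 sequences of length 100, 1 of length 200.
    lengths, weights = parse_length_distribution("100 3\n200 1\n")
    reads = sample_read_lengths(lengths, weights, genome_size=1000, coverage=2)
    print(len(reads), "reads,", sum(reads), "bases")
```

Because the last read drawn can overshoot the target, the total is only approximately the requested coverage, which matches the "approximately" wording of -coverage and -nbases.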