ntHash
======
ntHash is a recursive hash function for hashing all possible k-mers in a DNA/RNA sequence.


# Build the test suite

```
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
```

To install nttest in a specified directory:

```
$ ./autogen.sh
$ ./configure --prefix=/opt/ntHash/
$ make
$ make install
```

The nttest suite provides *runtime* and *uniformity* tests.

## Runtime test
For the runtime test, the program has the following options:
```
nttest [OPTIONS] ... [FILE]
```
Parameters:
  * `-k`, `--kmer=SIZE`: the k-mer length used for runtime test hashing `[50]`
  * `-h`, `--hash=SIZE`: the number of hashes generated for each k-mer `[1]`
  * `FILE`: the input FASTA or FASTQ file

For example, to evaluate the runtime of different hash methods on the test file `reads.fa` in the `DATA/` folder for k-mer length `50`, run:
```
$ nttest -k50 reads.fa
```

## Uniformity test
For the uniformity test using the Bloom filter data structure, the program has the following options:
```
nttest --uniformity [OPTIONS] ... [REF_FILE] [QUERY_FILE]
```

Parameters:
  * `-q`, `--qnum=SIZE`: number of queries in the query file
  * `-l`, `--qlen=SIZE`: length of reads in the query file
  * `-t`, `--tnum=SIZE`: number of sequences in the reference file
  * `-g`, `--tlen=SIZE`: length of each reference sequence
  * `-i`, `--input`: generate random query and reference files
  * `-j`, `--threads=SIZE`: number of threads to run the uniformity test `[1]`
  * `REF_FILE`: the reference file name
  * `QUERY_FILE`: the query file name

For example, to evaluate the uniformity of different hash methods using the Bloom filter data structure on randomly generated data sets with the following options:
  * `100` genes of length `5,000,000bp` as reference in file `genes.fa`
  * `4,000,000` reads of length `250bp` as query in file `reads.fa`
  * `12` threads

run:
```
$ nttest --uniformity --input -q4000000 -l250 -t100 -g5000000 -j12 genes.fa reads.fa
```

## Code samples
To hash all k-mers of length `k` in a given sequence `seq`:
```C++
    string kmer = seq.substr(0, k);
    uint64_t hVal = 0;
    hVal = NTF64(kmer.c_str(), k); // initial hash value
    ...
    // assumes seq.length() >= k; each step drops seq[i] and adds seq[i+k]
    for (size_t i = 0; i < seq.length() - k; i++)
    {
        hVal = NTF64(hVal, seq[i], seq[i+k], k); // consecutive hash values
        ...
    }
```
To compute canonical hash values for all k-mers of length `k` in a given sequence `seq`:
```C++
    string kmer = seq.substr(0, k);
    uint64_t hVal, fhVal = 0, rhVal = 0; // canonical, forward, and reverse-strand hash values
    hVal = NTC64(kmer.c_str(), k, fhVal, rhVal); // initial hash value
    ...
    // assumes seq.length() >= k
    for (size_t i = 0; i < seq.length() - k; i++)
    {
        hVal = NTC64(seq[i], seq[i+k], k, fhVal, rhVal); // consecutive hash values
        ...
    }
```
To compute `h` hash values for each k-mer of length `k` in a given sequence `seq`:
```C++
    string kmer = seq.substr(0, k);
    uint64_t hVec[h];
    NTM64(kmer.c_str(), k, h, hVec); // initial hash vector
    ...
    // assumes seq.length() >= k
    for (size_t i = 0; i < seq.length() - k; i++)
    {
        NTM64(seq[i], seq[i+k], k, h, hVec); // consecutive hash vectors
        ...
    }
```

# ntHashIterator
Enables ntHash on sequences.

To hash all k-mers of length `k` in a given sequence `seq` with `h` hash values using ntHashIterator:
```C++
ntHashIterator itr(seq, h, k);
while (itr != itr.end())
{
    ... use *itr ...
    ++itr;
}
```

## Usage example (C++)
Outputting the hash values of all k-mers in a sequence:

```C++
#include <iostream>
#include <string>
#include "ntHashIterator.hpp"

int main(int argc, const char* argv[])
{
    /* test sequence */
    std::string seq = "GAGTGTCAAACATTCAGACAACAGCAGGGGTGCTCTGGAATCCTATGTGAGGAACAAACATTCAGGCCACAGTAG";

    /* k is the k-mer length */
    unsigned k = 70;

    /* h is the number of hashes for each k-mer */
    unsigned h = 1;

    /* init ntHash state and compute hash values for first k-mer */
    ntHashIterator itr(seq, h, k);
    while (itr != itr.end()) {
        /* print the first (and here only) hash value of the current k-mer */
        std::cout << (*itr)[0] << std::endl;
        ++itr;
    }

    return 0;
}
```

Publications
============

## [ntHash](http://bioinformatics.oxfordjournals.org/content/early/2016/08/01/bioinformatics.btw397)

Hamid Mohamadi, Justin Chu, Benjamin P Vandervalk, and Inanc Birol.
**ntHash: recursive nucleotide hashing**.
*Bioinformatics* (2016) 32 (22): 3492-3494.
[doi:10.1093/bioinformatics/btw397](http://dx.doi.org/10.1093/bioinformatics/btw397)


# Acknowledgements

This project uses:
* [CATCH](https://github.com/philsquared/Catch) unit test framework for C/C++
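
# Sketch of the recursive update

The code samples call `NTF64` once for the first k-mer and then update the value from one outgoing and one incoming base. The block below is a minimal, self-contained sketch of that rolling idea, not the library's implementation: the names `rotl64`, `baseSeed`, `hashDirect`, `hashRoll`, and `hashAllKmers` are hypothetical, and the seed constants are arbitrary values chosen for illustration rather than ntHash's own seed table. It assumes the hash of a k-mer is the XOR of per-base seeds, each rotated according to its position.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Rotate a 64-bit word left by r bits (r is reduced modulo 64).
static inline uint64_t rotl64(uint64_t x, unsigned r) {
    r %= 64;
    return r == 0 ? x : (x << r) | (x >> (64 - r));
}

// Illustrative per-base seeds. These are arbitrary constants for this
// sketch (an assumption), not the values used by the ntHash library.
static inline uint64_t baseSeed(char c) {
    switch (c) {
        case 'A': return 0x9E3779B97F4A7C15ULL;
        case 'C': return 0xBF58476D1CE4E5B9ULL;
        case 'G': return 0x94D049BB133111EBULL;
        case 'T': return 0xD6E8FEB86659FD93ULL;
        default:  return 0; // any other character contributes nothing
    }
}

// Hash the k-mer starting at pos from scratch: O(k) work.
static inline uint64_t hashDirect(const std::string& seq, std::size_t pos, unsigned k) {
    uint64_t h = 0;
    for (unsigned i = 0; i < k; ++i) {
        h = rotl64(h, 1) ^ baseSeed(seq[pos + i]);
    }
    return h;
}

// Derive the next k-mer's hash from the previous one: O(1) work.
// Rotating prev by one moves the outgoing base's seed to rotation k,
// where XOR-ing rotl64(baseSeed(out), k) cancels it.
static inline uint64_t hashRoll(uint64_t prev, char out, char in, unsigned k) {
    return rotl64(prev, 1) ^ rotl64(baseSeed(out), k) ^ baseSeed(in);
}

// Hash every k-mer of seq; returns an empty vector if seq is shorter than k.
static inline std::vector<uint64_t> hashAllKmers(const std::string& seq, unsigned k) {
    std::vector<uint64_t> hashes;
    if (k == 0 || seq.size() < k) return hashes;
    uint64_t h = hashDirect(seq, 0, k);
    hashes.push_back(h);
    for (std::size_t i = 0; i + k < seq.size(); ++i) {
        h = hashRoll(h, seq[i], seq[i + k], k);
        hashes.push_back(h);
    }
    return hashes;
}
```

Calling `hashAllKmers(seq, k)` yields `seq.size() - k + 1` values, and each one equals what `hashDirect` computes for the same position. The explicit length check also avoids the unsigned underflow that `seq.length() - k` would cause for a sequence shorter than `k`.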