| Name | Date | Size | #Lines | LOC |
|---|---|---|---|---|
| lib/ | 21-Apr-2021 | - | 2,310 | 1,658 |
| unittest/ | 21-Apr-2021 | - | 144 | 85 |
| vendor/ | 21-Apr-2021 | - | 10,203 | 8,115 |
| CITATION.bib | 21-Apr-2021 | 483 | 13 | 12 |
| ChangeLog | 21-Apr-2021 | 247 | 10 | 6 |
| LICENSE | 21-Apr-2021 | 1 KiB | 22 | 17 |
| Makefile.am | 21-Apr-2021 | 431 | 31 | 24 |
| README.md | 21-Apr-2021 | 4.4 KiB | 163 | 134 |
| autogen.sh | 21-Apr-2021 | 58 | 7 | 5 |
| azure-pipelines.yml | 21-Apr-2021 | 464 | 18 | 17 |
| configure.ac | 21-Apr-2021 | 1.9 KiB | 92 | 76 |
| ntHashIterator.hpp | 21-Apr-2021 | 3.3 KiB | 154 | 92 |
| nthash.hpp | 21-Apr-2021 | 28.9 KiB | 682 | 559 |
| nttest.cpp | 21-Apr-2021 | 19.6 KiB | 652 | 582 |
| ssHashIterator.hpp | 21-Apr-2021 | 2.5 KiB | 125 | 68 |
| sstest.cpp | 21-Apr-2021 | 608 | 28 | 19 |
| stHashIterator.hpp | 21-Apr-2021 | 4.2 KiB | 177 | 102 |
| sttest.cpp | 21-Apr-2021 | 1.6 KiB | 54 | 39 |

README.md

ntHash
=
ntHash is a recursive hash function for hashing all possible k-mers in a DNA/RNA sequence.

# Build the test suite

```
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
```

To install nttest in a specified directory:

```
$ ./autogen.sh
$ ./configure --prefix=/opt/ntHash/
$ make
$ make install
```

The nttest suite provides *runtime* and *uniformity* tests.

## Runtime test
For the runtime test, the program has the following options:
```
nttest [OPTIONS] ... [FILE]
```
Parameters:
  * `-k`, `--kmer=SIZE`: the k-mer length used for runtime test hashing `[50]`
  * `-h`, `--hash=SIZE`: the number of hashes generated for each k-mer `[1]`
  * `FILE`: the input FASTA or FASTQ file

For example, to evaluate the runtime of different hash methods on the test file `reads.fa` in the `DATA/` folder for k-mer length `50`, run:
```
$ nttest -k50 reads.fa
```

## Uniformity test
For the uniformity test using the Bloom filter data structure, the program has the following options:
```
nttest --uniformity [OPTIONS] ... [REF_FILE] [QUERY_FILE]
```

Parameters:
  * `-q`, `--qnum=SIZE`: number of queries in the query file
  * `-l`, `--qlen=SIZE`: length of reads in the query file
  * `-t`, `--tnum=SIZE`: number of sequences in the reference file
  * `-g`, `--tlen=SIZE`: length of each reference sequence
  * `-i`, `--input`: generate random query and reference files
  * `-j`, `--threads=SIZE`: number of threads used to run the uniformity test `[1]`
  * `REF_FILE`: the reference file name
  * `QUERY_FILE`: the query file name

For example, to evaluate the uniformity of different hash methods using the Bloom filter data structure on randomly generated data sets with the following options:
  * `100` genes of length `5,000,000bp` as reference in file `genes.fa`
  * `4,000,000` reads of length `250bp` as query in file `reads.fa`
  * `12` threads

run:
```
$ nttest --uniformity --input -q4000000 -l250 -t100 -g5000000 -j12 genes.fa reads.fa
```

## Code samples
To hash all k-mers of length `k` in a given sequence `seq`:
```C++
string kmer = seq.substr(0, k);
uint64_t hVal = 0;
hVal = NTF64(kmer.c_str(), k); // initial hash value
...
for (size_t i = 0; i < seq.length() - k; i++)
{
    // roll the hash: seq[i] leaves the window, seq[i+k] enters it
    hVal = NTF64(hVal, seq[i], seq[i+k], k); // consecutive hash values
    ...
}
```
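
To illustrate why consecutive hash values are cheap to compute, here is a minimal sketch of a rolling hash built from rotations and XORs. It is not the code in `nthash.hpp`: the names `toySeed`, `rotl64`, `toyHash`, `toyRoll`, and `toyHashAll` are hypothetical, and the seed constants are arbitrary values chosen for this example. The point is only that each new value is derived from the previous one in constant time, independent of `k`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Arbitrary per-base seeds for illustration (assumption: NOT the library's constants).
inline uint64_t toySeed(char base) {
    switch (base) {
        case 'A': return 0x9E3779B97F4A7C15ULL;
        case 'C': return 0xC2B2AE3D27D4EB4FULL;
        case 'G': return 0x165667B19E3779F9ULL;
        case 'T': return 0x27D4EB2F165667C5ULL;
        default:  return 0; // any other character contributes nothing
    }
}

// Rotate a 64-bit word left by r bits (r is reduced modulo 64).
inline uint64_t rotl64(uint64_t x, unsigned r) {
    r &= 63u;
    return r == 0 ? x : (x << r) | (x >> (64u - r));
}

// Hash one k-mer from scratch: O(k) work.
inline uint64_t toyHash(const char* kmer, unsigned k) {
    assert(k > 0);
    uint64_t h = 0;
    for (unsigned i = 0; i < k; ++i)
        h = rotl64(h, 1) ^ toySeed(kmer[i]);
    return h;
}

// Derive the next k-mer's hash from the previous one: O(1) work.
// 'out' is the base leaving the window, 'in' is the base entering it.
inline uint64_t toyRoll(uint64_t h, char out, char in, unsigned k) {
    return rotl64(h, 1) ^ rotl64(toySeed(out), k) ^ toySeed(in);
}

// Hash every k-mer of seq; returns an empty vector if seq is shorter than k.
inline std::vector<uint64_t> toyHashAll(const std::string& seq, unsigned k) {
    std::vector<uint64_t> hashes;
    if (k == 0 || seq.size() < k) return hashes;
    uint64_t h = toyHash(seq.data(), k);
    hashes.push_back(h);
    for (std::size_t i = 0; i + k < seq.size(); ++i) {
        h = toyRoll(h, seq[i], seq[i + k], k);
        hashes.push_back(h);
    }
    return hashes;
}
```

Writing the loop condition as `i + k < seq.size()` avoids the unsigned underflow that `seq.length() - k` would produce when the sequence is shorter than `k`.
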
To compute canonical hash values for all k-mers of length `k` in a given sequence `seq`:
```C++
string kmer = seq.substr(0, k);
uint64_t hVal, fhVal=0, rhVal=0; // canonical, forward, and reverse-strand hash values
hVal = NTC64(kmer.c_str(), k, fhVal, rhVal); // initial hash value
...
for (size_t i = 0; i < seq.length() - k; i++)
{
    hVal = NTC64(seq[i], seq[i+k], k, fhVal, rhVal); // consecutive hash values
    ...
}
```
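
A canonical hash is strand-independent: a k-mer and its reverse complement receive the same value. The sketch below shows one common way to get that property, by hashing both strands and keeping the smaller value. Treat it as an assumption-laden illustration: `complementBase`, `reverseComplement`, `toyStrandHash`, and `toyCanonicalHash` are hypothetical names, FNV-1a is used only as a stand-in strand hash, and how `NTC64` actually combines `fhVal` and `rhVal` should be checked in `nthash.hpp`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Watson-Crick complement of a single base; anything else maps to 'N'.
inline char complementBase(char b) {
    switch (b) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return 'N';
    }
}

// Reverse the sequence, then complement every base.
inline std::string reverseComplement(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());
    for (char& c : rc) c = complementBase(c);
    return rc;
}

// Stand-in strand-specific hash (64-bit FNV-1a); assumption: not what ntHash uses.
inline uint64_t toyStrandHash(const std::string& s) {
    uint64_t h = 0xcbf29ce484222325ULL; // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 0x100000001b3ULL;          // FNV prime
    }
    return h;
}

// Canonical value: the smaller of the forward and reverse-complement hashes.
inline uint64_t toyCanonicalHash(const std::string& kmer) {
    assert(!kmer.empty());
    return std::min(toyStrandHash(kmer), toyStrandHash(reverseComplement(kmer)));
}
```

Unlike this sketch, which recomputes the reverse complement for every k-mer, the `NTC64` sample above carries `fhVal` and `rhVal` between calls so both strand values are updated as the window slides.
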
To compute `h` hash values for each k-mer of length `k` in a given sequence `seq`:
```C++
string kmer = seq.substr(0, k);
uint64_t hVec[h];
NTM64(kmer.c_str(), k, h, hVec); // initial hash vector
...
for (size_t i = 0; i < seq.length() - k; i++)
{
    NTM64(seq[i], seq[i+k], k, h, hVec); // consecutive hash vectors
    ...
}
```

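When several hash values per k-mer are needed, for example for a Bloom filter, a frequent approach is to compute one base hash and derive the others from it with a cheap mixing step rather than hashing the k-mer again. The sketch below illustrates that idea only. `toyMultiHash` and both constants are hypothetical, and the mixing actually performed by `NTM64` may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Arbitrary mixing constants for illustration (assumption: NOT the library's values).
constexpr uint64_t kToyMultiplier = 0x9E3779B97F4A7C15ULL; // odd, so multiplication is invertible
constexpr unsigned kToyShift = 29;

// Fill hVec with hVec.size() values derived from a single base hash.
// hVec[0] is the base hash itself; the rest are mixed variants of it.
inline void toyMultiHash(uint64_t base, std::vector<uint64_t>& hVec) {
    assert(!hVec.empty()); // caller decides how many hashes it wants
    hVec[0] = base;
    for (std::size_t i = 1; i < hVec.size(); ++i) {
        uint64_t t = (base + i) * kToyMultiplier; // multiply step
        t ^= t >> kToyShift;                      // xor-shift step
        hVec[i] = t;
    }
}
```

A `std::vector<uint64_t>` sized to `h` is used here because a runtime-sized array such as `uint64_t hVec[h]` is a compiler extension rather than standard C++.
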
# ntHashIterator
ntHashIterator provides an iterator interface for running ntHash over a sequence.

To hash all k-mers of length `k` in a given sequence `seq` with `h` hash values using ntHashIterator:
```C++
ntHashIterator itr(seq, h, k);
while (itr != itr.end())
{
    // ... use *itr ...
    ++itr;
}
```

## Usage example (C++)
Outputting the hash values of all k-mers in a sequence:

```C++
#include <iostream>
#include <string>
#include "ntHashIterator.hpp"

int main(int argc, const char* argv[])
{
	/* test sequence */
	std::string seq = "GAGTGTCAAACATTCAGACAACAGCAGGGGTGCTCTGGAATCCTATGTGAGGAACAAACATTCAGGCCACAGTAG";

	/* k is the k-mer length */
	unsigned k = 70;

	/* h is the number of hashes for each k-mer */
	unsigned h = 1;

	/* init ntHash state and compute hash values for first k-mer */
	ntHashIterator itr(seq, h, k);
	while (itr != itr.end()) {
		/* print the first (and here only) hash value of the current k-mer */
		std::cout << (*itr)[0] << std::endl;
		++itr;
	}

	return 0;
}
```

Publications
============

## [ntHash](http://bioinformatics.oxfordjournals.org/content/early/2016/08/01/bioinformatics.btw397)

Hamid Mohamadi, Justin Chu, Benjamin P Vandervalk, and Inanc Birol.
**ntHash: recursive nucleotide hashing**.
*Bioinformatics* (2016) 32 (22): 3492-3494.
[doi:10.1093/bioinformatics/btw397](http://dx.doi.org/10.1093/bioinformatics/btw397)


# Acknowledgements

This project uses:
* [CATCH](https://github.com/philsquared/Catch) unit test framework for C/C++