vendor/nthash/README.md

ntHash
=
ntHash is a recursive hash function for hashing all possible k-mers in a DNA/RNA sequence.


# Build the test suite

```
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
```

To install nttest in a specified directory:

```
$ ./autogen.sh
$ ./configure --prefix=/opt/ntHash/
$ make
$ make install
```

The nttest suite has the options for *runtime* and *uniformity* tests.

## Runtime test
For the runtime test the program has the following options:
```
nttest [OPTIONS] ... [FILE]
```
Parameters:
  * `-k`,  `--kmer=SIZE`: the length of k-mer used for runtime test hashing `[50]`
  * `-h`,  `--hash=SIZE`: the number of generated hashes for each k-mer `[1]`
  * `FILE`: is the input fasta or fastq file

For example to evaluate the runtime of different hash methods on the test file `reads.fa` in DATA/ folder for k-mer length `50`, run:
```
$ nttest -k50 reads.fa
```

## Uniformity test
For the uniformity test using the Bloom filter data structure the program has the following options:
```
nttest --uniformity [OPTIONS] ... [REF_FILE] [QUERY_FILE]
```

Parameters:
  * `-q`, `--qnum=SIZE`: number of queries in query file
  * `-l`, `--qlen=SIZE`: length of reads in query file
  * `-t`, `--tnum=SIZE`: number of sequences in reference file
  * `-g`, `--tlen=SIZE`: length of reference sequence
  * `-i`, `--input`: generate random query and reference files
  * `-j`, `threads=SIZE`: number of threads to run uniformity test `[1]`
  * `REF_FILE`: the reference file name
  * `QUERY_FILE`: the query file name

For example, to evaluate the uniformity of different hash methods using the Bloom filter data structure on randomly generated data sets with following options:
  * `100` genes of length `5,000,000bp` as reference in file `genes.fa`
  * `4,000,000` reads of length `250bp` as query in file `reads.fa`
  * `12` threads

run:
```
$ nttest --uniformity --input -q4000000 -l250 -t100 -g5000000 -j12 genes.fa reads.fa
```

## Code samples
To hash all k-mers of length `k` in a given sequence `seq`:
```bash
    string kmer = seq.substr(0, k);
    uint64_t hVal=0;
    hVal = NTF64(kmer.c_str(), k); // initial hash value
    ...
    for (size_t i = 0; i < seq.length() - k; i++)
    {
        hVal = NTF64(hVal, seq[i], seq[i+k], k); // consecutive hash values
        ...
    }
```
To canonical hash all k-mers of length `k` in a given sequence `seq`:
```bash
    string kmer = seq.substr(0, k);
    uint64_t hVal, fhVal=0, rhVal=0; // canonical, forward, and reverse-strand hash values
    hVal = NTC64(kmer.c_str(), k, fhVal, rhVal); // initial hash value
    ...
    for (size_t i = 0; i < seq.length() - k; i++)
    {
        hVal = NTC64(seq[i], seq[i+k], k, fhVal, rhVal); // consecutive hash values
        ...
    }
```
To multi-hash with `h` hash values all k-mers of length `k` in a given sequence `seq`:
```bash
    string kmer = seq.substr(0, k);
    uint64_t hVec[h];
    NTM64(kmer.c_str(), k, h, hVec); // initial hash vector
    ...
    for (size_t i = 0; i < seq.length() - k; i++)
    {
        NTM64(seq[i], seq[i+k], k, h, hVec); // consecutive hash vectors
        ...
    }
```

# ntHashIterator
Enables ntHash on sequences

To hash all k-mers of length `k` in a given sequence `seq` with `h` hash values using ntHashIterator:
```bash
ntHashIterator itr(seq, h, k);
while (itr != itr.end())
{
 ... use *itr ...
 ++itr;
}
```

## Usage example (C++)
Outputing hash values of all k-mers in a sequence

```C++
#include <iostream>
#include <string>
#include "ntHashIterator.hpp"

int main(int argc, const char* argv[])
{
	/* test sequence */
	std::string seq = "GAGTGTCAAACATTCAGACAACAGCAGGGGTGCTCTGGAATCCTATGTGAGGAACAAACATTCAGGCCACAGTAG";

	/* k is the k-mer length */
	unsigned k = 70;

	/* h is the number of hashes for each k-mer */
	unsigned h = 1;

	/* init ntHash state and compute hash values for first k-mer */
	ntHashIterator itr(seq, h, k);
	while (itr != itr.end()) {
		std::cout << (*itr)[0] << std::endl;
		++itr;
	}

	return 0;
}
```

Publications
============

## [ntHash](http://bioinformatics.oxfordjournals.org/content/early/2016/08/01/bioinformatics.btw397)

Hamid Mohamadi, Justin Chu, Benjamin P Vandervalk, and Inanc Birol.
**ntHash: recursive nucleotide hashing**.
*Bioinformatics* (2016) 32 (22): 3492-3494.
[doi:10.1093/bioinformatics/btw397 ](http://dx.doi.org/10.1093/bioinformatics/btw397)


# acknowledgements

This projects uses:
* [CATCH](https://github.com/philsquared/Catch) unit test framework for C/C++