1Edlib 2· 3[![Latest Github release](https://img.shields.io/github/release/Martinsos/edlib.svg)](https://github.com/Martinsos/edlib/releases/latest) 4[![Build status of the master branch on Linux/OSX](https://img.shields.io/travis/Martinsos/edlib/master?label=Linux%20%2F%20OSX%20build)](https://travis-ci.com/Martinsos/edlib) 5[![Build status of the master branch on Windows](https://img.shields.io/appveyor/build/Martinsos/edlib/master?label=Windows%20build)](https://ci.appveyor.com/project/Martinsos/edlib/branch/master) 6[![Chat on Gitter](https://img.shields.io/gitter/room/Martinsos/edlib.svg?colorB=753a88)](https://gitter.im/Martinsos/edlib) 7[![Published in Bioinformatics](https://img.shields.io/badge/Published%20in-Bioinformatics-167DA4.svg)](https://doi.org/10.1093/bioinformatics/btw753) 8===== 9 10A lightweight and super fast C/C++ library for sequence alignment using [edit distance](https://en.wikipedia.org/wiki/Edit_distance). 11 12Calculating edit distance of two strings is as simple as: 13```c 14edlibAlign("hello", 5, "world!", 6, edlibDefaultAlignConfig()).editDistance; 15``` 16 17Edlib is also available for **Python** [![PyPI version](https://img.shields.io/pypi/v/edlib.svg) (Click here for Python README)](https://pypi.python.org/pypi/edlib), with code residing at [bindings/python](bindings/python). 18 19There is also non-official [binding for Julia](https://github.com/cjdoris/Edlib.jl) by @cjdoris. 20 21## Features 22* Calculates **edit distance (Levenshtein distance)**. 23* It can find **optimal alignment path** (instructions how to transform first sequence into the second sequence). 24* It can find just the **start and/or end locations of alignment path** - can be useful when speed is more important than having exact alignment path. 25* Supports **multiple [alignment methods](#alignment-methods)**: global(**NW**), prefix(**SHW**) and infix(**HW**), each of them useful for different scenarios. 26* You can **extend character equality definition**, enabling you to e.g. have wildcard characters, to have case insensitive alignment or to work with degenerate nucleotides. 27* It can easily handle small or **very large sequences**, even when finding alignment path, while consuming very little memory. 28* **Super fast** thanks to Myers's bit-vector algorithm. 29 30 31## Contents 32- [Features](#features) 33- [Building](#building) 34- [Using Edlib in your project](#using-edlib-in-your-project) 35- [Usage and examples](#usage-and-examples) 36- [API documentation](#api-documentation) 37- [Alignment methods](#alignment-methods) 38- [Aligner](#aligner) 39- [Running tests](#running-tests) 40- [Time and space complexity](#time-and-space-complexity) 41- [Test data](#test-data) 42- [Development and contributing](#development-and-contributing) 43- [Publication](#publication) 44- [Acknowledgements](#acknowledgements) 45 46 47## Building 48### Meson 49Primary way of building Edlib is via [Meson](https://mesonbuild.com/) build tool. 50 51Requirements: make sure that you have `meson` installed on your system. 52 53Execute 54``` 55make 56``` 57to build **static** library and binaries (apps and tests) and also run tests. 58To build **shared** library and binaries, do `make LIBRARY_TYPE=shared`. 59 60Library and binaries will be created in `meson-build` directory. 61You can choose alternate build directory like this: `make BUILD_DIR=some-other-dir`. 62 63Optionally, you can run 64``` 65sudo make install 66``` 67to install edlib library on your machine (on Linux, this will usually install it to `usr/local/lib` and `usr/local/include`). 68 69Check Makefile if you want to run individual steps on your own (building, tests, ...). 70 71NOTE: If you need more control, use `meson` command directly, `Makefile` is here only to help with common commands. 72 73### CMake 74Edlib can alternatively be built with CMake. 75 76Execute following command to build Edlib using CMAKE: 77``` 78cd build && cmake -D CMAKE_BUILD_TYPE=Release .. && make 79``` 80This will create binaries in `bin/` directory and libraries (static and shared) in `lib/` directory. 81 82``` 83./bin/runTests 84``` 85to run tests. 86 87Optionally, you can run 88``` 89sudo make install 90``` 91to install edlib library on your machine. 92 93### Conda 94Edlib can also be installed via Conda: [![Anaconda-Server Badge](https://anaconda.org/bioconda/edlib/badges/installer/conda.svg)](https://conda.anaconda.org/bioconda): `conda install edlib`. 95 96 97## Using Edlib in your project 98You can use Edlib in you project by either directly copying header and source files from [edlib/](edlib/), or by linking Edlib library (see [Building](#building) for instructions how to build Edlib libraries). 99In any case, only thing that you have to do in your source files is to include `edlib.h`. 100 101To get you started quickly, let's take a look at a few ways to get simple Hello World project working. 102 103Our Hello World project has just one source file, `helloWorld.cpp` file, and it looks like this: 104```cpp 105#include <cstdio> 106#include "edlib.h" 107 108int main() { 109 EdlibAlignResult result = edlibAlign("hello", 5, "world!", 6, edlibDefaultAlignConfig()); 110 if (result.status == EDLIB_STATUS_OK) { 111 printf("edit_distance('hello', 'world!') = %d\n", result.editDistance); 112 } 113 edlibFreeAlignResult(result); 114} 115``` 116 117Running it should output `edit_distance('hello', 'world!') = 5`. 118 119### Approach #1: Directly copying edlib source and header files. 120Here we directly copied [edlib/](edlib/) directory to our project, to get following project structure: 121``` 122edlib/ -> copied from edlib/ 123 include/ 124 edlib.h 125 src/ 126 edlib.cpp 127helloWorld.cpp -> your program 128``` 129 130Since `helloWorld` is a c++ program, we can compile it with just one line: `c++ helloWorld.cpp edlib/src/edlib.cpp -o helloWorld -I edlib/include`. 131 132If hello world was a C program, we would compile it like this: 133``` 134 c++ -c edlib/src/edlib.cpp -o edlib.o -I edlib/include 135 cc -c helloWorld.c -o helloWorld.o -I edlib/include 136 c++ helloWorld.o edlib.o -o helloWorld 137``` 138 139### Approach #2: Copying edlib header file and static library. 140Instead of copying edlib source files, you could copy static library (check [Building](#building) on how to create static library). We also need to copy edlib header files. We get following project structure: 141``` 142edlib/ -> copied from edlib 143 include/ 144 edlib.h 145 edlib.a 146helloWorld.cpp -> your program 147``` 148 149Now you can compile it with `c++ helloWorld.cpp -o helloWorld -I edlib/include -L edlib -ledlib`. 150 151### Approach #3: Install edlib library on machine. 152Alternatively, you could avoid copying any Edlib files and instead install libraries by running `sudo make install` (check [Building](#building) for exact instructions depending on approach you used for building). Now, all you have to do to compile your project is `c++ helloWorld.cpp -o helloWorld -ledlib`. 153If you get error message like `cannot open shared object file: No such file or directory`, make sure that your linker includes path where edlib was installed. 154 155### Approach #4: Use edlib in your project via CMake. 156If you are using CMake for compilation, we suggest adding edlib as a git submodule with the command `git submodule add https://github.com/martinsos/edlib vendor/edlib`. Afterwards, modify your top level CMakeLists.txt file accordingly: 157``` 158add_subdirectory(vendor/edlib EXCLUDE_FROM_ALL) 159target_link_libraries(your_exe edlib) # or target_link_libraries(your_exe edlib) 160``` 161The `add_subdirectory` command adds a folder to the build tree, meaning it will run CMakeLists.txt from the included folder as well. Flag `EXCLUDE_FROM_ALL` disables building (and instalment) of targets in the added folder which are not needed in your project. In the above example only the (static) library `edlib` will be build, while `edlib-aligner`, `hello_world` and the rest won't. In order to access the `edlib` API, add `#include "edlib.h"` in your source file (CMake will automatically update your include path). 162 163 164For more example projects take a look at applications in [apps/](apps/). 165 166 167## Usage and examples 168Main function in edlib is `edlibAlign`. Given two sequences (and their lengths), it will find edit distance, alignment path or its end and start locations. 169 170```c 171char* query = "ACCTCTG"; 172char* target = "ACTCTGAAA" 173EdlibAlignResult result = edlibAlign(query, 7, target, 9, edlibDefaultAlignConfig()); 174if (result.status == EDLIB_STATUS_OK) { 175 printf("%d", result.editDistance); 176} 177edlibFreeAlignResult(result); 178``` 179 180NOTE: One character is expected to occupy one char/byte, meaning that characters spanning multiple chars/bytes are not supported. As long as your alphabet size is <= 256 you can manually map it to numbers/chars from 0 to 255 and solve this that way, but if its size is > 256 then you will not be able to use Edlib. 181 182### Configuring edlibAlign() 183`edlibAlign` takes configuration object (it is a struct `EdlibAlignConfig`), which allows you to further customize how alignment will be done. You can choose [alignment method](#alignment-methods), tell edlib what to calculate (just edit distance or also path and locations) and set upper limit for edit distance. 184 185For example, if you want to use infix(HW) alignment method, want to find alignment path (and edit distance), are interested in result only if edit distance is not larger than 42 and do not want to extend character equality definition, you would call it like this: 186```c 187edlibAlign(seq1, seq1Length, seq2, seq2Length, 188 edlibNewAlignConfig(42, EDLIB_MODE_HW, EDLIB_TASK_PATH, NULL, 0)); 189``` 190Or, if you want to use suffix(SHW) alignment method, want to find only edit distance, do not have any limits on edit distance and want character '?' to match both itself and characters 'X' and 'Y', you would call it like this: 191```c 192EdlibEqualityPair additionalEqualities[2] = {{'?', 'X'}, {'?', 'Y'}}; 193edlibAlign(seq1, seq1Length, seq2, seq2Length, 194 edlibNewAlignConfig(-1, EDLIB_MODE_SHW, EDLIB_TASK_DISTANCE, additionalEqualities, 2)); 195``` 196 197We used `edlibNewAlignConfig` helper function to easily create config, however we could have also just created an instance of it and set its members accordingly. 198 199### Handling result of edlibAlign() 200`edlibAlign` function returns a result object (`EdlibAlignResult`), which will contain results of alignment (corresponding to the task that you passed in config). 201 202```c 203EdlibAlignResult result = edlibAlign(seq1, seq1Length, seq2, seq2Length, 204 edlibNewAlignConfig(-1, EDLIB_MODE_HW, EDLIB_TASK_PATH, NULL, 0)); 205if (result.status == EDLIB_STATUS_OK) { 206 printf("%d\n", result.editDistance); 207 printf("%d\n", result.alignmentLength); 208 printf("%d\n", result.endLocations[0]); 209} 210edlibFreeAlignResult(result); 211``` 212 213It is important to remember to free the result object using `edlibFreeAlignResult` function, since Edlib allocates memory on heap for certain members. If you decide to do the cleaning manually and not use `edlibFreeAlignResult`, do not forget to manually `free()` required members. 214 215### Turning alignment to cigar 216Cigar is a standard way to represent alignment path. 217Edlib has helper function that transforms alignment path into cigar. 218```c 219char* cigar = edlibAlignmentToCigar(result.alignment, result.alignmentLength, EDLIB_CIGAR_STANDARD); 220printf("%s", cigar); 221free(cigar); 222``` 223 224## API documentation 225 226For complete documentation of Edlib library API, visit [http://martinsos.github.io/edlib](https://martinsos.github.io/edlib) (should be updated to the latest release). 227 228To generate the latest API documentation yourself from the source, you need to have [doxygen](www.doxygen.org) installed. 229Position yourself in the root directory and run `doxygen`, this will generate `docs/` directory. Then open `docs/html/index.html` file with you favorite browser. 230 231Alternatively, you can directly check [edlib.h](edlib/include/edlib.h). 232 233## Alignment methods 234 235Edlib supports 3 alignment methods: 236* **global (NW)** - This is the standard method, when we say "edit distance" this is the method that is assumed. 237 It tells us the smallest number of operations needed to transform first sequence into second sequence. 238 *This method is appropriate when you want to find out how similar is first sequence to second sequence.* 239* **prefix (SHW)** - Similar to global method, but with a small twist - gap at query end is not penalized. What that means is that deleting elements from the end of second sequence is "free"! 240 For example, if we had `AACT` and `AACTGGC`, edit distance would be 0, because removing `GGC` from the end of second sequence is "free" and does not count into total edit distance. 241 *This method is appropriate when you want to find out how well first sequence fits at the beginning of second sequence.* 242* **infix (HW)**: Similar as prefix method, but with one more twist - gaps at query end **and start** are not penalized. What that means is that deleting elements from the start and end of second sequence is "free"! 243 For example, if we had `ACT` and `CGACTGAC`, edit distance would be 0, because removing `CG` from the start and `GAC` from the end of second sequence is "free" and does not count into total edit distance. 244 *This method is appropriate when you want to find out how well first sequence fits at any part of second sequence.* For example, if your second sequence was a long text and your first sequence was a sentence from that text, but slightly scrambled, you could use this method to discover how scrambled it is and where it fits in that text. 245 *In bioinformatics, this method is appropriate for aligning read to a sequence.* 246 247 248## Aligner 249Edlib comes with a standalone aligner cli app, which can be found at [apps/aligner/](apps/aligner). 250 251![Edlib aligner screenshot](images/edlib-aligner-screenshot.png) 252 253Aligner reads sequences from fasta files, and it can display alignment path in graphical manner or as a cigar. 254It also measures calculation time, so it can be useful for testing speed and comparing Edlib with other tools. 255 256Check [Building](#building) to see how to build binaries (including `edlib-aligner`). 257Run `./build/bin/edlib-aligner` with no params for help and detailed instructions. 258 259Example of usage: 260`./build/bin/edlib-aligner -p apps/aligner/test_data/query.fasta apps/aligner/test_data/target.fasta` 261 262**NOTE**: Aligner currently does not work on Windows, because it uses `getopt` to parse command line arguments, which is not supported on Windows. 263 264 265## Running tests 266Check [Building](#building) to see how to build binaries (including binary `runTests`). 267To run tests, just run `./runTests`. This will run random tests for each alignment method, and also some specific unit tests. 268 269 270## Time and space complexity 271Edlib is based on [Myers's bit-vector algorithm](http://www.gersteinlab.org/courses/452/09-spring/pdf/Myers.pdf) and extends from it. 272It calculates a dynamic programming matrix of dimensions `Q x T`, where `Q` is the length of the first sequence (query), and `T` is the length of the second sequence (target). It uses Ukkonen's banded algorithm to reduce the space of search, and there is also parallelization from Myers's algorithm, however time complexity is still quadratic. 273Edlib uses Hirschberg's algorithm to find alignment path, therefore space complexity is linear. 274 275Time complexity: `O(T * Q)`. 276 277Space complexity: `O(T + Q)`. 278 279It is worth noting that Edlib works best for large, similar sequences, since such sequences get the highest speedup from banded approach and bit-vector parallelization. 280 281 282## Test data 283In [test_data/](test_data) directory there are different genome sequences, ranging from 10 kbp to 5 Mbp in length. They are ranging in length and similarity, so they can be useful for testing and measuring speed in different scenarios. 284 285 286## Development and contributing 287Feel free to send pull requests and raise issues. 288 289When developing, you may want to use `-D CMAKE_BUILD_TYPE=Debug` flag when calling `cmake` in order to get debugging flags passed to compiler. This should also happen if you just run `cmake ..` with no flags, but I think I have noticed it does not always works as expected (probably has something to do with cmake cache). To check which flags is compiler using, run `make` with `VERBOSE=1`: `make VERBOSE=1`. 290 291 292## Publication 293 294Martin Šošić, Mile Šikić; Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance. Bioinformatics 2017 btw753. doi: [10.1093/bioinformatics/btw753](https://doi.org/10.1093/bioinformatics/btw753) 295 296 297## Acknowledgements 298 299Mile Šikić (@msikic) - Mentoring and guidance through whole project. 300 301Ivan Sović (@isovic) - Help with testing and prioritizing features, valuable comments on the manuscript. 302 303## FAQ 304 305### What do terms NW, HW and SHW stand for? 306NW stands for Needleman-Wunsch, HW for Hybrid Wunsch, and SHW for Semi Hybrid Wunsch. While NW is a common abbreviation, HW and SHW abbreviations were made up at the very start of this project to describe additional modes of alignment. Later we started using terms "global", "infix" and "prefix" more, as they describe the modes better, but terms NW, HW and SHW are still very present in the project. 307