[![Build Status](https://travis-ci.org/torognes/vsearch.svg?branch=master)](https://travis-ci.org/torognes/vsearch)

# VSEARCH

## Introduction

The aim of this project is to create an alternative to the [USEARCH](https://www.drive5.com/usearch/) tool developed by Robert C. Edgar (2010). The new tool should:

* have open source code with an appropriate open source license
* be free of charge, gratis
* have a 64-bit design that handles very large databases and much more than 4GB of memory
* be as accurate as or more accurate than usearch
* be as fast as or faster than usearch

We have implemented a tool called VSEARCH which supports *de novo* and reference based chimera detection, clustering, full-length and prefix dereplication, rereplication, reverse complementation, masking, all-vs-all pairwise global alignment, exact and global alignment searching, shuffling, subsampling and sorting. It also supports FASTQ file analysis, filtering, conversion and merging of paired-end reads.

VSEARCH stands for vectorized search, as the tool takes advantage of parallelism in the form of SIMD vectorization as well as multiple threads to perform accurate alignments at high speed. VSEARCH uses an optimal global aligner (full dynamic programming Needleman-Wunsch), in contrast to USEARCH which by default uses a heuristic seed and extend aligner. This usually results in more accurate alignments and overall improved sensitivity (recall) with VSEARCH, especially for alignments with gaps.

[VSEARCH binaries](https://github.com/torognes/vsearch/releases/latest) are provided for GNU/Linux on three 64-bit processor architectures: x86-64, POWER8 (ppc64le) and ARMv8 (aarch64). Binaries are also provided for macOS (version 10.9 Mavericks or later) on Intel (x86-64) and Apple Silicon (ARMv8), as well as Windows (64-bit, version 7 or higher, on x86_64). VSEARCH contains dedicated SIMD code for the three processor architectures (SSE2/SSSE3, AltiVec/VMX/VSX, Neon).

| CPU \ OS      | GNU/Linux     | macOS  | Windows   |
| ------------- | :-----------: | :----: | :-------: |
| x86_64        |  ✔            |  ✔     |  ✔        |
| ARMv8         |  ✔            |  ✔     |           |
| POWER8        |  ✔            |        |           |

Various packages, plugins and wrappers are also available from other sources - see [below](https://github.com/torognes/vsearch#packages-plugins-and-wrappers).

The source code compiles correctly with `gcc` (versions 4.8.5 to 10.2)
and `llvm-clang` (3.8 to 13.0). The source code should also compile on
[FreeBSD](https://www.freebsd.org/) and
[NetBSD](https://www.netbsd.org/) systems.

VSEARCH can directly read input query and database files that are compressed using gzip and bzip2 (.gz and .bz2) if the zlib and bzip2 libraries are available.

Most of the nucleotide based commands and options in USEARCH version 7 are supported, as well as some in version 8. The same option names as in USEARCH version 7 have been used in order to make VSEARCH an almost drop-in replacement. VSEARCH does not support amino acid sequences or local alignments. These features may be added in the future.

## Getting Help

If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.17.1/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.

## Example

In the example below, VSEARCH will identify sequences in the file database.fsa that are at least 90% identical on the plus strand to the query sequences in the file queries.fsa and write the results to the file alnout.txt.

`./vsearch --usearch_global queries.fsa --db database.fsa --id 0.9 --alnout alnout.txt`
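
To try the command without real data, the following sketch builds two tiny FASTA files in a scratch directory and runs the same search on them. The sequences and the `workdir` variable are made up for illustration, and the search step is skipped when no `vsearch` executable is found on the `PATH` (the example above runs `./vsearch` from the current directory instead).

```sh
# Work in a scratch directory so existing files are not overwritten.
workdir=$(mktemp -d)

# A one-sequence database (hypothetical sequence, for illustration only).
cat > "$workdir/database.fsa" <<'EOF'
>ref1
ATGCGTACCTGAAGTCCGATTGACGGCTAAGTCTGCAATCGGTACCATGGACTTAGCCGT
EOF

# Two queries: q1 is identical to ref1, q2 is a different sequence.
cat > "$workdir/queries.fsa" <<'EOF'
>q1
ATGCGTACCTGAAGTCCGATTGACGGCTAAGTCTGCAATCGGTACCATGGACTTAGCCGT
>q2
TTGACCAGTAGGCTTACAGGATCCTAGTTGCAAGGCTTCGAATCGTACTGGACCTTAGAA
EOF

# Same options as in the example above; only the file paths differ.
if command -v vsearch >/dev/null 2>&1; then
    vsearch --usearch_global "$workdir/queries.fsa" \
            --db "$workdir/database.fsa" \
            --id 0.9 \
            --alnout "$workdir/alnout.txt"
fi
```

Since `q1` is identical to `ref1`, it should be reported as a match in `$workdir/alnout.txt`.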

## Download and install

**Source distribution** To download the source distribution from a [release](https://github.com/torognes/vsearch/releases) and build the executable and the documentation, use the following commands:

```
wget https://github.com/torognes/vsearch/archive/v2.17.1.tar.gz
tar xzf v2.17.1.tar.gz
cd vsearch-2.17.1
./autogen.sh
./configure
make
make install  # as root or sudo make install
```

You may customize the installation directory using the `--prefix=DIR` option to `configure`. If the compression libraries [zlib](https://www.zlib.net) and/or [bzip2](https://www.sourceware.org/bzip2/) are installed on the system, they will be detected automatically and support for compressed files will be included in vsearch. Support for compressed files may be disabled using the `--disable-zlib` and `--disable-bzip2` options to `configure`. A PDF version of the manual will be created from the `vsearch.1` manual file if `ps2pdf` is available, unless disabled using the `--disable-pdfman` option to `configure`. Other options may also be applied to `configure`; run `configure -h` to see them all. GNU autoconf (version 2.63 or later), automake and the GCC C++ compiler are required to build vsearch.
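
As an illustration of the `configure` options described above, the sketch below installs into a user-owned prefix and skips the PDF manual. The `$HOME/.local` default and the dry-run fallback are choices made for this example, not project recommendations.

```sh
# Install under a user-owned prefix so that `make install` needs no root access.
# The default of $HOME/.local is an assumption made for this example.
prefix="${PREFIX:-$HOME/.local}"

# Options taken from the paragraph above: custom prefix, no PDF manual.
set -- --prefix="$prefix" --disable-pdfman

if [ -x ./configure ]; then
    # Inside the source tree (after ./autogen.sh): configure, build, install.
    ./configure "$@" && make && make install
else
    # Elsewhere: only show the command that would be run.
    echo "would run: ./configure $*"
fi
```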

The Windows binary was compiled using the [Mingw-w64](http://mingw-w64.org/) C++ cross-compiler.

**Cloning the repo** Instead of downloading the source distribution as a compressed archive, you could clone the repo and build it as shown below. The options to `configure` as described above are still valid.

```
git clone https://github.com/torognes/vsearch.git
cd vsearch
./autogen.sh
./configure
make
make install  # as root or sudo make install
```

**Binary distribution** Starting with version 1.4.0, binary distribution files containing pre-compiled binaries as well as the documentation will be made available as part of each [release](https://github.com/torognes/vsearch/releases). The executables include support for input files compressed by zlib and bzip2 (with files usually ending in `.gz` or `.bz2`).

Binary distributions are provided for x86-64 systems running GNU/Linux, macOS (version 10.7 or higher) or Windows (64-bit, version 7 or higher), 64-bit ARMv8 (aarch64) systems running GNU/Linux or macOS, as well as POWER8 (ppc64le) systems running GNU/Linux.

Download the appropriate executable for your system using the following commands if you are using a Linux x86_64 system:

```sh
wget https://github.com/torognes/vsearch/releases/download/v2.17.1/vsearch-2.17.1-linux-x86_64.tar.gz
tar xzf vsearch-2.17.1-linux-x86_64.tar.gz
```

Or these commands if you are using a Linux ppc64le system:

```sh
wget https://github.com/torognes/vsearch/releases/download/v2.17.1/vsearch-2.17.1-linux-ppc64le.tar.gz
tar xzf vsearch-2.17.1-linux-ppc64le.tar.gz
```

Or these commands if you are using a Linux aarch64 system:

```sh
wget https://github.com/torognes/vsearch/releases/download/v2.17.1/vsearch-2.17.1-linux-aarch64.tar.gz
tar xzf vsearch-2.17.1-linux-aarch64.tar.gz
```

Or these commands if you are using a Mac:

```sh
wget https://github.com/torognes/vsearch/releases/download/v2.17.1/vsearch-2.17.1-macos-x86_64.tar.gz
tar xzf vsearch-2.17.1-macos-x86_64.tar.gz
```

Or if you are using Windows, download and extract (unzip) the contents of this file:

```
https://github.com/torognes/vsearch/releases/download/v2.17.1/vsearch-2.17.1-win-x86_64.zip
```
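
The per-platform choice among the `.tar.gz` archives above can be sketched as a small shell function keyed on the output of `uname`. The function name `vsearch_archive` is made up for this example, and it only knows the archive names listed above; any other platform is reported as unlisted instead of guessing a file name.

```sh
# Print the release archive name for a given OS and CPU, using the file
# names listed above. Defaults to the machine the script is running on.
vsearch_archive() {
    version="2.17.1"
    os="${1:-$(uname -s)}"
    cpu="${2:-$(uname -m)}"
    case "$os/$cpu" in
        Linux/x86_64)  echo "vsearch-$version-linux-x86_64.tar.gz" ;;
        Linux/ppc64le) echo "vsearch-$version-linux-ppc64le.tar.gz" ;;
        Linux/aarch64) echo "vsearch-$version-linux-aarch64.tar.gz" ;;
        Darwin/x86_64) echo "vsearch-$version-macos-x86_64.tar.gz" ;;
        *) echo "no archive listed above for $os/$cpu" >&2; return 1 ;;
    esac
}

# Example: print the download URL for the current machine, if it is listed.
if archive=$(vsearch_archive); then
    echo "https://github.com/torognes/vsearch/releases/download/v2.17.1/$archive"
fi
```

The printed URL can then be passed to `wget` and the archive unpacked with `tar xzf` as shown above.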

Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.17.1-linux-x86_64` or `vsearch-2.17.1-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
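
The symbolic-link recommendation for Linux and Mac can be sketched as follows. The function name `link_vsearch` and the `$HOME/bin` and `$HOME/man` folders in the usage comment are assumptions for this example. The `man1` subfolder is also an addition: it is the conventional place where `man` looks for section 1 pages inside a `$MANPATH` folder.

```sh
# Link the vsearch binary and man page from an unpacked binary distribution.
# Usage: link_vsearch DIST_DIR BIN_DIR MAN_DIR
link_vsearch() {
    dist=$1    # unpacked distribution folder; use an absolute path
    bindir=$2  # a folder listed in $PATH
    mandir=$3  # a folder listed in $MANPATH
    mkdir -p "$bindir" "$mandir/man1" || return 1
    ln -sf "$dist/bin/vsearch" "$bindir/vsearch" &&
    ln -sf "$dist/man/vsearch.1" "$mandir/man1/vsearch.1"
}

# Example call (hypothetical target folders):
# link_vsearch "$PWD/vsearch-2.17.1-linux-x86_64" "$HOME/bin" "$HOME/man"
```

An absolute path for the distribution folder keeps the links valid regardless of the directory they are resolved from.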

Windows: You will now have the binary distribution in a folder called `vsearch-2.17.1-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.


**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A PDF version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.17.1/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.17.1/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).


## Packages, plugins, and wrappers

**Conda package** Thanks to the [BioConda](https://bioconda.github.io/) team, there is now a [vsearch package](https://anaconda.org/bioconda/vsearch) in [Conda](https://conda.io/).

**Debian package** Thanks to the [Debian Med](https://www.debian.org/devel/debian-med/) team, there is now a [vsearch](https://packages.debian.org/sid/vsearch) package in [Debian](https://www.debian.org/).

**FreeBSD ports package** Thanks to [Jason Bacon](https://github.com/outpaddling), a [vsearch](https://www.freebsd.org/cgi/ports.cgi?query=vsearch&stype=all) [FreeBSD ports](https://www.freebsd.org/ports/) package is available. Install the binary package with `pkg install vsearch`, or build from source with additional optimizations.

**Galaxy wrapper** Thanks to the work of the [Intergalactic Utilities Commission](https://wiki.galaxyproject.org/IUC) members, vsearch is now part of the [Galaxy ToolShed](https://toolshed.g2.bx.psu.edu/view/iuc/vsearch/).

**Homebrew package** Thanks to [Torsten Seemann](https://github.com/tseemann), a [vsearch package](https://formulae.brew.sh/formula/vsearch) for [Homebrew](http://brew.sh/) has been made.

**Pkgsrc package** Thanks to [Jason Bacon](https://github.com/outpaddling), a vsearch [pkgsrc](https://www.pkgsrc.org) package is available for NetBSD and other UNIX-like systems. Install the binary package with `pkgin install vsearch`, or build from source with additional optimizations.

**QIIME 2 plugin** Thanks to the [QIIME 2](https://github.com/qiime2) team, there is now a plugin called [q2-vsearch](https://github.com/qiime2/q2-vsearch) for [QIIME 2](https://qiime2.org).


## Converting output to a biom file for use in QIIME and other software

With the `from-uc` command in [biom](http://biom-format.org/) 2.1.5 or later, it is possible to convert data in a `.uc` file produced by vsearch into a biom file that can be read by QIIME and other software. It is described [here](https://gist.github.com/gregcaporaso/f3c042e5eb806349fa18).

Please note that vsearch version 2.2.0 and later are able to directly output OTU tables in biom 1.0 format as well as the classic and mothur formats.


## Implementation details and initial assessment

Please see the paper for details:

Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584.
doi: [10.7717/peerj.2584](https://doi.org/10.7717/peerj.2584)


## Dependencies

When compiling VSEARCH the header files for the following two optional libraries are required if support for gzip and bzip2 compressed FASTA and FASTQ input files is needed:

* libz (zlib library) (zlib.h header file) (optional)
* libbz2 (bzip2 library) (bzlib.h header file) (optional)

VSEARCH will automatically check whether these libraries are available and load them dynamically.

On Windows these libraries are called zlib1.dll and bz2.dll.

Unfortunately, VSEARCH will not work properly with all the different variants of the `zlib1.dll` file on Windows. One that works well is provided by the MinGW-w64 project and is found in the `bin` folder within the [zlib-1.2.5-bin-x64.zip](https://sourceforge.net/projects/mingw-w64/files/External%20binary%20packages%20%28Win64%20hosted%29/Binaries%20%2864-bit%29/zlib-1.2.5-bin-x64.zip) archive available on SourceForge. The MD5 of the `zlib1.dll` file should be `0f67ee0b965d3d29388c238aebcf60bc`.
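
One way to check the digest quoted above is sketched below. It assumes a POSIX-style shell with `md5sum` (for example MSYS2 or Git Bash on Windows) or the BSD `md5` tool. The function name `md5_matches` is made up for this example, and the check only runs if a `zlib1.dll` file is present in the current folder.

```sh
# Compare a file's MD5 digest with an expected value.
# Usage: md5_matches FILE EXPECTED_HEX
md5_matches() {
    if command -v md5sum >/dev/null 2>&1; then
        actual=$(md5sum "$1" | cut -d ' ' -f 1)   # GNU coreutils style tools
    else
        actual=$(md5 -q "$1")                     # macOS / BSD fallback
    fi
    [ "$actual" = "$2" ]
}

# Digest quoted above for the zlib1.dll known to work with VSEARCH.
expected="0f67ee0b965d3d29388c238aebcf60bc"
if [ -f zlib1.dll ]; then
    if md5_matches zlib1.dll "$expected"; then
        echo "zlib1.dll matches the expected MD5"
    else
        echo "zlib1.dll differs from the expected MD5"
    fi
fi
```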

To create the PDF file with the manual, the ps2pdf tool is required. It is part of the ghostscript package.


## VSEARCH license and third party licenses

The VSEARCH code is dual-licensed either under the GNU General Public License version 3 or under the BSD 2-clause license. Please see LICENSE.txt for details.

VSEARCH includes code from several other projects. We thank the authors for making their source code available.

VSEARCH includes code from Google's [CityHash project](https://github.com/google/cityhash) by Geoff Pike and Jyrki Alakuijala, providing some excellent hash functions available under an MIT license.

VSEARCH includes code derived from Tatusov and Lipman's DUST program that is in the public domain.

VSEARCH includes public domain code written by Alexander Peslyak for the MD5 message digest algorithm.

VSEARCH includes public domain code written by Steve Reid and others for the SHA1 message digest algorithm.

The VSEARCH distribution includes code from GNU Autoconf which normally is available under the GNU General Public License, but may be distributed with the special autoconf configure script exception.

VSEARCH may include code from the [zlib](https://www.zlib.net) library copyright Jean-loup Gailly and Mark Adler, distributed under the [zlib license](https://www.zlib.net/zlib_license.html).

VSEARCH may include code from the [bzip2](https://www.sourceware.org/bzip2/) library copyright Julian R. Seward, distributed under a BSD-style license.


## Code

The code is written in C++, but most of it is essentially C with some C++ syntax conventions.

File | Description
---|---
**align.cc** | New Needleman-Wunsch global alignment, serial. Only for testing.
**align_simd.cc** | SIMD parallel global alignment of 1 query with 8 database sequences
**allpairs.cc** | All-vs-all optimal global pairwise alignment (no heuristics)
**arch.cc** | Architecture specific code (Mac/Linux)
**attributes.cc** | Extraction and printing of attributes in FASTA headers
**bitmap.cc** | Implementation of bitmaps
**chimera.cc** | Chimera detection
**city.cc** | CityHash code
**cluster.cc** | Clustering (cluster\_fast and cluster\_smallmem)
**cpu.cc** | Code dependent on specific cpu features (e.g. ssse3)
**db.cc** | Handles the database file read, access etc
**dbhash.cc** | Database hashing for exact searches
**dbindex.cc** | Indexes the database by identifying unique kmers in the sequences
**derep.cc** | Dereplication
**dynlibs.cc** | Dynamic loading of compression libraries
**eestats.cc** | Produce statistics for fastq_eestats command
**fasta.cc** | FASTA file parser
**fastq.cc** | FASTQ file parser
**fastqjoin.cc** | FASTQ paired-end reads joining
**fastqops.cc** | FASTQ file statistics etc
**fastx.cc** | Detection of FASTA and FASTQ files, wrapper for FASTA and FASTQ parsers
**filter.cc** | Trimming and filtering of sequences in FASTA and FASTQ files
**getseq.cc** | Extraction of sequences based on header labels
**kmerhash.cc** | Hash for kmers used by paired-end read merger
**linmemalign.cc** | Linear memory global sequence aligner
**maps.cc** | Various character mapping arrays
**mask.cc** | Masking (DUST)
**md5.c** | MD5 message digest
**mergepairs.cc** | Paired-end read merging
**minheap.cc** | A minheap implementation for the list of top kmer matches
**msa.cc** | Simple multiple sequence alignment and consensus sequence computation for clusters
**orient.cc** | Orient direction of sequences based on reference database
**otutable.cc** | Generate OTU tables in various formats
**rerep.cc** | Rereplication
**results.cc** | Output results in various formats (alnout, userout, blast6, uc)
**search.cc** | Implements search using global alignment
**searchcore.cc** | Core search functions for searching, clustering and chimera detection
**searchexact.cc** | Exact search functions
**sffconvert.cc** | SFF to FASTQ file conversion
**sha1.c** | SHA1 message digest
**showalign.cc** | Output an alignment in a human-readable way given a CIGAR-string and the sequences
**shuffle.cc** | Shuffle sequences
**sortbylength.cc** | Code for sorting by length
**sortbysize.cc** | Code for sorting by size (abundance)
**subsample.cc** | Subsampling reads from a FASTA file
**udb.cc** | UDB database file handling
**unique.cc** | Find unique kmers in a sequence
**userfields.cc** | Code for parsing the userfields option argument
**util.cc** | Various common utility functions
**vsearch.cc** | Main program file, general initialization, reads arguments and parses options, writes info.
**xstring.h** | Code for a simple string class

VSEARCH may be compiled with zlib or bzip2 integration that allows it to read compressed FASTA files. The [zlib](http://www.zlib.net/) and the [bzip2](https://www.sourceware.org/bzip2/) libraries are needed for this.


## Bugs

All bug reports are highly appreciated.
You may submit a bug report here on GitHub as an [issue](https://github.com/torognes/vsearch/issues),
you could post a message on the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum)
or you could send an email to [torognes@ifi.uio.no](mailto:torognes@ifi.uio.no?subject=bug_in_vsearch).


## Limitations

VSEARCH is designed for rather short sequences, and will be slow when sequences are longer than about 5,000 bp. This is because it always performs optimal global alignment on selected sequences.
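
A quick way to see whether a dataset is affected is to count the sequences above that length before running VSEARCH. The helper below is a sketch using standard `awk`; the function name `count_long_seqs` is made up for this example, and the 5,000 bp default is the approximate figure quoted above, not a hard cutoff.

```sh
# Count the FASTA records whose sequence is longer than a threshold.
# Usage: count_long_seqs FILE [MAX_LENGTH]   (MAX_LENGTH defaults to 5000)
count_long_seqs() {
    awk -v max="${2:-5000}" '
        /^>/ { if (len > max) n++; len = 0; next }   # header: close previous record
             { len += length($0) }                   # sequence line: add its length
        END  { if (len > max) n++; print n + 0 }     # close the last record
    ' "$1"
}

# Example call (file name taken from the example section):
# count_long_seqs queries.fsa
```

Sequences that span several lines are handled by summing the line lengths between two header lines.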


## The VSEARCH team

The main contributors to VSEARCH:

* Torbjørn Rognes <torognes@ifi.uio.no> (Coding, testing, documentation, evaluation)
* Frédéric Mahé <mahe@rhrk.uni-kl.de> (Documentation, testing, feature suggestions)
* Tomáš Flouri <tomas.flouri@h-its.org> (Coding, testing)
* Christopher Quince <c.quince@warwick.ac.uk> (Initiator, feature suggestions, evaluation)
* Ben Nichols <b.nichols.1@research.gla.ac.uk> (Evaluation)


## Acknowledgements

Special thanks to the following people for patches, suggestions, computer access etc:

* Davide Albanese
* Colin Brislawn
* Jeff Epler
* Christopher M. Sullivan
* Andreas Tille
* Sarah Westcott

## Citing VSEARCH

Please cite the following publication if you use VSEARCH:

Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584.
doi: [10.7717/peerj.2584](https://doi.org/10.7717/peerj.2584)

Please note that citing any of the underlying algorithms, e.g. UCHIME, may also be appropriate.

## Test datasets

Test datasets (found in the separate vsearch-data repository) were
obtained from
the BioMarks project (Logares et al. 2014),
the [TARA OCEANS project](https://oceans.taraexpeditions.org/en/) (Karsenti et al. 2011)
and the [Protist Ribosomal Reference Database (PR<sup>2</sup>)](https://github.com/pr2database/pr2database) (Guillou et al. 2013).


## References

* Edgar RC (2010)
**Search and clustering orders of magnitude faster than BLAST.**
*Bioinformatics*, 26 (19): 2460-2461.
doi:[10.1093/bioinformatics/btq461](https://doi.org/10.1093/bioinformatics/btq461)

* Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R (2011)
**UCHIME improves sensitivity and speed of chimera detection.**
*Bioinformatics*, 27 (16): 2194-2200.
doi:[10.1093/bioinformatics/btr381](https://doi.org/10.1093/bioinformatics/btr381)

* Guillou L, Bachar D, Audic S, Bass D, Berney C, Bittner L, Boutte C, Burgaud G, de Vargas C, Decelle J, del Campo J, Dolan J, Dunthorn M, Edvardsen B, Holzmann M, Kooistra W, Lara E, Lebescot N, Logares R, Mahé F, Massana R, Montresor M, Morard R, Not F, Pawlowski J, Probert I, Sauvadet A-L, Siano R, Stoeck T, Vaulot D, Zimmermann P & Christen R (2013)
**The Protist Ribosomal Reference database (PR2): a catalog of unicellular eukaryote Small Sub-Unit rRNA sequences with curated taxonomy.**
*Nucleic Acids Research*, 41 (D1), D597-D604.
doi:[10.1093/nar/gks1160](https://doi.org/10.1093/nar/gks1160)

* Karsenti E, González Acinas S, Bork P, Bowler C, de Vargas C, Raes J, Sullivan M B, Arendt D, Benzoni F, Claverie J-M, Follows M, Jaillon O, Gorsky G, Hingamp P, Iudicone D, Kandels-Lewis S, Krzic U, Not F, Ogata H, Pesant S, Reynaud E G, Sardet C, Sieracki M E, Speich S, Velayoudon D, Weissenbach J, Wincker P & the Tara Oceans Consortium (2011)
**A holistic approach to marine eco-systems biology.**
*PLoS Biology*, 9(10), e1001177.
doi:[10.1371/journal.pbio.1001177](https://doi.org/10.1371/journal.pbio.1001177)

* Logares R, Audic S, Bass D, Bittner L, Boutte C, Christen R, Claverie J-M, Decelle J, Dolan J R, Dunthorn M, Edvardsen B, Gobet A, Kooistra W H C F, Mahé F, Not F, Ogata H, Pawlowski J, Pernice M C, Romac S, Shalchian-Tabrizi K, Simon N, Stoeck T, Santini S, Siano R, Wincker P, Zingone A, Richards T, de Vargas C & Massana R (2014)
**The patterning of rare and abundant community assemblages in coastal marine-planktonic microbial eukaryotes.**
*Current Biology*, 24(8), 813-821.
doi:[10.1016/j.cub.2014.02.050](https://doi.org/10.1016/j.cub.2014.02.050)

* Rognes T (2011)
**Faster Smith-Waterman database searches by inter-sequence SIMD parallelisation.**
*BMC Bioinformatics*, 12: 221.
doi:[10.1186/1471-2105-12-221](https://doi.org/10.1186/1471-2105-12-221)