1============================== 2Mappy: Minimap2 Python Binding 3============================== 4 5Mappy provides a convenient interface to `minimap2 6<https://github.com/lh3/minimap2>`_, a fast and accurate C program to align 7genomic and transcribe nucleotide sequences. 8 9Installation 10------------ 11 12Mappy depends on `zlib <http://zlib.net>`_. It can be installed with `pip 13<https://en.wikipedia.org/wiki/Pip_(package_manager)>`_: 14 15.. code:: shell 16 17 pip install --user mappy 18 19or from the minimap2 github repo (`Cython <http://cython.org>`_ required): 20 21.. code:: shell 22 23 git clone https://github.com/lh3/minimap2 24 cd minimap2 25 python setup.py install 26 27Usage 28----- 29 30The following Python script demonstrates the key functionality of mappy: 31 32.. code:: python 33 34 import mappy as mp 35 a = mp.Aligner("test/MT-human.fa") # load or build index 36 if not a: raise Exception("ERROR: failed to load/build index") 37 s = a.seq("MT_human", 100, 200) # retrieve a subsequence from the index 38 print(mp.revcomp(s)) # reverse complement 39 for name, seq, qual in mp.fastx_read("test/MT-orang.fa"): # read a fasta/q sequence 40 for hit in a.map(seq): # traverse alignments 41 print("{}\t{}\t{}\t{}".format(hit.ctg, hit.r_st, hit.r_en, hit.cigar_str)) 42 43APIs 44---- 45 46Mappy implements two classes and two global function. 47 48Class mappy.Aligner 49~~~~~~~~~~~~~~~~~~~ 50 51.. code:: python 52 53 mappy.Aligner(fn_idx_in=None, preset=None, ...) 54 55This constructor accepts the following arguments: 56 57* **fn_idx_in**: index or sequence file name. Minimap2 automatically tests the 58 file type. If a sequence file is provided, minimap2 builds an index. The 59 sequence file can be optionally gzip'd. This option has no effect if **seq** 60 is set. 61 62* **seq**: a single sequence to index. The sequence name will be set to 63 :code:`N/A`. 64 65* **preset**: minimap2 preset. Currently, minimap2 supports the following 66 presets: **sr** for single-end short reads; **map-pb** for PacBio 67 read-to-reference mapping; **map-ont** for Oxford Nanopore read mapping; 68 **splice** for long-read spliced alignment; **asm5** for assembly-to-assembly 69 alignment; **asm10** for full genome alignment of closely related species. Note 70 that the Python module does not support all-vs-all read overlapping. 71 72* **k**: k-mer length, no larger than 28 73 74* **w**: minimizer window size, no larger than 255 75 76* **min_cnt**: mininum number of minimizers on a chain 77 78* **min_chain_score**: minimum chaing score 79 80* **bw**: chaining and alignment band width 81 82* **best_n**: max number of alignments to return 83 84* **n_threads**: number of indexing threads; 3 by default 85 86* **extra_flags**: additional flags defined in minimap.h 87 88* **fn_idx_out**: name of file to which the index is written. This parameter 89 has no effect if **seq** is set. 90 91* **scoring**: scoring system. It is a tuple/list consisting of 4, 6 or 7 92 positive integers. The first 4 elements specify match scoring, mismatch 93 penalty, gap open and gap extension penalty. The 5th and 6th elements, if 94 present, set long-gap open and long-gap extension penalty. The 7th sets a 95 mismatch penalty involving ambiguous bases. 96 97.. code:: python 98 99 mappy.Aligner.map(seq, seq2=None, cs=False, MD=False) 100 101This method aligns :code:`seq` against the index. It is a generator, *yielding* 102a series of :code:`mappy.Alignment` objects. If :code:`seq2` is present, mappy 103performs paired-end alignment, assuming the two ends are in the FR orientation. 104Alignments of the two ends can be distinguished by the :code:`read_num` field 105(see Class mappy.Alignment below). Argument :code:`cs` asks mappy to generate 106the :code:`cs` tag; :code:`MD` is similar. These two arguments might slightly 107degrade performance and are not enabled by default. 108 109.. code:: python 110 111 mappy.Aligner.seq(name, start=0, end=0x7fffffff) 112 113This method retrieves a (sub)sequence from the index and returns it as a Python 114string. :code:`None` is returned if :code:`name` is not present in the index or 115the start/end coordinates are invalid. 116 117.. code:: python 118 119 mappy.Aligner.seq_names 120 121This property gives the array of sequence names in the index. 122 123Class mappy.Alignment 124~~~~~~~~~~~~~~~~~~~~~ 125 126This class describes an alignment. An object of this class has the following 127properties: 128 129* **ctg**: name of the reference sequence the query is mapped to 130 131* **ctg_len**: total length of the reference sequence 132 133* **r_st** and **r_en**: start and end positions on the reference 134 135* **q_st** and **q_en**: start and end positions on the query 136 137* **strand**: +1 if on the forward strand; -1 if on the reverse strand 138 139* **mapq**: mapping quality 140 141* **blen**: length of the alignment, including both alignment matches and gaps 142 but excluding ambiguous bases. 143 144* **mlen**: length of the matching bases in the alignment, excluding ambiguous 145 base matches. 146 147* **NM**: number of mismatches, gaps and ambiguous positions in the alignment 148 149* **trans_strand**: transcript strand. +1 if on the forward strand; -1 if on the 150 reverse strand; 0 if unknown 151 152* **is_primary**: if the alignment is primary (typically the best and the first 153 to generate) 154 155* **read_num**: read number that the alignment corresponds to; 1 for the first 156 read and 2 for the second read 157 158* **cigar_str**: CIGAR string 159 160* **cigar**: CIGAR returned as an array of shape :code:`(n_cigar,2)`. The two 161 numbers give the length and the operator of each CIGAR operation. 162 163* **MD**: the :code:`MD` tag as in the SAM format. It is an empty string unless 164 the :code:`MD` argument is applied when calling :code:`mappy.Aligner.map()`. 165 166* **cs**: the :code:`cs` tag. 167 168An :code:`Alignment` object can be converted to a string with :code:`str()` in 169the following format: 170 171:: 172 173 q_st q_en strand ctg ctg_len r_st r_en mlen blen mapq cg:Z:cigar_str 174 175It is effectively the PAF format without the QueryName and QueryLength columns 176(the first two columns in PAF). 177 178Miscellaneous Functions 179~~~~~~~~~~~~~~~~~~~~~~~ 180 181.. code:: python 182 183 mappy.fastx_read(fn, read_comment=False) 184 185This generator function opens a FASTA/FASTQ file and *yields* a 186:code:`(name,seq,qual)` tuple for each sequence entry. The input file may be 187optionally gzip'd. If :code:`read_comment` is True, this generator yields 188a :code:`(name,seq,qual,comment)` tuple instead. 189 190.. code:: python 191 192 mappy.revcomp(seq) 193 194Return the reverse complement of DNA string :code:`seq`. This function 195recognizes IUB code and preserves the letter cases. Uracil :code:`U` is 196complemented to :code:`A`. 197