===============
Recipes and FAQ
===============

This section gives answers to frequently asked questions. It shows you how to
get Cutadapt to do what you want it to do!


Remove more than one adapter
----------------------------

If you want to remove a 5' and 3' adapter at the same time, :ref:`use the
support for linked adapters <linked-adapters>`.

If your situation is different, for example, when you have many 5' adapters
but only one 3' adapter, then you have two options.

First, you can specify the adapters and also ``--times=2`` (or the short
version ``-n 2``). For example::

    cutadapt -g ^TTAAGGCC -g ^AAGCTTA -a TACGGACT -n 2 -o output.fastq input.fastq

This instructs Cutadapt to run two rounds of adapter finding and removal. That
means that, after the first round and only when an adapter was actually found,
another round is performed. In both rounds, all given adapters are searched and
removed. The problem is that it could happen that one adapter is found twice (so
the 3' adapter, for example, could be removed twice).

The second option is to not use the ``-n`` option, but to run Cutadapt twice,
first removing one adapter and then the other. It is easiest if you use a pipe
as in this example::

    cutadapt -g ^TTAAGGCC -g ^AAGCTTA input.fastq | cutadapt -a TACGGACT - > output.fastq


Trim poly-A tails
-----------------

If you want to trim a poly-A tail from the 3' end of your reads, use the 3'
adapter type (``-a``) with an adapter sequence of many repeated ``A``
nucleotides. Starting with version 1.8 of Cutadapt, you can use the
following notation to specify a sequence that consists of 100 ``A``::

    cutadapt -a "A{100}" -o output.fastq input.fastq

This also works when there are sequencing errors in the poly-A tail.
So this read ::

    TACGTACGTACGTACGAAATAAAAAAAAAAA

will be trimmed to::

    TACGTACGTACGTACG

If for some reason you would like to use a shorter sequence of ``A``, you can
do so: The matching algorithm always picks the leftmost match that it can find,
so Cutadapt will do the right thing even when the tail has more ``A`` than you
used in the adapter sequence. However, sequencing errors may result in shorter
matches than desired. For example, using ``-a "A{10}"``, the read above (where
the ``AAAT`` is followed by eleven ``A``) would be trimmed to::

    TACGTACGTACGTACGAAAT

Depending on your application, perhaps a variant of ``-a A{10}N{90}`` is an
alternative, forcing the match to be located as much to the left as possible,
while still allowing for non-``A`` bases towards the end of the read.


Trim a fixed number of bases after adapter trimming
---------------------------------------------------

If the adapters you want to remove are preceded by some unknown sequence (such
as a random tag/molecular identifier), you can specify this as part of the
adapter sequence in order to remove both in one go.

For example, assume you want to trim Illumina adapters preceded by 10 bases
that you want to trim as well. Instead of this command::

    cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC ...

Use this command::

    cutadapt -O 13 -a N{10}AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC ...

The ``-O 13`` is the minimum overlap for an adapter match, where the 13 is
computed as 3 plus 10 (where 3 is the default minimum overlap and 10 is the
length of the unknown section). If you do not specify it, the adapter sequence
would match the end of every read (because ``N`` matches anything), and ten
bases would then be removed from every read.
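
The overlap arithmetic can also be scripted, so that the ``-O`` value stays in
sync when the length of the unknown section changes. The following is a minimal
sketch: the variable names are hypothetical, the default minimum overlap of 3
is taken from the text above, and the command is only printed, with the
remaining arguments left as ``...``.

```shell
tag_length=10        # length of the unknown section preceding the adapter
default_overlap=3    # default minimum overlap, as stated in the text above
adapter=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

# Minimum overlap = default overlap + number of leading N wildcards
min_overlap=$((default_overlap + tag_length))

# Build the adapter argument, e.g. N{10} followed by the adapter sequence
adapter_arg="N{${tag_length}}${adapter}"

# Print the command instead of running it; input and output options still need to be added
echo "cutadapt -O ${min_overlap} -a ${adapter_arg} ..."
```

With the values above, this prints the same ``-O 13`` command as shown earlier in this section.
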


Trimming (amplicon-) primers from both ends of paired-end reads
---------------------------------------------------------------

If you want to remove primer sequences that flank your sequence of
interest, you should use a :ref:`"linked adapter" <linked-adapters>`
to remove them. If you have paired-end data (with R1 and R2), you
can correctly trim both R1 and R2 by using linked adapters for both
R1 and R2. Here is how to do this.

The full DNA fragment that is put on the sequencer looks like this
(looking only at the forward strand):

    5' sequencing primer -- forward primer -- sequence of interest -- reverse complement of reverse primer -- reverse complement of 3' sequencing primer

Since sequencing of R1 starts after the 5' sequencing primer, R1 will
start with the forward primer and then continue into the sequence of
interest and into the two primers to the right of it, depending on
the read length and how long the sequence of interest is. For R1,
the linked adapter option that needs to be used is therefore ::

    -a FWDPRIMER...RCREVPRIMER

where ``FWDPRIMER`` needs to be replaced with the sequence of your
forward primer and ``RCREVPRIMER`` with the reverse complement of
the reverse primer. The three dots ``...`` need to be entered
as they are -- they tell Cutadapt that this is a linked adapter
with a 5' and a 3' part.

Sequencing of R2 starts before the 3' sequencing primer and
proceeds along the reverse-complementary strand. For the correct
linked adapter, the sequences from above therefore need to be
swapped and reverse-complemented::

    -A REVPRIMER...RCFWDPRIMER

The uppercase ``-A`` specifies that this option is
meant to work on R2. Similar to above, ``REVPRIMER`` is
the sequence of the reverse primer and ``RCFWDPRIMER`` is the
reverse-complement of the forward primer.
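
One way to obtain the reverse-complemented sequences is a small shell function.
The following is a minimal sketch, assuming the ``rev`` and ``tr`` utilities
are available. The function name ``revcomp`` and the example primer sequence
are hypothetical, and only the unambiguous bases A, C, G and T are
complemented.

```shell
# Hypothetical helper: print the reverse complement of a DNA sequence.
# Only A, C, G, T (upper or lower case) are complemented; any other
# character, such as an IUPAC wildcard, is passed through unchanged.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Hypothetical forward primer, used here only for illustration
FWDPRIMER=AAGCTTA
RCFWDPRIMER=$(revcomp "$FWDPRIMER")
echo "$RCFWDPRIMER"   # prints TAAGCTT
```

The resulting value can then be used in place of ``RCFWDPRIMER`` in the ``-A`` option.
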
Note that Cutadapt
does not reverse-complement any sequences on its own; you
will have to do that yourself.

Finally, you may want to filter the trimmed read pairs.
Use ``--discard-untrimmed`` to throw away all read pairs in
which R1 does not start with ``FWDPRIMER`` or in which R2
does not start with ``REVPRIMER``.

A note on how the filtering works: In linked adapters, by default
the first part (before the ``...``) is anchored. Anchored
sequences *must* occur. If they do not, then the other sequence
(after the ``...``) is not even searched for and the entire
read is internally marked as “untrimmed”. This is done for both
R1 and R2, and as soon as *any* of them is marked as “untrimmed”,
the entire pair is considered to be “untrimmed”. If
``--discard-untrimmed`` is used, this means that the entire
pair is discarded if R1 or R2 is untrimmed. (Option
``--pair-filter=both`` can be used to change this to require
that *both* were marked as untrimmed.)

In summary, this is how to trim your data and discard all
read pairs that do not contain the primer sequences that
you know must be there::

    cutadapt -a FWDPRIMER...RCREVPRIMER -A REVPRIMER...RCFWDPRIMER --discard-untrimmed -o out.1.fastq.gz -p out.2.fastq.gz in.1.fastq.gz in.2.fastq.gz


Piping paired-end data
----------------------

Sometimes it is necessary to run Cutadapt twice on your data, for example, when
you want to change the order in which read modification or filtering options are
applied. To simplify this, you can use Unix pipes (``|``), but this is more
difficult with paired-end data since input and output then consist of two files
each.

The solution is to interleave the paired-end data, send it over the pipe
and then de-interleave it in the other process.
Here is how this looks in
principle::

    cutadapt [options] --interleaved in.1.fastq.gz in.2.fastq.gz | \
      cutadapt [options] --interleaved -o out.1.fastq.gz -p out.2.fastq.gz -

Note the ``-`` character in the second invocation of Cutadapt, which tells it
to read its input from standard input.


Support for concatenated compressed files
-----------------------------------------

Cutadapt supports concatenated gzip and bzip2 input files.


Paired-end read name check
--------------------------

When reading paired-end files, Cutadapt checks whether the read names match.
Only the part of the read name before the first space is considered. If the
read name ends with ``1`` or ``2``, then that is also ignored. For example,
two FASTQ headers that would be considered to denote properly paired reads are::

    @my_read/1 a comment

and::

    @my_read/2 another comment

This is an example of *improperly paired* read names::

    @my_read/1;1

and::

    @my_read/2;1

Since the ``1`` and ``2`` are ignored only if they occur at the end of the read
name, and since the ``;1`` is considered to be part of the read name, these
reads will not be considered to be properly paired.


Rescuing single reads from paired-end reads that were filtered
--------------------------------------------------------------

When trimming and filtering paired-end reads, Cutadapt always discards entire read pairs. If you
want to keep one of the reads, you need to write the filtered read pairs to an output file and
postprocess it.

For example, assume you are using ``-m 30`` to discard reads that are too short. Cutadapt discards all
read pairs in which just one of the reads is too short (but see the ``--pair-filter`` option).
To recover those (individual) reads that are long enough, you can first use the
``--too-short-output`` and ``--too-short-paired-output`` options to write the filtered pairs to
files, and then postprocess those files to keep only the reads that are long enough::

    cutadapt -m 30 -q 20 -o out.1.fastq.gz -p out.2.fastq.gz --too-short-output=tooshort.1.fastq.gz --too-short-paired-output=tooshort.2.fastq.gz in.1.fastq.gz in.2.fastq.gz
    cutadapt -m 30 -o rescued.a.fastq.gz tooshort.1.fastq.gz
    cutadapt -m 30 -o rescued.b.fastq.gz tooshort.2.fastq.gz

The two output files ``rescued.a.fastq.gz`` and ``rescued.b.fastq.gz`` contain those individual
reads that are long enough. Note that the file names do not end in ``.1.fastq.gz`` and
``.2.fastq.gz`` to make it very clear that these files no longer contain synchronized paired-end
reads.


.. _bisulfite:

Bisulfite sequencing (RRBS)
---------------------------

When trimming reads that come from a library prepared with the RRBS (reduced
representation bisulfite sequencing) protocol, the last two 3' bases must be
removed in addition to the adapter itself. This can be achieved by prefixing
the adapter sequence with two wildcard characters instead of using the adapter
sequence by itself. If the adapter sequence is ``ADAPTER``, the command for
trimming should be::

    cutadapt -a NNADAPTER -o output.fastq input.fastq

Details can be found in `Babraham bioinformatics' "Brief guide to
RRBS" <http://www.bioinformatics.babraham.ac.uk/projects/bismark/RRBS_Guide.pdf>`_.
A summary follows.

During RRBS library preparation, DNA is digested with the restriction enzyme
MspI, generating a two-base overhang on the 5' end (``CG``). MspI recognizes
the sequence ``CCGG`` and cuts
between ``C`` and ``CGG``.
A double-stranded DNA fragment is cut in this way::

    5'-NNNC|CGGNNN-3'
    3'-NNNGGC|CNNN-5'

The fragment between two MspI restriction sites looks like this::

    5'-CGGNNN...NNNC-3'
    3'-CNNN...NNNGGC-5'

Before sequencing (or PCR) adapters can be ligated, the missing base positions
must be filled in with GTP and CTP::

    5'-ADAPTER-CGGNNN...NNNCcg-ADAPTER-3'
    3'-ADAPTER-gcCNNN...NNNGGC-ADAPTER-5'

The filled-in bases, marked in lowercase above, do not contain any original
methylation information, and must therefore not be used for methylation calling.
By prefixing the adapter sequence with ``NN``, the bases will be automatically
stripped during adapter trimming.


.. _file-format-conversion:

File format conversion
----------------------

You can use Cutadapt to convert FASTQ to FASTA format::

    cutadapt -o output.fasta.gz input.fastq.gz

Cutadapt detects that the file name extension of the output file is ``.fasta``
and writes in FASTA format, omitting the qualities.

When writing to standard output, you need to use the ``--fasta`` option::

    cutadapt --fasta input.fastq.gz > out.fasta

Without the option, Cutadapt writes in FASTQ format.


Other things (unfinished)
-------------------------

* How to detect adapters
* Use Cutadapt for quality-trimming only
* Use it for minimum/maximum length filtering