===============
Recipes and FAQ
===============

This section gives answers to frequently asked questions. It shows you how to
get Cutadapt to do what you want it to do!


Remove more than one adapter
----------------------------

If you want to remove a 5' and 3' adapter at the same time, :ref:`use the
support for linked adapters <linked-adapters>`.

If your situation is different, for example, when you have many 5' adapters
but only one 3' adapter, then you have two options.

First, you can specify the adapters and also ``--times=2`` (or the short
version ``-n 2``). For example::

    cutadapt -g ^TTAAGGCC -g ^AAGCTTA -a TACGGACT -n 2 -o output.fastq input.fastq

This instructs Cutadapt to run two rounds of adapter finding and removal. That
means that, after the first round and only when an adapter was actually found,
another round is performed. In both rounds, all given adapters are searched and
removed. The drawback is that the same adapter can be found in both rounds (so
the 3' adapter, for example, could be removed twice).

The second option is to not use the ``-n`` option, but to run Cutadapt twice,
first removing one adapter and then the other. It is easiest if you use a pipe
as in this example::

    cutadapt -g ^TTAAGGCC -g ^AAGCTTA input.fastq | cutadapt -a TACGGACT - > output.fastq


Trim poly-A tails
-----------------

If you want to trim a poly-A tail from the 3' end of your reads, use the 3'
adapter type (``-a``) with an adapter sequence of many repeated ``A``
nucleotides. Starting with version 1.8 of Cutadapt, you can use the
following notation to specify a sequence that consists of 100 ``A``
nucleotides::

    cutadapt -a "A{100}" -o output.fastq input.fastq

This also works when there are sequencing errors in the poly-A tail. So this
read ::

    TACGTACGTACGTACGAAATAAAAAAAAAAA

will be trimmed to::

    TACGTACGTACGTACG

If for some reason you would like to use a shorter sequence of ``A``, you can
do so. The matching algorithm always picks the leftmost match that it can find,
so Cutadapt will do the right thing even when the tail has more ``A`` than you
used in the adapter sequence. However, sequencing errors may result in shorter
matches than desired. For example, using ``-a "A{10}"``, the read above (where
the ``AAAT`` is followed by eleven ``A``) would be trimmed to::

    TACGTACGTACGTACGAAAT

Depending on your application, perhaps a variant of ``-a "A{10}N{90}"`` is an
alternative, forcing the match to be located as much to the left as possible,
while still allowing for non-``A`` bases towards the end of the read.
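
To make the brace notation concrete, here is a minimal Python sketch that
expands it into a plain sequence. The function name ``expand_braces`` is made
up for this illustration, and the code is not how Cutadapt itself parses
adapter sequences. It only assumes that a number in braces repeats the single
character in front of it::

    import re


    def expand_braces(sequence):
        """Expand X{n} into n copies of X, for example 'A{3}C' -> 'AAAC'."""
        return re.sub(
            r"([A-Za-z])\{(\d+)\}",
            lambda match: match.group(1) * int(match.group(2)),
            sequence,
        )


    # Ten A followed by ninety wildcard positions: 100 characters in total
    print(len(expand_braces("A{10}N{90}")))  # prints 100

Writing out the expanded sequence by hand is equivalent, but the brace
notation is shorter and less error-prone.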


Trim a fixed number of bases after adapter trimming
---------------------------------------------------

If the adapters you want to remove are preceded by some unknown sequence (such
as a random tag/molecular identifier), you can specify this as part of the
adapter sequence in order to remove both in one go.

For example, assume you want to trim Illumina adapters preceded by 10 bases
that you want to trim as well. Instead of this command::

    cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC ...

Use this command::

    cutadapt -O 13 -a N{10}AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC ...

The ``-O 13`` is the minimum overlap for an adapter match, where the 13 is
computed as 3 plus 10 (where 3 is the default minimum overlap and 10 is the
length of the unknown section). If you do not specify it, the adapter sequence
would match the end of every read (because ``N`` matches anything), and ten
bases would then be removed from every read.
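
The same arithmetic applies to any tag length. The Python sketch below builds
the two options for a tag of ``tag_length`` unknown bases. The helper
``tagged_adapter_args`` is hypothetical (it is not part of Cutadapt), and it
assumes the default minimum overlap of 3 stated above::

    DEFAULT_MIN_OVERLAP = 3  # default minimum overlap, as stated in the text


    def tagged_adapter_args(adapter, tag_length, base_overlap=DEFAULT_MIN_OVERLAP):
        """Return options that trim `adapter` plus `tag_length` bases before it."""
        # Raise the minimum overlap by the number of wildcard positions so
        # that the N prefix alone can never count as an adapter match.
        min_overlap = base_overlap + tag_length
        return ["-O", str(min_overlap), "-a", f"N{{{tag_length}}}{adapter}"]


    print(" ".join(tagged_adapter_args("AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC", 10)))
    # prints: -O 13 -a N{10}AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

If you already pass a different ``-O`` value for other reasons, use that value
as ``base_overlap`` instead of 3.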


Trimming (amplicon-) primers from both ends of paired-end reads
---------------------------------------------------------------

If you want to remove primer sequences that flank your sequence of
interest, you should use a :ref:`"linked adapter" <linked-adapters>`
to remove them. If you have paired-end data (with R1 and R2), you
can correctly trim both R1 and R2 by using linked adapters for both
R1 and R2. Here is how to do this.

The full DNA fragment that is put on the sequencer looks like this
(looking only at the forward strand):

   5' sequencing primer -- forward primer -- sequence of interest -- reverse complement of reverse primer -- reverse complement of 3' sequencing primer

Since sequencing of R1 starts after the 5' sequencing primer, R1 will
start with the forward primer and then continue into the sequence of
interest and into the two primers to the right of it, depending on
the read length and how long the sequence of interest is. For R1,
the linked adapter option that needs to be used is therefore ::

    -a FWDPRIMER...RCREVPRIMER

where ``FWDPRIMER`` needs to be replaced with the sequence of your
forward primer and ``RCREVPRIMER`` with the reverse complement of
the reverse primer. The three dots ``...`` need to be entered
as they are: they tell Cutadapt that this is a linked adapter
with a 5' and a 3' part.

Sequencing of R2 starts before the 3' sequencing primer and
proceeds along the reverse-complementary strand. For the correct
linked adapter, the sequences from above therefore need to be
swapped and reverse-complemented::

    -A REVPRIMER...RCFWDPRIMER

The uppercase ``-A`` specifies that this option is
meant to work on R2. Similar to above, ``REVPRIMER`` is
the sequence of the reverse primer and ``RCFWDPRIMER`` is the
reverse complement of the forward primer. Note that Cutadapt
does not reverse-complement any sequences on its own; you
will have to do that yourself.
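
Since the reverse complements have to be computed by hand, a small script can
help avoid mistakes. The following Python sketch uses made-up placeholder
primers and a hypothetical helper named ``reverse_complement``. It handles only
``A``, ``C``, ``G``, ``T`` and ``N`` and would need to be extended for other
IUPAC codes::

    # Translation table for complementing; case is preserved
    COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


    def reverse_complement(sequence):
        """Complement every base, then reverse the result."""
        return sequence.translate(COMPLEMENT)[::-1]


    fwd_primer = "ACGTACGGTT"  # placeholder, replace with your forward primer
    rev_primer = "TTGGCCATAC"  # placeholder, replace with your reverse primer

    # R1 starts with the forward primer, R2 starts with the reverse primer
    r1_adapter = f"{fwd_primer}...{reverse_complement(rev_primer)}"
    r2_adapter = f"{rev_primer}...{reverse_complement(fwd_primer)}"

    print("-a", r1_adapter, "-A", r2_adapter)

The printed options can be pasted into the Cutadapt command line in place of
the ``FWDPRIMER...RCREVPRIMER`` and ``REVPRIMER...RCFWDPRIMER`` placeholders.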

Finally, you may want to filter the trimmed read pairs.
Use ``--discard-untrimmed`` to throw away all read pairs in
which R1 does not start with ``FWDPRIMER`` or in which R2
does not start with ``REVPRIMER``.

A note on how the filtering works: In linked adapters, by default
the first part (before the ``...``) is anchored. Anchored
sequences *must* occur. If they don’t, then the other sequence
(after the ``...``) is not even searched for and the entire
read is internally marked as “untrimmed”. This is done for both
R1 and R2 and as soon as *any* of them is marked as “untrimmed”,
the entire pair is considered to be “untrimmed”. If
``--discard-untrimmed`` is used, this means that the entire
pair is discarded if R1 or R2 is untrimmed. (Option
``--pair-filter=both`` can be used to change this to require
that *both* were marked as untrimmed.)

In summary, this is how to trim your data and discard all
read pairs that do not contain the primer sequences that
you know must be there::

    cutadapt -a FWDPRIMER...RCREVPRIMER -A REVPRIMER...RCFWDPRIMER --discard-untrimmed -o out.1.fastq.gz -p out.2.fastq.gz in.1.fastq.gz in.2.fastq.gz


Piping paired-end data
----------------------

Sometimes it is necessary to run Cutadapt twice on your data, for example when
you want to change the order in which read modification or filtering options are
applied. To simplify this, you can use Unix pipes (``|``), but this is more
difficult with paired-end data since input and output then consist of two files
each.

The solution is to interleave the paired-end data, send it over the pipe
and then de-interleave it in the other process. Here is how this looks in
principle::

    cutadapt [options] --interleaved in.1.fastq.gz in.2.fastq.gz | \
      cutadapt [options] --interleaved -o out.1.fastq.gz -p out.2.fastq.gz -

Note the ``-`` character in the second invocation of Cutadapt, which tells it
to read its input from standard input.
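
To illustrate what interleaving means, the following Python sketch alternates
records from R1 and R2 and then splits them again. It works on plain lists of
placeholder records rather than real FASTQ files, and the function names are
made up for this example. It mirrors the idea behind ``--interleaved``, not
Cutadapt's implementation::

    def interleave(r1_records, r2_records):
        """Return one list in which each R1 record is followed by its mate."""
        if len(r1_records) != len(r2_records):
            raise ValueError("R1 and R2 must contain the same number of records")
        interleaved = []
        for r1, r2 in zip(r1_records, r2_records):
            interleaved.extend([r1, r2])
        return interleaved


    def deinterleave(records):
        """Split an interleaved list back into (R1 records, R2 records)."""
        if len(records) % 2 != 0:
            raise ValueError("an interleaved stream has an even number of records")
        return records[0::2], records[1::2]


    r1 = ["read1/1", "read2/1"]
    r2 = ["read1/2", "read2/2"]
    mixed = interleave(r1, r2)
    print(mixed)  # prints ['read1/1', 'read1/2', 'read2/1', 'read2/2']
    print(deinterleave(mixed) == (r1, r2))  # prints True

Because each pair travels together in a single stream, one pipe is enough to
connect the two Cutadapt processes.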


Support for concatenated compressed files
-----------------------------------------

Cutadapt supports concatenated gzip and bzip2 input files.


Paired-end read name check
--------------------------

When reading paired-end files, Cutadapt checks whether the read names match.
Only the part of the read name before the first space is considered. If the
read name ends with ``1`` or ``2``, then that is also ignored. For example,
two FASTQ headers that would be considered to denote properly paired reads are::

    @my_read/1 a comment

and::

    @my_read/2 another comment

This is an example of *improperly paired* read names::

    @my_read/1;1

and::

    @my_read/2;1

Since the ``1`` and ``2`` are ignored only if they occur at the end of the read
name, and since the ``;1`` is considered to be part of the read name, these
reads will not be considered to be properly paired.
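
The rule can be written down as a short Python sketch. It follows the
description above literally and is not Cutadapt's actual code; the function
names ``pairing_key`` and ``properly_paired`` are made up for this
illustration::

    def pairing_key(header):
        """Return the part of a FASTQ header that is compared between R1 and R2."""
        name = header[1:] if header.startswith("@") else header
        name = name.split(" ", 1)[0]  # keep only the part before the first space
        if name.endswith(("1", "2")):  # a trailing 1 or 2 is ignored
            name = name[:-1]
        return name


    def properly_paired(header1, header2):
        return pairing_key(header1) == pairing_key(header2)


    print(properly_paired("@my_read/1 a comment", "@my_read/2 another comment"))
    # prints True
    print(properly_paired("@my_read/1;1", "@my_read/2;1"))
    # prints False

In the second case, only the final ``1`` is dropped, so the keys
``my_read/1;`` and ``my_read/2;`` still differ.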


Rescuing single reads from paired-end reads that were filtered
--------------------------------------------------------------

When trimming and filtering paired-end reads, Cutadapt always discards entire read pairs. If you
want to keep one of the reads, you need to write the filtered read pairs to an output file and
postprocess it.

For example, assume you are using ``-m 30`` to discard too short reads. Cutadapt discards all
read pairs in which just one of the reads is too short (but see the ``--pair-filter`` option).
To recover those (individual) reads that are long enough, you can first use the
``--too-short-output`` and ``--too-short-paired-output`` options to write the filtered pairs to
files, and then postprocess those files to keep only the reads that are long enough::

    cutadapt -m 30 -q 20 -o out.1.fastq.gz -p out.2.fastq.gz \
      --too-short-output=tooshort.1.fastq.gz \
      --too-short-paired-output=tooshort.2.fastq.gz \
      in.1.fastq.gz in.2.fastq.gz
    cutadapt -m 30 -o rescued.a.fastq.gz tooshort.1.fastq.gz
    cutadapt -m 30 -o rescued.b.fastq.gz tooshort.2.fastq.gz

The two output files ``rescued.a.fastq.gz`` and ``rescued.b.fastq.gz`` contain those individual
reads that are long enough. Note that the file names do not end in ``.1.fastq.gz`` and
``.2.fastq.gz`` to make it very clear that these files no longer contain synchronized paired-end
reads.


.. _bisulfite:

Bisulfite sequencing (RRBS)
---------------------------

When trimming reads that come from a library prepared with the RRBS (reduced
representation bisulfite sequencing) protocol, the last two 3' bases must be
removed in addition to the adapter itself. This can be achieved by prefixing
the adapter sequence with two wildcard characters. If the adapter sequence is
``ADAPTER``, the command for trimming should be::

    cutadapt -a NNADAPTER -o output.fastq input.fastq

Details can be found in `Babraham bioinformatics' "Brief guide to
RRBS" <http://www.bioinformatics.babraham.ac.uk/projects/bismark/RRBS_Guide.pdf>`_.
A summary follows.

During RRBS library preparation, DNA is digested with the restriction enzyme
MspI, generating a two-base overhang on the 5' end (``CG``). MspI recognizes
the sequence ``CCGG`` and cuts
between ``C`` and ``CGG``. A double-stranded DNA fragment is cut in this way::

    5'-NNNC|CGGNNN-3'
    3'-NNNGGC|CNNN-5'

The fragment between two MspI restriction sites looks like this::

    5'-CGGNNN...NNNC-3'
      3'-CNNN...NNNGGC-5'

Before sequencing (or PCR) adapters can be ligated, the missing base positions
must be filled in with GTP and CTP::

    5'-ADAPTER-CGGNNN...NNNCcg-ADAPTER-3'
    3'-ADAPTER-gcCNNN...NNNGGC-ADAPTER-5'

The filled-in bases, marked in lowercase above, do not contain any original
methylation information, and must therefore not be used for methylation calling.
By prefixing the adapter sequence with ``NN``, the bases will be automatically
stripped during adapter trimming.
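
The effect of the ``NN`` prefix can be illustrated with a small Python sketch.
It uses exact string matching on a full-length adapter only, so it ignores the
error tolerance and partial matches that Cutadapt supports. The function name
``trim_rrbs`` and the example read are made up::

    def trim_rrbs(read, adapter, filled_in=2):
        """Remove the adapter and the `filled_in` bases directly before it."""
        position = read.find(adapter)
        if position == -1:
            return read  # adapter not found: leave the read unchanged
        return read[: max(0, position - filled_in)]


    adapter = "AGATCGGAAGAGC"
    # original fragment + two filled-in bases + adapter
    read = "CGGTTTAGTTAC" + "CG" + adapter
    print(trim_rrbs(read, adapter))  # prints CGGTTTAGTTAC

Reads in which no adapter is found keep their last two bases, which matches
the behavior of ``-a NNADAPTER``: the wildcards are removed only as part of an
adapter match.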


.. _file-format-conversion:

File format conversion
----------------------

You can use Cutadapt to convert FASTQ to FASTA format::

    cutadapt -o output.fasta.gz input.fastq.gz

Cutadapt detects the ``.fasta`` extension in the output file name (the
trailing ``.gz`` only selects gzip compression) and writes in FASTA format,
omitting the qualities.

When writing to standard output, you need to use the ``--fasta`` option::

    cutadapt --fasta input.fastq.gz > out.fasta

Without the option, Cutadapt writes in FASTQ format.
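
For illustration, the conversion amounts to keeping the header and the
sequence of each record and dropping the rest. The Python sketch below assumes
uncompressed FASTQ text with exactly four lines per record (no wrapped
sequences) and uses a made-up function name. It is not Cutadapt's
implementation::

    def fastq_to_fasta(fastq_text):
        """Convert four-line FASTQ records to FASTA, dropping the qualities."""
        lines = fastq_text.strip().splitlines()
        if len(lines) % 4 != 0:
            raise ValueError("expected four lines per FASTQ record")
        fasta_lines = []
        for i in range(0, len(lines), 4):
            header, sequence = lines[i], lines[i + 1]
            if not header.startswith("@"):
                raise ValueError(f"not a FASTQ header: {header!r}")
            fasta_lines.append(">" + header[1:])  # '@name' becomes '>name'
            fasta_lines.append(sequence)
        return "".join(line + "\n" for line in fasta_lines)


    print(fastq_to_fasta("@my_read a comment\nACGT\n+\nIIII\n"), end="")
    # prints:
    # >my_read a comment
    # ACGT

For real data, prefer the Cutadapt commands above, which also handle
compressed files.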


Other things (unfinished)
-------------------------

* How to detect adapters
* Use Cutadapt for quality-trimming only
* Use it for minimum/maximum length filtering
