1.. sidebar:: ToC
2
3    .. contents::
4
5.. _tutorial-io-blast-io:
6
7Blast I/O
8=========
9
10Learning Objective
11  In this tutorial, you will learn about different the Blast file formats and how to interact with them in SeqAn.
12
13Difficulty
14  Average
15
16Duration
17  1h30min - 2h30min
18
19Prerequisite Tutorials
20  :ref:`tutorial-datastructures-sequences`, :ref:`tutorial-io-input-output-overview`, :ref:`tutorial-datastructures-alignment`, :ref:`tutorial-algorithms-alignment-pairwise-sequence-alignment`
21
22Other recommended reading
23  `Basics of Blast Statistics <http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html>`_
24
25Technical requirements
26  Full C++11 support required in compiler
27  (GCC :math:`\ge` 4.9, Clang :math:`\ge` 3.4 or MSVC :math:`\ge` 2015)
28
29
30The Basic local alignment search tool (BLAST) by the NCBI is one of the most widely used tools in Bioinformatics.
31It supports a variety of formats which, although widely used, are not standardized and partly even declared as
32"subject to unannounced and undocumented change". This makes implementing them very hard and is one of the reasons
33dealing with Blast IO is more difficult than with other file formats.
34
35Furthermore it is important to distinguish between the Blast version written in C (known by its `blastall` executable)
36and the C++ version (individual `blastn`, `blastp`... executables), also known as BLAST+.
37The formats and their identifiers changed between versions.
38The following table gives an overview:
39
40+---------------------------------------------------+------------------+-------------+
41| Description                                       | `blastall`       |  `blast*`   |
42+===================================================+==================+=============+
43| pairwise                                          |  -m 0            |  -outfmt 0  |
44+---------------------------------------------------+------------------+-------------+
45| query-anchored showing identities                 |  -m 1            |  -outfmt 1  |
46+---------------------------------------------------+------------------+-------------+
47| query-anchored no identities                      |  -m 2            |  -outfmt 2  |
48+---------------------------------------------------+------------------+-------------+
49| flat query-anchored, show identities              |  -m 3            |  -outfmt 3  |
50+---------------------------------------------------+------------------+-------------+
51| flat query-anchored, no identities                |  -m 4            |  -outfmt 4  |
52+---------------------------------------------------+------------------+-------------+
53| query-anchored no identities and blunt ends       |  -m 5            |             |
54+---------------------------------------------------+------------------+-------------+
55| flat query-anchored, no identities and blunt ends |  -m 6            |             |
56+---------------------------------------------------+------------------+-------------+
57| XML Blast output                                  |  -m 7            |  -outfmt 5  |
58+---------------------------------------------------+------------------+-------------+
59| tabular                                           |  -m 8            |  -outfmt 6  |
60+---------------------------------------------------+------------------+-------------+
61| tabular with comment lines                        |  -m 9            |  -outfmt 7  |
62+---------------------------------------------------+------------------+-------------+
63| Text ASN.1                                        |  -m 10           |  -outfmt 8  |
64+---------------------------------------------------+------------------+-------------+
65| Binary ASN.1                                      |  -m 11           |  -outfmt 9  |
66+---------------------------------------------------+------------------+-------------+
67| Comma-separated values                            |                  |  -outfmt 10 |
68+---------------------------------------------------+------------------+-------------+
69| BLAST archive format (ASN.1)                      |                  |  -outfmt 11 |
70+---------------------------------------------------+------------------+-------------+
71
72The files written by `blastall` are considered the
73"legacy"-format in SeqAn. SeqAn has support for the following formats:
74
75+-----------------------------+----------------+----------------+--------------------+
76| Format                      | read support   |  write support | based on version   |
77+=============================+================+================+====================+
78| pairwise                    |                |       ✓        |  Blast-2.2.26+     |
79+-----------------------------+----------------+----------------+--------------------+
80| tabular                     |      ✓         |       ✓        |  Blast-2.2.26+     |
81+-----------------------------+----------------+----------------+--------------------+
82| tabular (legacy)            |      ✓         |       ✓        |  Blast-2.2.26      |
83+-----------------------------+----------------+----------------+--------------------+
84| tabular w comments          |      ✓         |       ✓        |  Blast-2.2.26+     |
85+-----------------------------+----------------+----------------+--------------------+
86| tabular w comments (legacy) |      ✓         |       ✓        |  Blast-2.2.26      |
87+-----------------------------+----------------+----------------+--------------------+
88
89.. caution::
90
91    Please note that *Blast-2.2.26+* is **not the same** as *Blast-2.2.26*! One is version 2.2.26 of the C++
92    application suite (BLAST+) and the other is version 2.2.26 of the legacy application suite. There still are software
93    releases for both generations.
94
95There are different *program modes* in Blast which also influence the file format. In the legacy application suite
96these where specified with the ``-p`` parameter, in the BLAST+ suite they each have their own executable.
97
98+-----------------------------+------------------+--------------------+
99| Program mode                | query alphabet   |  subject alphabet  |
100+=============================+==================+====================+
101| BlastN                      | nucleotide       |  nucleotide        |
102+-----------------------------+------------------+--------------------+
103| BlastP                      | protein          |  protein           |
104+-----------------------------+------------------+--------------------+
105| BlastX                      | translated nucl. |  protein           |
106+-----------------------------+------------------+--------------------+
107| TBlastN                     | protein          |  translated nucl.  |
108+-----------------------------+------------------+--------------------+
109| TBlastX                     | translated nucl. |  translated nucl.  |
110+-----------------------------+------------------+--------------------+
111
112Tabular formats
113---------------
114
115The tabular formats are tab-seperated-value formats (TSV), with twelve columns by default.
116Each line represents one match (or *high scoring pair* in Blast terminology).
117The twelve default columns are:
118
119    1. Query sequence ID (truncated at first whitespace)
120    2. Subject sequence ID (truncated at first whitespace)
121    3. Percentage of identical positions
122    4. Alignment length
123    5. Number of mismatches
124    6. Number of gap openings
125    7. Start position of alignment on query sequence
126    8. End position of alignment on query sequence
127    9. Start position of alignment on subject sequence
128    10. End position of alignment on subject sequence
129    11. Expect value (length normalized bit score)
130    12. Bit score (statistical significance indicator)
131
132.. note:: Alignment positions in Blast
133
134   #. **Interval notation:** Blast uses 1-based closed intervals for positions, i.e. a match from the 100th position to
135      the 200th position of a sequence will be shown as ``100  200`` in the file. SeqAn internally uses 0-based
136      half open intervals, i.e. it starts counting at position 0 and stores the first position behind the sequence
137      as "end", e.g. position ``99`` and ``200`` for our example.
138   #. **Reverse strands:** For matches found on the reverse complement strand the positions are counted backwards from
139      the end of the sequence, e.g. a match from the 100th position to the 200th position on a reverse complement strand
140      of a sequence of length 500 will be shown as ``400 300`` in the file.
141   #. **Translation frames:** Positions given in the file are always on the original untranslated sequence!
142
143   The ``writeRecord()`` function automatically does all of these conversions!
144
145
146A **tabular** file could look like this (matches per query are sorted by e-value):
147
148.. literalinclude:: ../../../../tests/blast/defaultfields.m8
149
150The **tabular with comment lines** format additionally prefixes every block belonging to one query sequence with
151comment lines that include the program version, the database name and column labels. The above example would look
152like this:
153
154.. literalinclude:: ../../../../tests/blast/defaultfields.m9
155    :lines: 5-37
156
157As you can see, comment lines are also printed for query sequences which don't have any matches.
158The major
159difference of these formats in BLAST+ vs the legacy application are that the *mismatches* column used to
160include the number of gap characters, but it does not in BLAST+. The comments also look slightly different
161in the **tabular with comment lines (legacy)** format:
162
163.. literalinclude:: ../../../../tests/blast/defaultfields_legacy.m9
164    :lines: 5-35
165
166
167
168Pairwise format
169---------------
170
171The pairwise format is the default format in Blast. It is more verbose than the tabular formats and very human readable.
172
173This is what the last record from above would look like (the other queries are omitted):
174
175.. literalinclude:: ../../../../demos/tutorial/blast_io/plus_sub.m0
176
177
178Blast formats in SeqAn
179----------------------
180
181There are three blast format related tags in SeqAn:
182
183  #. :dox:`BlastReport` for the pairwise format with the :dox:`FormattedFile` output specialization
184     :dox:`BlastReportFileOut`.
185  #. :dox:`BlastTabular` for the tabular formats with the :dox:`FormattedFile` output and input specializations
186     :dox:`BlastTabularFileOut` and :dox:`BlastTabularFileIn`.
187  #. :dox:`BlastTabularLL` which provides light-weight, but very basic tabular IO.
188
189The third tag can be used for simple file manipulation, e.g. filtering or column rearrangement, it is not covered in
190this tutorial (see the dox for a simple example).
191
192To work with the first two formats you need to understand at least the following data structures:
193  * :dox:`BlastRecord`: the record covers all :dox:`BlastMatch` es belonging to one query sequence.
194  * :dox:`FormattedFile`: one of :dox:`BlastReportFileOut`, :dox:`BlastTabularFileOut` and :dox:`BlastTabularFileIn`.
195  * :dox:`BlastIOContext`: the context of the FormattedFile.
196
197The context contains file-global data like the name of the database and can also be used to read/write certain file
198format properties, e.g. "with comment lines" or "legacyFormat".
199
200.. caution::
201    Due to the structure of blast tabular files lots of information is repeated in every block of comment lines, e.g.
202    the database name. Because it is expected that these stay the same they are saved in the context and not the record.
203    You may still, however, check every time you ``readRecord()`` if you want to make sure.
204
205File reading example
206--------------------
207
208Only tabular formats are covered in this example, because no input support is available for the pairwise format.
209
210Copy the contents of the **tabular with comment lines** example above into a file and give it to the following
211program as the only parameter. Please use ``.m9`` as file type extension.
212
213.. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
214    :language: c++
215    :lines: 1-18, 74-84
216
217Assignment 1
218""""""""""""
219
220.. container:: assignment
221
222  Objective
223    Complete the above example by reading the file according to :dox:`BlastTabularFileIn`.
224    For every record print the query ID, the number of contained matches and the bit-score of the best match.
225
226  Solution
227      Top
228        .. container:: foldable
229
230            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
231                :language: c++
232                :lines: 1-18
233
234      New code
235        .. container:: foldable
236
237            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
238                :language: c++
239                :lines: 19-34, 57-60
240
241      Bottom
242          .. container:: foldable
243
244            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
245                :language: c++
246                :lines: 74-84
247
248Assignment 2
249""""""""""""
250
251.. container:: assignment
252
253  Objective
254    Study the documentation of :dox:`BlastIOContext`. How can you adapt the previous program to check if there were any
255    problems reading a record? If you have come up with a solution, try to read the file at
256    ``tests/blast/defaultfields.m9``. What does the program print and why?
257
258  Solution
259      Top
260        .. container:: foldable
261
262            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
263                :language: c++
264                :lines: 1-34
265
266      New code
267        .. container:: foldable
268
269            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
270                :language: c++
271                :lines: 41-60
272
273            The program will print conformancyErrors for the last record, because there is a typo in the file
274            ( ``Datacase`` instead of ``Database`` ).
275
276      Bottom
277          .. container:: foldable
278
279            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
280                :language: c++
281                :lines: 74-84
282
283Assignment 3
284""""""""""""
285
286.. container:: assignment
287
288  Objective
289    Now that you have a basic understanding of :dox:`BlastIOContext`, also print the following information after
290    reading the records:
291
292      * file format (with comment lines or without, BLAST+ or legacy?)
293      * blast program and version
294      * name of database
295
296    Verify that the results are as expected on the files ``tests/blast/defaultfields.m8``,
297    ``tests/blast/defaultfields.m9`` and ``tests/blast/defaultfields_legacy.m9``.
298
299  Solution
300      Top
301        .. container:: foldable
302
303            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
304                :language: c++
305                :lines: 1-34, 41-60
306
307      New code
308        .. container:: foldable
309
310            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
311                :language: c++
312                :lines: 61-73
313
314      Bottom
315          .. container:: foldable
316
317            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
318                :language: c++
319                :lines: 74-84
320
321Assignment 4
322""""""""""""
323
324As previously mentioned, twelve columns are printed by default.
325This can be changed in BLAST+, also by means of the ``--outfmt`` parameter.
326A standards compliant **file with comment lines** and custom column composition can be read
327without further configuration in SeqAn.
328
329.. tip::
330    Don't believe it? Look at ``tests/blast/customfields.m9``, as as you can see the bit score is in the 13th column
331    (instead of the twelfth). If you run your program on this file, it should still print the correct bit-scores!
332
333.. container:: assignment
334
335  Objective
336    Read :dox:`BlastIOContext` again focusing on :dox:`BlastIOContext::fields` and also read :dox:`BlastMatchField`.
337    Now adapt the previous program to print for every record the ``optionLabel`` of each field used.
338
339    Verify that the results are as expected on the files ``tests/blast/defaultfields.m9`` and
340    ``tests/blast/customfields.m9``.
341
342  Solution
343      Top
344        .. container:: foldable
345
346            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
347                :language: c++
348                :lines: 1-34
349
350      New code
351        .. container:: foldable
352
353            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
354                :language: c++
355                :lines: 35-40
356
357      Bottom
358          .. container:: foldable
359
360            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
361                :language: c++
362                :lines: 41-84
363
364      If this was too easy, you can also try the same for tabular files without comment lines!
365
366File writing example
367--------------------
368
369The following program stub creates three query sequences and two subject sequences in amino acid alphabet.
370We will later generate records with matches and print these to disk.
371
372.. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
373    :language: c++
374    :lines: 1-49, 105-114
375
376Assignment 5
377""""""""""""
378
379.. container:: assignment
380
381  Objective
382    Before we can begin to align, certain properties of the :dox:`BlastIOContext` have to be set.
383    Which ones? Add a block with the necessary initialization to the above code, use blast defaults where possible.
384    Although not strictly required at this point: include a call to ``writeHeader()``.
385
386    .. caution::
387        Alignment score computation works slightly different in Blast and in SeqAn, please have a look at
388        :dox:`BlastScoringScheme`. For the task at hand it should suffice to simply use the corresponding
389        ``set*()`` functions on :dox:`BlastIOContext::scoringScheme`.
390
391  Solution
392      Top
393        .. container:: foldable
394
395            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
396                :language: c++
397                :lines: 1-49
398
399      New code
400        .. container:: foldable
401
402            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
403                :language: c++
404                :lines: 50-64
405
406      Bottom
407          .. container:: foldable
408
409            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
410                :language: c++
411                :lines: 106-116
412
413Assignment 6
414""""""""""""
415
416.. container:: assignment
417
418  Objective
419        Next create a record for every query sequence, and in each record a match for every query-subject pair.
420        Compute the local alignment for each of those matches. Use the align member of the match object.
421
422  Solution
423      Top
424        .. container:: foldable
425
426            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
427                :language: c++
428                :lines: 1-64
429
430      New Code
431       .. container:: foldable
432
433            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
434                :language: c++
435                :lines: 65-82, 97-99, 103
436
437      Bottom
438       .. container:: foldable
439
440            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
441                :language: c++
442                :lines: 106-116
443
444Assignment 7
445""""""""""""
446
447.. container:: assignment
448
449  Objective
450        Now that you have the align member computed for every match, also save the begin and end positions, as well
451        as the lengths. Blast Output needs to now about the number of gaps, mismatches... of every match, how can
452        they be computed?
453
454  Solution
455      Top
456        .. container:: foldable
457
458            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
459                :language: c++
460                :lines: 1-82
461
462      New Code
463       .. container:: foldable
464
465            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
466                :language: c++
467                :lines: 83-92
468
469      Bottom
470       .. container:: foldable
471
472            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
473                :language: c++
474                :lines: 97-99, 103, 106-116
475
476Assignment 8
477""""""""""""
478
479.. container:: assignment
480
481  Objective
482        Finally add e-value statistics and print the results to a file:
483         * compute the bit score and e-value for every match
484         * discard matches with an e-value greater than 1
485         * for every record, sort the matches by bit-score
486         * write each record
487         * write the footer before exiting the program
488
489  Solution
490      Top
491       .. container:: foldable
492
493            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
494                :language: c++
495                :lines: 1-92
496
497      New Code
498       .. container:: foldable
499
500            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
501                :language: c++
502                :lines: 93-96
503
504      Bottom
505       .. container:: foldable
506
507            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
508                :language: c++
509                :lines: 97-116
510
511      Your output file should look like this:
512        .. container:: foldable
513
514            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.m9
515
516Assignment 9
517""""""""""""
518
519.. container:: assignment
520
521  Objective
522    Up until now you have only printed the **tabular with comment lines** format. What do you have to do to print
523    without comment lines? In legacy format? What about the pairwise format?
524
525  Solution
526    Tabular without comment lines
527      .. container:: foldable
528
529        Add ``context(outfile).tabularSpec = BlastTabularSpec::NO_COMMENTS`` at l.53.
530        Remember to use ``.m8`` as file extension!
531
532        The result should look like this:
533
534        .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.m8
535
536    Tabular without comment lines (legacy)
537      .. container:: foldable
538
539        To print in legacy tabular format (with or without comment lines), add ``context(outfile).legacyFormat = true`` at l.53.
540
541        The result should look like this(legacy and NO_COMMENTS):
542
543        .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment_legacy.m8
544
545    Pairwise format
546      .. container:: foldable
547
548        To print in the pairwise format replace l.52 with ``BlastReportFileOut<TContext> outfile(argv[1]);``.
549
550        Remember to use ``.m0`` as file extension!
551
552        The result should look like this:
553
554        .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.m0
555