1.. sidebar:: ToC 2 3 .. contents:: 4 5.. _tutorial-io-blast-io: 6 7Blast I/O 8========= 9 10Learning Objective 11 In this tutorial, you will learn about different the Blast file formats and how to interact with them in SeqAn. 12 13Difficulty 14 Average 15 16Duration 17 1h30min - 2h30min 18 19Prerequisite Tutorials 20 :ref:`tutorial-datastructures-sequences`, :ref:`tutorial-io-input-output-overview`, :ref:`tutorial-datastructures-alignment`, :ref:`tutorial-algorithms-alignment-pairwise-sequence-alignment` 21 22Other recommended reading 23 `Basics of Blast Statistics <http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html>`_ 24 25Technical requirements 26 Full C++11 support required in compiler 27 (GCC :math:`\ge` 4.9, Clang :math:`\ge` 3.4 or MSVC :math:`\ge` 2015) 28 29 30The Basic local alignment search tool (BLAST) by the NCBI is one of the most widely used tools in Bioinformatics. 31It supports a variety of formats which, although widely used, are not standardized and partly even declared as 32"subject to unannounced and undocumented change". This makes implementing them very hard and is one of the reasons 33dealing with Blast IO is more difficult than with other file formats. 34 35Furthermore it is important to distinguish between the Blast version written in C (known by its `blastall` executable) 36and the C++ version (individual `blastn`, `blastp`... executables), also known as BLAST+. 37The formats and their identifiers changed between versions. 38The following table gives an overview: 39 40+---------------------------------------------------+------------------+-------------+ 41| Description | `blastall` | `blast*` | 42+===================================================+==================+=============+ 43| pairwise | -m 0 | -outfmt 0 | 44+---------------------------------------------------+------------------+-------------+ 45| query-anchored showing identities | -m 1 | -outfmt 1 | 46+---------------------------------------------------+------------------+-------------+ 47| query-anchored no identities | -m 2 | -outfmt 2 | 48+---------------------------------------------------+------------------+-------------+ 49| flat query-anchored, show identities | -m 3 | -outfmt 3 | 50+---------------------------------------------------+------------------+-------------+ 51| flat query-anchored, no identities | -m 4 | -outfmt 4 | 52+---------------------------------------------------+------------------+-------------+ 53| query-anchored no identities and blunt ends | -m 5 | | 54+---------------------------------------------------+------------------+-------------+ 55| flat query-anchored, no identities and blunt ends | -m 6 | | 56+---------------------------------------------------+------------------+-------------+ 57| XML Blast output | -m 7 | -outfmt 5 | 58+---------------------------------------------------+------------------+-------------+ 59| tabular | -m 8 | -outfmt 6 | 60+---------------------------------------------------+------------------+-------------+ 61| tabular with comment lines | -m 9 | -outfmt 7 | 62+---------------------------------------------------+------------------+-------------+ 63| Text ASN.1 | -m 10 | -outfmt 8 | 64+---------------------------------------------------+------------------+-------------+ 65| Binary ASN.1 | -m 11 | -outfmt 9 | 66+---------------------------------------------------+------------------+-------------+ 67| Comma-separated values | | -outfmt 10 | 68+---------------------------------------------------+------------------+-------------+ 69| BLAST archive format (ASN.1) | | -outfmt 11 | 70+---------------------------------------------------+------------------+-------------+ 71 72The files written by `blastall` are considered the 73"legacy"-format in SeqAn. SeqAn has support for the following formats: 74 75+-----------------------------+----------------+----------------+--------------------+ 76| Format | read support | write support | based on version | 77+=============================+================+================+====================+ 78| pairwise | | ✓ | Blast-2.2.26+ | 79+-----------------------------+----------------+----------------+--------------------+ 80| tabular | ✓ | ✓ | Blast-2.2.26+ | 81+-----------------------------+----------------+----------------+--------------------+ 82| tabular (legacy) | ✓ | ✓ | Blast-2.2.26 | 83+-----------------------------+----------------+----------------+--------------------+ 84| tabular w comments | ✓ | ✓ | Blast-2.2.26+ | 85+-----------------------------+----------------+----------------+--------------------+ 86| tabular w comments (legacy) | ✓ | ✓ | Blast-2.2.26 | 87+-----------------------------+----------------+----------------+--------------------+ 88 89.. caution:: 90 91 Please note that *Blast-2.2.26+* is **not the same** as *Blast-2.2.26*! One is version 2.2.26 of the C++ 92 application suite (BLAST+) and the other is version 2.2.26 of the legacy application suite. There still are software 93 releases for both generations. 94 95There are different *program modes* in Blast which also influence the file format. In the legacy application suite 96these where specified with the ``-p`` parameter, in the BLAST+ suite they each have their own executable. 97 98+-----------------------------+------------------+--------------------+ 99| Program mode | query alphabet | subject alphabet | 100+=============================+==================+====================+ 101| BlastN | nucleotide | nucleotide | 102+-----------------------------+------------------+--------------------+ 103| BlastP | protein | protein | 104+-----------------------------+------------------+--------------------+ 105| BlastX | translated nucl. | protein | 106+-----------------------------+------------------+--------------------+ 107| TBlastN | protein | translated nucl. | 108+-----------------------------+------------------+--------------------+ 109| TBlastX | translated nucl. | translated nucl. | 110+-----------------------------+------------------+--------------------+ 111 112Tabular formats 113--------------- 114 115The tabular formats are tab-seperated-value formats (TSV), with twelve columns by default. 116Each line represents one match (or *high scoring pair* in Blast terminology). 117The twelve default columns are: 118 119 1. Query sequence ID (truncated at first whitespace) 120 2. Subject sequence ID (truncated at first whitespace) 121 3. Percentage of identical positions 122 4. Alignment length 123 5. Number of mismatches 124 6. Number of gap openings 125 7. Start position of alignment on query sequence 126 8. End position of alignment on query sequence 127 9. Start position of alignment on subject sequence 128 10. End position of alignment on subject sequence 129 11. Expect value (length normalized bit score) 130 12. Bit score (statistical significance indicator) 131 132.. note:: Alignment positions in Blast 133 134 #. **Interval notation:** Blast uses 1-based closed intervals for positions, i.e. a match from the 100th position to 135 the 200th position of a sequence will be shown as ``100 200`` in the file. SeqAn internally uses 0-based 136 half open intervals, i.e. it starts counting at position 0 and stores the first position behind the sequence 137 as "end", e.g. position ``99`` and ``200`` for our example. 138 #. **Reverse strands:** For matches found on the reverse complement strand the positions are counted backwards from 139 the end of the sequence, e.g. a match from the 100th position to the 200th position on a reverse complement strand 140 of a sequence of length 500 will be shown as ``400 300`` in the file. 141 #. **Translation frames:** Positions given in the file are always on the original untranslated sequence! 142 143 The ``writeRecord()`` function automatically does all of these conversions! 144 145 146A **tabular** file could look like this (matches per query are sorted by e-value): 147 148.. literalinclude:: ../../../../tests/blast/defaultfields.m8 149 150The **tabular with comment lines** format additionally prefixes every block belonging to one query sequence with 151comment lines that include the program version, the database name and column labels. The above example would look 152like this: 153 154.. literalinclude:: ../../../../tests/blast/defaultfields.m9 155 :lines: 5-37 156 157As you can see, comment lines are also printed for query sequences which don't have any matches. 158The major 159difference of these formats in BLAST+ vs the legacy application are that the *mismatches* column used to 160include the number of gap characters, but it does not in BLAST+. The comments also look slightly different 161in the **tabular with comment lines (legacy)** format: 162 163.. literalinclude:: ../../../../tests/blast/defaultfields_legacy.m9 164 :lines: 5-35 165 166 167 168Pairwise format 169--------------- 170 171The pairwise format is the default format in Blast. It is more verbose than the tabular formats and very human readable. 172 173This is what the last record from above would look like (the other queries are omitted): 174 175.. literalinclude:: ../../../../demos/tutorial/blast_io/plus_sub.m0 176 177 178Blast formats in SeqAn 179---------------------- 180 181There are three blast format related tags in SeqAn: 182 183 #. :dox:`BlastReport` for the pairwise format with the :dox:`FormattedFile` output specialization 184 :dox:`BlastReportFileOut`. 185 #. :dox:`BlastTabular` for the tabular formats with the :dox:`FormattedFile` output and input specializations 186 :dox:`BlastTabularFileOut` and :dox:`BlastTabularFileIn`. 187 #. :dox:`BlastTabularLL` which provides light-weight, but very basic tabular IO. 188 189The third tag can be used for simple file manipulation, e.g. filtering or column rearrangement, it is not covered in 190this tutorial (see the dox for a simple example). 191 192To work with the first two formats you need to understand at least the following data structures: 193 * :dox:`BlastRecord`: the record covers all :dox:`BlastMatch` es belonging to one query sequence. 194 * :dox:`FormattedFile`: one of :dox:`BlastReportFileOut`, :dox:`BlastTabularFileOut` and :dox:`BlastTabularFileIn`. 195 * :dox:`BlastIOContext`: the context of the FormattedFile. 196 197The context contains file-global data like the name of the database and can also be used to read/write certain file 198format properties, e.g. "with comment lines" or "legacyFormat". 199 200.. caution:: 201 Due to the structure of blast tabular files lots of information is repeated in every block of comment lines, e.g. 202 the database name. Because it is expected that these stay the same they are saved in the context and not the record. 203 You may still, however, check every time you ``readRecord()`` if you want to make sure. 204 205File reading example 206-------------------- 207 208Only tabular formats are covered in this example, because no input support is available for the pairwise format. 209 210Copy the contents of the **tabular with comment lines** example above into a file and give it to the following 211program as the only parameter. Please use ``.m9`` as file type extension. 212 213.. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp 214 :language: c++ 215 :lines: 1-18, 74-84 216 217Assignment 1 218"""""""""""" 219 220.. container:: assignment 221 222 Objective 223 Complete the above example by reading the file according to :dox:`BlastTabularFileIn`. 224 For every record print the query ID, the number of contained matches and the bit-score of the best match. 225 226 Solution 227 Top 228 .. container:: foldable 229 230 .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp 231 :language: c++ 232 :lines: 1-18 233 234 New code 235 .. container:: foldable 236 237 .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp 238 :language: c++ 239 :lines: 19-34, 57-60 240 241 Bottom 242 .. container:: foldable 243 244 .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp 245 :language: c++ 246 :lines: 74-84 247 248Assignment 2 249"""""""""""" 250 251.. container:: assignment 252 253 Objective 254 Study the documentation of :dox:`BlastIOContext`. How can you adapt the previous program to check if there were any 255 problems reading a record? If you have come up with a solution, try to read the file at 256 ``tests/blast/defaultfields.m9``. What does the program print and why? 257 258 Solution 259 Top 260 .. container:: foldable 261 262 .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp 263 :language: c++ 264 :lines: 1-34 265 266 New code 267 .. container:: foldable 268 269 .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp 270 :language: c++ 271 :lines: 41-60 272 273 The program will print conformancyErrors for the last record, because there is a typo in the file 274 ( ``Datacase`` instead of ``Database`` ). 275 276 Bottom 277 .. container:: foldable 278 279 .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp 280 :language: c++ 281 :lines: 74-84 282 283Assignment 3 284"""""""""""" 285 286.. container:: assignment 287 288 Objective 289 Now that you have a basic understanding of :dox:`BlastIOContext`, also print the following information after 290 reading the records: 291 292 * file format (with comment lines or without, BLAST+ or legacy?) 293 * blast program and version 294 * name of database 295 296 Verify that the results are as expected on the files ``tests/blast/defaultfields.m8``, 297 ``tests/blast/defaultfields.m9`` and ``tests/blast/defaultfields_legacy.m9``. 298 299 Solution 300 Top 301 .. container:: foldable 302 303 .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp 304 :language: c++ 305 :lines: 1-34, 41-60 306 307 New code 308 .. container:: foldable 309 310 .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp 311 :language: c++ 312 :lines: 61-73 313 314 Bottom 315 .. container:: foldable 316 317 .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp 318 :language: c++ 319 :lines: 74-84 320 321Assignment 4 322"""""""""""" 323 324As previously mentioned, twelve columns are printed by default. 325This can be changed in BLAST+, also by means of the ``--outfmt`` parameter. 326A standards compliant **file with comment lines** and custom column composition can be read 327without further configuration in SeqAn. 328 329.. tip:: 330 Don't believe it? Look at ``tests/blast/customfields.m9``, as as you can see the bit score is in the 13th column 331 (instead of the twelfth). If you run your program on this file, it should still print the correct bit-scores! 332 333.. container:: assignment 334 335 Objective 336 Read :dox:`BlastIOContext` again focusing on :dox:`BlastIOContext::fields` and also read :dox:`BlastMatchField`. 337 Now adapt the previous program to print for every record the ``optionLabel`` of each field used. 338 339 Verify that the results are as expected on the files ``tests/blast/defaultfields.m9`` and 340 ``tests/blast/customfields.m9``. 341 342 Solution 343 Top 344 .. container:: foldable 345 346 .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp 347 :language: c++ 348 :lines: 1-34 349 350 New code 351 .. container:: foldable 352 353 .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp 354 :language: c++ 355 :lines: 35-40 356 357 Bottom 358 .. container:: foldable 359 360 .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp 361 :language: c++ 362 :lines: 41-84 363 364 If this was too easy, you can also try the same for tabular files without comment lines! 365 366File writing example 367-------------------- 368 369The following program stub creates three query sequences and two subject sequences in amino acid alphabet. 370We will later generate records with matches and print these to disk. 371 372.. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp 373 :language: c++ 374 :lines: 1-49, 105-114 375 376Assignment 5 377"""""""""""" 378 379.. container:: assignment 380 381 Objective 382 Before we can begin to align, certain properties of the :dox:`BlastIOContext` have to be set. 383 Which ones? Add a block with the necessary initialization to the above code, use blast defaults where possible. 384 Although not strictly required at this point: include a call to ``writeHeader()``. 385 386 .. caution:: 387 Alignment score computation works slightly different in Blast and in SeqAn, please have a look at 388 :dox:`BlastScoringScheme`. For the task at hand it should suffice to simply use the corresponding 389 ``set*()`` functions on :dox:`BlastIOContext::scoringScheme`. 390 391 Solution 392 Top 393 .. container:: foldable 394 395 .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp 396 :language: c++ 397 :lines: 1-49 398 399 New code 400 .. container:: foldable 401 402 .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp 403 :language: c++ 404 :lines: 50-64 405 406 Bottom 407 .. container:: foldable 408 409 .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp 410 :language: c++ 411 :lines: 106-116 412 413Assignment 6 414"""""""""""" 415 416.. container:: assignment 417 418 Objective 419 Next create a record for every query sequence, and in each record a match for every query-subject pair. 420 Compute the local alignment for each of those matches. Use the align member of the match object. 421 422 Solution 423 Top 424 .. container:: foldable 425 426 .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp 427 :language: c++ 428 :lines: 1-64 429 430 New Code 431 .. container:: foldable 432 433 .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp 434 :language: c++ 435 :lines: 65-82, 97-99, 103 436 437 Bottom 438 .. container:: foldable 439 440 .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp 441 :language: c++ 442 :lines: 106-116 443 444Assignment 7 445"""""""""""" 446 447.. container:: assignment 448 449 Objective 450 Now that you have the align member computed for every match, also save the begin and end positions, as well 451 as the lengths. Blast Output needs to now about the number of gaps, mismatches... of every match, how can 452 they be computed? 453 454 Solution 455 Top 456 .. container:: foldable 457 458 .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp 459 :language: c++ 460 :lines: 1-82 461 462 New Code 463 .. container:: foldable 464 465 .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp 466 :language: c++ 467 :lines: 83-92 468 469 Bottom 470 .. container:: foldable 471 472 .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp 473 :language: c++ 474 :lines: 97-99, 103, 106-116 475 476Assignment 8 477"""""""""""" 478 479.. container:: assignment 480 481 Objective 482 Finally add e-value statistics and print the results to a file: 483 * compute the bit score and e-value for every match 484 * discard matches with an e-value greater than 1 485 * for every record, sort the matches by bit-score 486 * write each record 487 * write the footer before exiting the program 488 489 Solution 490 Top 491 .. container:: foldable 492 493 .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp 494 :language: c++ 495 :lines: 1-92 496 497 New Code 498 .. container:: foldable 499 500 .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp 501 :language: c++ 502 :lines: 93-96 503 504 Bottom 505 .. container:: foldable 506 507 .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp 508 :language: c++ 509 :lines: 97-116 510 511 Your output file should look like this: 512 .. container:: foldable 513 514 .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.m9 515 516Assignment 9 517"""""""""""" 518 519.. container:: assignment 520 521 Objective 522 Up until now you have only printed the **tabular with comment lines** format. What do you have to do to print 523 without comment lines? In legacy format? What about the pairwise format? 524 525 Solution 526 Tabular without comment lines 527 .. container:: foldable 528 529 Add ``context(outfile).tabularSpec = BlastTabularSpec::NO_COMMENTS`` at l.53. 530 Remember to use ``.m8`` as file extension! 531 532 The result should look like this: 533 534 .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.m8 535 536 Tabular without comment lines (legacy) 537 .. container:: foldable 538 539 To print in legacy tabular format (with or without comment lines), add ``context(outfile).legacyFormat = true`` at l.53. 540 541 The result should look like this(legacy and NO_COMMENTS): 542 543 .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment_legacy.m8 544 545 Pairwise format 546 .. container:: foldable 547 548 To print in the pairwise format replace l.52 with ``BlastReportFileOut<TContext> outfile(argv[1]);``. 549 550 Remember to use ``.m0`` as file extension! 551 552 The result should look like this: 553 554 .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.m0 555