1.. sidebar:: ToC
2
3    .. contents::
4
5.. _tutorial-io-sam-bam-io:
6
7SAM and BAM I/O
8===============
9
10Learning Objective
11  In this tutorial, you will learn how to read and write SAM and BAM files.
12
13Difficulty
14  Average
15
16Duration
17  1 h (45 min if you know the SAM format)
18
19Prerequisites
20  :ref:`tutorial-datastructures-sequences`, :ref:`tutorial-io-input-output-overview`, `SAM Format Specification <http://samtools.sourceforge.net/SAM1.pdf>`_
21
22Overview
23--------
24
25.. warning::
26
27    Before you can read/write BAM files (bgzf compressed SAM files) you need to make sure that your program is linked against the zlib library.
28    When you build your application within the SeqAn build infrastructure, the zlib library is automatically located by looking at the standard places for the library.
29    Also have a look at :ref:`tutorial-io-input-output-overview-formatted-files` to read more about support of compressed file I/O.
30    If the macro ``SEQAN_HAS_ZLIB`` is set to ``0`` then reading/writing BAM file format is disabled.
31    It is set to ``1`` if the zlib could be found and reading/writing of compressed files is enabled automatically.
32    You can read :ref:`infra-use-cmake`, :ref:`infra-use-custom` and :ref:`infra-use-install-dependencies` for further notes about using the zlib and libbz2 in your build infrastructure.
33
34This tutorial shows how to read and write SAM and BAM files using the :dox:`BamFileIn` and :dox:`BamFileOut` classes.
35It starts out with a quick reminder on the structure of SAM (and also BAM) files and continues with how to read and write SAM/BAM files and access the tags of a record.
36
37.. important::
38
39    Note that this tutorial is targeted at readers that already know about the SAM format.
40    If you do not know about the SAM format yet, then this tutorial will be harder for your to understand.
41
42Both SAM and BAM files store multi-read alignments.
43Storing alignments of longer sequences such as contigs from assemblies is also possible, but less common.
44Here, we will focus on multi-read alignments.
45
46SAM files are text files, having one record per line.
47BAM files are just binary, compressed versions of SAM files that have a stricter organization and aim to be more efficiently usable by programs and computers.
48The nuts and bolts of the formats are described in the `SAM Format Specification <http://samtools.sourceforge.net/SAM1.pdf>`_.
49
50The SAM and BAM related I/O functionality in SeqAn focuses on allowing access to these formats in SeqAn with thin abstractions.
51The :ref:`tutorial-datastructures-store-fragment-store` Tutorial shows how to get a more high-level abstraction for multi-read alignments.
52
53.. important::
54
55    SAM/BAM I/O vs. Fragment Store
56
57    The :ref:`tutorial-datastructures-store-fragment-store` provides a high-level view of multi-read alignments.
58    This is very useful if you want to do SNP or small indel detection because you need to access the alignment of the reads around your candidate regions.
59    However, storing the whole alignment of a 120GB BAM file obviously is not a good idea.
60
61    The SAM/BAM I/O functionality in SeqAn is meant for sequentially reading through SAM and BAM files.
62    Jumping within BAM files using BAI indices is described in the `Using BAM Indices`_ section of this tutorial.
63
64
65SAM / BAM Format
66----------------
67
68The following shows an example of a SAM file.
69
70.. includefrags:: demos/tutorial/sam_and_bam_io/example.sam
71
72SAM files are TSV (tab-separated-values) files and begin with an optional header.
73The header consists of multiple lines, starting with an ``'@'`` character, each line is a record.
74Each record starts with its identifier and is followed by tab-separated tags.
75Each tag in the header consists of a two-character identifier, followed by ``':'``, followed by the value.
76
77If present, the ``@HD`` record must be the first record which specifies the SAM version (tag ``VN``) used in this file and the sort order (``SO``).
78The optional ``@SQ`` header records give the reference sequence names (tag ``SN``) and lengths (tag ``LN``).
79There also are other header record types.
80
81The optional header section is followed by the alignment records.
82The alignment records are again tab-separated.
83There are 11 mandatory columns.
84
85+-----------+-------------+--------------+-----------------+-------------------------------------------+
86| Col       | Field       | Type         | N/A Value       | Description                               |
87+===========+=============+==============+=================+===========================================+
88| 1         | QNAME       | string       | mandatory       | The query/read name.                      |
89+-----------+-------------+--------------+-----------------+-------------------------------------------+
90| 2         | FLAG        | int          | mandatory       | The record's flag.                        |
91+-----------+-------------+--------------+-----------------+-------------------------------------------+
92| 3         | RNAME       | string       | ``*``           | The reference name.                       |
93+-----------+-------------+--------------+-----------------+-------------------------------------------+
94| 4         | POS         | 32-bit int   | ``0``           | 1-based position on the reference.        |
95+-----------+-------------+--------------+-----------------+-------------------------------------------+
96| 5         | MAPQ        | 8-bit int    | ``255``         | The mapping quality.                      |
97+-----------+-------------+--------------+-----------------+-------------------------------------------+
98| 6         | CIGAR       | string       | ``*``           | The CIGAR string of the alignment.        |
99+-----------+-------------+--------------+-----------------+-------------------------------------------+
100| 7         | RNEXT       | string       | ``*``           | The reference of the next mate/segment.   |
101+-----------+-------------+--------------+-----------------+-------------------------------------------+
102| 8         | PNEXT       | string       | ``0``           | The position of the next mate/seqgment.   |
103+-----------+-------------+--------------+-----------------+-------------------------------------------+
104| 9         | TLEN        | string       | ``0``           | The observed length of the template.      |
105+-----------+-------------+--------------+-----------------+-------------------------------------------+
106| 10        | SEQ         | string       | ``*``           | The query/read sequence.                  |
107+-----------+-------------+--------------+-----------------+-------------------------------------------+
108| 11        | QUAL        | string       | ``*``           | The ASCII PHRED-encoded base qualities.   |
109+-----------+-------------+--------------+-----------------+-------------------------------------------+
110
111Notes:
112
113* The SAM standard talks about "queries".
114  In the context of read mapping, where the format originates, queries are reads.
115* The SAM standard talks about "templates" and "segments".
116  In the case of paired-end and mate-pair mapping the template consists of two segments, each is one read.
117  The template length is the insert size.
118* Paired-end reads are stored as two alignments records with the same QNAME.
119  The first and second mate are discriminated by the FLAG values.
120* When the FLAG indicates that SEQ is reverse-complemented, then QUAL is reversed.
121* Positions in the SAM file are 1-based.
122  When read into a :dox:`BamAlignmentRecord` (see below), the positions become 0-based.
123* The qualities must be stored as ASCII PHRED-encoded qualities.
124* The query and reference names must not contain whitespace.
125  It is common to trim query and reference ids at the first space.
126
127There are many ambiguities, recommendations, and some special cases in the formats that we do not describe here.
128We recommend that you follow this tutorial, start working with the SAM and BAM formats and later read the SAM specification "on demand" when you need it.
129
130The 11 mandatory columns are followed by an arbitrary number of optional tags.
131Tags have a two-character identifier followed by ``":${TYPE}:"``, followed by the tag's value.
132
133BAM files store their header as plain-text SAM headers.
134However, they additionally store the name and length information about the reference sequences.
135This information is mandatory since in BAM, the alignment records only contain the numeric ids of the reference sequences.
136Thus, the name is stored outside the record in the header.
137
138A First Working Example
139-----------------------
140
141The following program reads a file named ``example.sam`` and prints its contents back to the user on standard output.
142
143.. includefrags:: demos/tutorial/sam_and_bam_io/solution1.cpp
144
145.. includefrags:: demos/tutorial/sam_and_bam_io/solution1.cpp.stdout
146
147We instantiate a :dox:`BamFileIn` object for reading and a :dox:`BamFileOut` object for writing.
148First, we read the BAM header with :dox:`FormattedFileIn#readRecord` and we write it with :dox:`FormattedFileOut#writeRecord`.
149Then, we read each record from the input file and print it back on standard output.
150The alignment records are read into :dox:`BamAlignmentRecord` objects, which we will focus on below.
151
152Assignment 1
153""""""""""""
154
155.. container:: assignment
156
157   Type
158     Reproduction
159
160   Objective
161     Create a file with the sample SAM content from above and adjust the path ``"example.sam"`` to the path to your SAM file (e.g. ``"/path/to/my_example.sam"``).
162
163   Solution
164      .. container:: foldable
165
166         .. includefrags:: demos/tutorial/sam_and_bam_io/solution1.cpp
167
168
169Accessing the Header
170--------------------
171
172Sequence information (i.e. @SQ records) from the BAM header is stored in the :dox:`BamIOContext`.
173All remaining BAM header information is stored in the class :dox:`BamHeader`.
174
175.. important::
176   The header is not mandatory in SAM files and might be missing.
177
178The following program accesses the :dox:`BamIOContext` of its :dox:`BamFileIn` and prints the reference sequence names and lengths present in the BAM header.
179
180.. includefrags:: demos/tutorial/sam_and_bam_io/example2.cpp
181
182The output looks like this:
183
184.. includefrags:: demos/tutorial/sam_and_bam_io/example2.cpp.stdout
185
186Accessing the Records
187---------------------
188
189The class :dox:`BamAlignmentRecord` stores one alignment record of a SAM or BAM file.
190The class gives an in-memory representation that
191
192#. is independent of whether it comes from/goes to a SAM or BAM file,
193#. at the same time follows both formats closely,
194#. allows for efficient storage and usage in C++ and
195#. integrates well with the rest of the SeqAn library.
196
197The following definition gives an overview of the available fields, their types, and how they map to the SAM and BAM fields.
198Note that we use the :dox:`CigarElement` class to store entries in the CIGAR string.
199
200.. includefrags:: demos/tutorial/sam_and_bam_io/base.cpp
201      :fragment: bamRecord
202
203The static members ``INVALID_POS``, ``INVALID_REFID``, and ``INVALID_LEN`` store sentinel values for marking positions, reference sequence ids, and lengths as invalid or N/A.
204
205.. tip::
206   A :dox:`BamAlignmentRecord` is linked to a reference sequence by the field ``rID``.
207   The reference sequence information is stored in the BAM header and kept in the :dox:`BamIOContext`.
208   To easily access reference sequence name and and length relative to a given :dox:`BamAlignmentRecord` within a :dox:`BamFileIn`, use functions :dox:`BamAlignmentRecord#getContigName` and :dox:`BamAlignmentRecord#getContigLength`.
209
210An important related type is the enum :dox:`BamFlags` that provides constants for bit operations on the ``flag`` field.
211The functions :dox:`BamAlignmentRecord#hasFlagAllProper`, :dox:`BamAlignmentRecord#hasFlagDuplicate`, :dox:`BamAlignmentRecord#hasFlagFirst`, :dox:`BamAlignmentRecord#hasFlagLast`, :dox:`BamAlignmentRecord#hasFlagMultiple`, :dox:`BamAlignmentRecord#hasFlagNextRC`, :dox:`BamAlignmentRecord#hasFlagNextUnmapped`, :dox:`BamAlignmentRecord#hasFlagQCNoPass`, :dox:`BamAlignmentRecord#hasFlagRC`, :dox:`BamAlignmentRecord#hasFlagSecondary`, :dox:`BamAlignmentRecord#hasFlagUnmapped`, and :dox:`BamAlignmentRecord#hasFlagSupplementary` allow for easy reading of flags.
212
213
214
215Assignment 2
216""""""""""""
217
218.. container:: assignment
219
220   Counting Records
221
222   Type
223     Review
224
225   Objective
226     Count the number of unmapped reads.
227
228   Hints
229     Use the function :dox:`BamAlignmentRecord#hasFlagUnmapped`.
230
231   Solution
232     .. container:: foldable
233
234        .. includefrags:: demos/tutorial/sam_and_bam_io/solution2.cpp
235
236        .. includefrags:: demos/tutorial/sam_and_bam_io/solution2.cpp.stdout
237
238
239Accessing the Records' Tags
240---------------------------
241
242You can use the :dox:`BamTagsDict` class to access the the tag list of a record in a dictionary-like fashion.
243This class also performs the necessary casting when reading and writing tag list entries.
244
245:dox:`BamTagsDict` acts as a wrapper around the raw ``tags`` member of a :dox:`BamAlignmentRecord`, which is of type :dox:`CharString`:
246
247.. includefrags:: demos/tutorial/sam_and_bam_io/base.cpp
248      :fragment: BamTagsDict
249
250We can add a tag using the function :dox:`BamTagsDict#setTagValue`.
251When setting an already existing tag's value, its value will be overwritten.
252Note that in the following, we give the tags value in SAM format because it is easier to read, although they are stored in BAM format internally.
253
254.. includefrags:: demos/tutorial/sam_and_bam_io/base.cpp
255      :fragment: addTag
256
257The first parameter to :dox:`BamTagsDict#setTagValue` is the :dox:`BamTagsDict`, the second one is a two-character string with the key, and the third one is the value.
258Note that the type of tag entry will be taken automatically from the type of the third parameter.
259
260Reading values is slightly more complex because we have to handle the case that the value is not present.
261First, we get the index of the tag in the tag list.
262
263.. includefrags:: demos/tutorial/sam_and_bam_io/base.cpp
264      :fragment: getIndex
265
266Then, we can read the value from the :dox:`BamTagsDict` using the function :dox:`BamTagsDict#extractTagValue`.
267
268.. includefrags:: demos/tutorial/sam_and_bam_io/base.cpp
269      :fragment: extractValue
270
271The function returns a ``bool`` that is ``true`` on success and ``false`` otherwise.
272The extraction can fail if the index is out of bounds or the value in the dictionary cannot be cast to the type of the first parameter.
273
274The value in the tags dictionary will be casted to the type of the first parameter of :dox:`BamTagsDict#extractTagValue`:
275
276.. includefrags:: demos/tutorial/sam_and_bam_io/base.cpp
277      :fragment: cast
278
279Assignment 3
280""""""""""""
281
282.. container:: assignment
283
284   Reading Tags
285
286   Type
287     Review
288
289   Objective
290     Modify the solution of Assignment 2 to count the number of records having the ``"XX"`` tag.
291
292   Solution
293     .. container:: foldable
294
295        .. includefrags:: demos/tutorial/sam_and_bam_io/solution3.cpp
296
297        .. includefrags:: demos/tutorial/sam_and_bam_io/solution3.cpp.stdout
298
299Using BAM Indices
300-----------------
301
302SeqAn also contains features for reading BAM indices with the format ``.bai``. These indices can be built using the ``samtools index`` command. In the near future we plan to support building the bam index with SeqAn as well.
303
304You can read indices into a :dox:`BaiBamIndex` object with the function :dox:`BamIndex#open`. Then, you can use the function :dox:`BamFileIn#jumpToRegion` to jump to a specific position within BAM files. After jumping, the next record to be read is before the given region. Therefore, you have to skip records until you access the one you are looking for.
305
306.. includefrags:: demos/tutorial/sam_and_bam_io/example7.cpp
307
308.. includefrags:: demos/tutorial/sam_and_bam_io/example7.cpp.stdout
309
310Next Steps
311----------
312
313* Read the `SAM Format Specification <http://samtools.sourceforge.net/SAM1.pdf>`_.
314* Continue with the :ref:`tutorial`.
315