The \eslmod{msa} module reads and writes multiple sequence alignment
files. The API is summarized in Table~\ref{tbl:msa_api}.

The module uses two objects. An \ccode{ESL\_MSA} holds a multiple
sequence alignment. An \ccode{ESL\_MSAFILE} is an alignment file,
opened for input. No object is needed for output of an alignment file;
a normal C \ccode{FILE} stream is used for output.

MSAs can be handled in ``text mode'' or ``digital mode'', and
converted back and forth between modes.

Large MSA database files like Pfam or Rfam can be indexed with SSI,
allowing fast random access.  The \ccode{esl\_msafile\_Open()} and
\ccode{esl\_msafile\_OpenDigital()} functions automatically open an
accompanying SSI index, if it is present.

% Table generated by autodoc -t esl_msa.c (so don't edit here, edit esl_msa.c:)
\begin{table}[hbp]
\begin{center}
{\small
\begin{tabular}{|ll|}\hline
\apisubhead{The ESL\_MSA object                                           }\\
\hyperlink{func:esl_msa_Create()}{\ccode{esl\_msa\_Create()}} & Creates an \ccode{ESL\_MSA} object.\\
\hyperlink{func:esl_msa_CreateFromString()}{\ccode{esl\_msa\_CreateFromString()}} & Creates a small \ccode{ESL\_MSA} from a test case string.\\
\hyperlink{func:esl_msa_Expand()}{\ccode{esl\_msa\_Expand()}} & Reallocate for more sequences.\\
\hyperlink{func:esl_msa_Destroy()}{\ccode{esl\_msa\_Destroy()}} & Frees an \ccode{ESL\_MSA}.\\
\apisubhead{The ESL\_MSAFILE object                                       }\\
\hyperlink{func:esl_msafile_Open()}{\ccode{eslx\_msafile\_Open()}} & Open an MSA file for input.\\
\hyperlink{func:esl_msafile_Close()}{\ccode{eslx\_msafile\_Close()}} & Closes an open MSA file.\\
\apisubhead{Digital mode MSAs }\\
\hyperlink{func:esl_msa_GuessAlphabet()}{\ccode{esl\_msa\_GuessAlphabet()}} & Guess alphabet of MSA.\\
\hyperlink{func:esl_msa_CreateDigital()}{\ccode{esl\_msa\_CreateDigital()}} & Create a digital \ccode{ESL\_MSA}.\\
\hyperlink{func:esl_msa_Digitize()}{\ccode{esl\_msa\_Digitize()}} & Digitizes an msa, converting it from text mode.\\
\hyperlink{func:esl_msa_Textize()}{\ccode{esl\_msa\_Textize()}} & Convert a digital msa to text mode.\\
\hyperlink{func:esl_msafile_GuessAlphabet()}{\ccode{eslx\_msafile\_GuessAlphabet()}} & Guess what kind of sequences the alignment file contains.\\
\hyperlink{func:esl_msafile_OpenDigital()}{\ccode{eslx\_msafile\_OpenDigital()}} & Open an msa file for digital input.\\
\hyperlink{func:esl_msafile_SetDigital()}{\ccode{eslx\_msafile\_SetDigital()}} & Set an open \ccode{ESL\_MSAFILE} to read in digital mode.\\
\apisubhead{Random MSA database access}\\
\hyperlink{func:esl_msafile_PositionByKey()}{\ccode{eslx\_msafile\_PositionByKey()}} & Use SSI to reposition file to start of named MSA.\\
\apisubhead{General i/o API, all alignment formats                                 }\\
%\hyperlink{func:esl_msa_Read()}{\ccode{esl\_msa\_Read()}} & Read next MSA from a file.\\
%\hyperlink{func:esl_msa_Write()}{\ccode{esl\_msa\_Write()}} & Write an MSA to a file.\\
%\hyperlink{func:esl_msa_GuessFileFormat()}{\ccode{esl\_msa\_GuessFileFormat()}} & Determine the format of an open MSA file.\\
\apisubhead{Miscellaneous functions for manipulating MSAs}\\
\hyperlink{func:esl_msa_SequenceSubset()}{\ccode{esl\_msa\_SequenceSubset()}} & Select subset of sequences into a smaller MSA.\\
\hyperlink{func:esl_msa_ColumnSubset()}{\ccode{esl\_msa\_ColumnSubset()}} & Remove a selected subset of columns from the MSA.\\
\hyperlink{func:esl_msa_MinimGaps()}{\ccode{esl\_msa\_MinimGaps()}} & Remove columns containing all gap symbols.\\
\hyperlink{func:esl_msa_NoGaps()}{\ccode{esl\_msa\_NoGaps()}} & Remove columns containing any gap symbol.\\
\hyperlink{func:esl_msa_SymConvert()}{\ccode{esl\_msa\_SymConvert()}} & Global search/replace of symbols in an MSA.\\
\hyperlink{func:esl_msa_AddComment()}{\ccode{esl\_msa\_AddComment()}} & Description.\\
\hyperlink{func:esl_msa_AddGF()}{\ccode{esl\_msa\_AddGF()}} & Description.\\
\hyperlink{func:esl_msa_AddGS()}{\ccode{esl\_msa\_AddGS()}} & Description.\\
\hyperlink{func:esl_msa_AppendGC()}{\ccode{esl\_msa\_AppendGC()}} & Description.\\
\hyperlink{func:esl_msa_AppendGR()}{\ccode{esl\_msa\_AppendGR()}} & Description.\\
\hyperlink{func:esl_msa_Compare()}{\ccode{esl\_msa\_Compare()}} & Compare two MSAs for equality.\\
\hyperlink{func:esl_msa_CompareMandatory()}{\ccode{esl\_msa\_CompareMandatory()}} & Compare mandatory subset of MSA contents.\\
\hyperlink{func:esl_msa_CompareOptional()}{\ccode{esl\_msa\_CompareOptional()}} & Compare optional subset of MSA contents.\\
\hline
\end{tabular}
}
\end{center}
\caption{The \eslmod{msa} API.}
\label{tbl:msa_api}
\end{table}
\subsection{Example of using msa}

Here's an example of opening an MSA file and reading one or more
alignments from it:

%%\input{cexcerpts/msa_example}

Some things about the use of the API in the example are worth noting:

\begin{enumerate}
\item The format of the alignment file can either be automatically
      detected, or set by the caller when the file is opened.
      Autodetection is invoked when the caller passes a format code
      (here, \ccode{fmt}) of
      \ccode{eslMSAFILE\_UNKNOWN}. Autodetection is a ``best effort''
      guess, but it is not 100\% reliable, especially if the input
      file isn't an alignment file at all. So autodetection is a
      convenient default, but the caller will probably want to provide
      a way for the user to specify the input file format and override
      autodetection, just in case.

\item Errors that you must check for can occur either in opening or
      in reading the file. This error checking could be as simple as
      making sure that \ccode{esl\_msafile\_Open()} and
      \ccode{esl\_msa\_Read()} returned \ccode{eslOK}, but the example
      shows how to catch all the normal errors returned by these
      calls, and how to format some reasonably informative error
      messages for the user. For example, when parsing the file fails
      and \ccode{esl\_msa\_Read()} returns an \ccode{eslEFORMAT}
      error, information about the problem is stored in \ccode{afp}:
      the caller can use \ccode{afp->linenumber}, \ccode{afp->buf},
      and \ccode{afp->errbuf} to get the line number in the file where
      the error occurred, the text that was on that line, and a short
      error message about what was wrong with it, respectively.

\item To output (write) an alignment, open a normal C \ccode{FILE}
      stream, write the alignment(s) with \ccode{esl\_msa\_Write()},
      and close the stream with C's \ccode{fclose()}. Here, the
      example is regurgitating the alignments it reads to
      \ccode{stdout}.

\end{enumerate}
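
As a rough illustration of the error-reporting pattern in the second
item, the sketch below formats a user-facing message from a line
number, the offending line, and a short diagnostic. It does not call
Easel. The struct \ccode{mock\_msafile} is a hypothetical stand-in
whose three fields only mirror the \ccode{linenumber}, \ccode{buf},
and \ccode{errbuf} fields named above, and
\ccode{format\_parse\_error()} is a helper invented for this sketch.

```c
#include <stdio.h>

/* Hypothetical stand-in for the parser state; NOT the real ESL_MSAFILE.
 * Only the three fields mentioned in the text are modeled. */
struct mock_msafile {
  int         linenumber;  /* line where parsing failed        */
  const char *buf;         /* text of that line                */
  const char *errbuf;      /* short description of the problem */
};

/* Format a user-facing message into out[0..outlen-1].
 * Returns what snprintf() returns: the length the full message needs. */
int
format_parse_error(const struct mock_msafile *afp, const char *filename,
                   char *out, size_t outlen)
{
  return snprintf(out, outlen,
                  "Alignment file parse error, line %d of file %s:\n"
                  "%s\n"
                  "Offending line is:\n"
                  "%s\n",
                  afp->linenumber, filename, afp->errbuf, afp->buf);
}
```

A caller would typically build such a message only on the format-error
path and print it to \ccode{stderr} before exiting.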
\subsection{Accessing alignment data}

The information in the \ccode{ESL\_MSA} object is meant to be accessed
directly, so you need to know what it contains. This object is defined
and documented in \ccode{esl\_msa.h}. It contains various information,
as follows:

\subsubsection{Important/mandatory information}

The following information is always available in an MSA (except
digital-mode alignments, which replace \ccode{aseq[][]} with
\ccode{ax[][]}, as described later):

\input{cexcerpts/msa_mandatory}

The alignment contains \ccode{nseq} sequences, each of which contains
\ccode{alen} characters.

\ccode{aseq[i]} is the i'th aligned sequence, numbered
\ccode{0..nseq-1}.

\ccode{aseq[i][j]} is the j'th character in aligned sequence i,
numbered \ccode{0..alen-1}.

\ccode{sqname[i]} is the name of the i'th sequence.

\ccode{wgt[i]} is a non-negative real-valued weight for sequence
i. This defaults to 1.0 if the alignment file did not provide weight
data. You can determine whether weight data was parsed by checking
\ccode{(flags \& eslMSA\_HASWGTS)}.
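
The weight flag test is an ordinary bitmask check. The sketch below
shows the idea on a hypothetical stand-in struct,
\ccode{mock\_wgt\_msa}, with an illustrative flag value
\ccode{MOCK\_MSA\_HASWGTS}; neither is part of Easel, and the real
definitions are the ones in \ccode{esl\_msa.h}.

```c
/* Illustrative value only; the real eslMSA_HASWGTS is defined in esl_msa.h. */
#define MOCK_MSA_HASWGTS (1 << 0)

/* Hypothetical stand-in holding just the fields this sketch needs. */
struct mock_wgt_msa {
  int     nseq;   /* number of sequences                 */
  double *wgt;    /* wgt[0..nseq-1]; 1.0 each by default */
  int     flags;  /* bit flags                           */
};

/* Returns 1 if weights were parsed from the file, 0 if they are defaults. */
int
msa_has_parsed_weights(const struct mock_wgt_msa *msa)
{
  return (msa->flags & MOCK_MSA_HASWGTS) != 0;
}

/* Sum of sequence weights; equals nseq when every weight is the default 1.0. */
double
msa_total_weight(const struct mock_wgt_msa *msa)
{
  double sum = 0.0;
  int    i;

  for (i = 0; i < msa->nseq; i++) sum += msa->wgt[i];
  return sum;
}
```

Because default weights are all 1.0, code that sums or averages
weights behaves sensibly whether or not the flag is set; the flag only
matters when the caller needs to know where the weights came from.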
\subsubsection{Optional information}

The following information is optional. It is usually only provided by
annotated Stockholm alignments (for instance, Pfam and Rfam database
alignments):

\input{cexcerpts/msa_optional}

These should be self-explanatory; but for more information, see the
Stockholm format documentation. Each of these fields corresponds to
Stockholm markup.

These pointers will be NULL for any optional annotation that was not
present in the alignment file. This is true at any level; for
instance, \ccode{ss} will be NULL if no secondary structures are
available for any sequence, and \ccode{ss[i]} will be NULL if some
secondary structures are available, but not for sequence i.

The \ccode{cutoff} array contains Pfam/Rfam curated trusted, gathering
and noise score cutoffs. They are indexed as follows:

\input{cexcerpts/msa_cutoffs}
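
The two-level NULL convention for optional per-sequence annotation
means that both the array and the individual entry have to be checked
before use. A minimal sketch, using a hypothetical stand-in struct
\ccode{mock\_ss\_msa} and an invented accessor \ccode{msa\_get\_ss()}
rather than the real \ccode{ESL\_MSA}:

```c
#include <stddef.h>

/* Hypothetical stand-in; only the per-sequence secondary structure
 * annotation described in the text is modeled. */
struct mock_ss_msa {
  int    nseq;
  char **ss;     /* NULL, or ss[0..nseq-1] where each ss[i] may be NULL */
};

/* Return sequence i's secondary structure string, or NULL if unavailable.
 * Both levels are checked: the array itself, then the entry. */
const char *
msa_get_ss(const struct mock_ss_msa *msa, int i)
{
  if (msa->ss == NULL)         return NULL;  /* no sequence has ss annotation */
  if (i < 0 || i >= msa->nseq) return NULL;  /* index out of range            */
  return msa->ss[i];                         /* may itself be NULL            */
}
```

Since the object is meant to be accessed directly, the same two checks
can equally be written inline at the point of use.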
\subsubsection{Unparsed information}

The MSA object also stores additional ``unparsed'' information from
Stockholm files; that is, tags that are present but not recognized by
the MSA module. This information is stored so that it may be
regurgitated if the application needs to faithfully output the entire
alignment file, even the bits that it didn't understand. If you need
to access unparsed Stockholm tags, see the comments in
\ccode{esl\_msa.h}.
\subsubsection{Off-by-one issues in indexing alignment columns}

With one exception, all arrays over alignment columns are normal C
string arrays, indexed \ccode{0..alen-1}. This includes optional
information such as \ccode{msa->rf[]} (the reference annotation line)
and \ccode{msa->cs[]} (the consensus structure annotation line).

The exception is a digitized sequence alignment, \ccode{msa->ax[][]}
(see below), where columns are indexed \ccode{1..alen}, with sentinel
bytes at positions \ccode{0} and \ccode{alen+1}, following Easel's
convention for digitized sequences.

Thus, when your code is manipulating a digitized alignment and using
optional information like the reference annotation line or the
consensus structure line, you must be careful of the off-by-one
difference in how the two types of data are indexed.
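
The off-by-one difference can be made concrete with plain arrays. The
sketch below walks alignment columns \ccode{1..alen} of one digitized
sequence while consulting a \ccode{0..alen-1} annotation string. It is
not Easel code: the residue type, the gap code \ccode{MOCK\_GAP}, and
the choice of \ccode{'.'} to mark a non-reference column are
assumptions made for illustration.

```c
#include <stdint.h>

#define MOCK_GAP 4   /* assumed digital code for a gap in a toy alphabet */

/* Count non-gap residues of one digitized sequence that fall in
 * reference columns (columns whose rf character is not '.').
 *   dsq: digital sequence; residues at dsq[1..alen],
 *        sentinel bytes at dsq[0] and dsq[alen+1]
 *   rf:  annotation string; characters at rf[0..alen-1]
 */
int
count_residues_in_rf_columns(const uint8_t *dsq, const char *rf, int alen)
{
  int apos;    /* alignment column, 1..alen */
  int n = 0;

  for (apos = 1; apos <= alen; apos++)
    if (rf[apos-1] != '.' && dsq[apos] != MOCK_GAP)  /* -1 applies to rf only */
      n++;
  return n;
}
```

Keeping the loop variable in the \ccode{1..alen} coordinate system and
applying the \ccode{-1} only where a C string is indexed is one way to
keep the two conventions from getting mixed up.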
\subsection{Accepted formats}

Currently, the MSA module only parses Stockholm format.

Stockholm format and other alignment formats are documented in a later
chapter.
\subsection{Digital versus text representation}

The multiple alignment can be stored either in text or digital mode.

A text-mode MSA stores ASCII text symbols in a 2D array \ccode{char **
  aseq[0..nseq-1][0..alen-1]}. These strings are stored exactly as
they appeared in the original file; they aren't converted to upper or
lower case, for example.

A digital-mode MSA is digitized in the Easel internal alphabet. This
enables more consistent, robust, and speedy handling of the sequence
data.

Text mode is the default behavior. An \ccode{ESL\_MSA} is in digital
mode if its \ccode{eslMSA\_DIGITAL} flag is set (\ccode{msa->flags \&
  eslMSA\_DIGITAL} is \ccode{TRUE}). When the alignment data are in
digital mode, they are stored internally as a 2D digital sequence
array \ccode{ESL\_DSQ ** ax[0..nseq-1][1..alen]}, and the \ccode{aseq}
field is \ccode{NULL}.

To use a digital internal representation, it is most efficient to read
directly as digital data, using an \ccode{esl\_msafile\_OpenDigital()}
call in place of \ccode{esl\_msafile\_Open()}. You can also change the
mode of an MSA from text to digital using
\ccode{esl\_msa\_Digitize()}, and digital to text using
\ccode{esl\_msa\_Textize()}.

Suppose you want to open an alignment file and read its alignments in
digital mode, but you don't know whether the file contains DNA or
protein alignments. You can't use \ccode{esl\_msafile\_OpenDigital()}
unless you have an alphabet; but you can't see the alphabet until
you've read an alignment. \Easel provides
\ccode{esl\_msafile\_GuessAlphabet()} to peek at the first alignment
and guess its alphabet\footnote{Because the stream that alignments are
being read from may be non-rewindable, the implementation of
\ccode{esl\_msafile\_GuessAlphabet()} reads and caches the first
alignment.}, and \ccode{esl\_msafile\_SetDigital()} to set an
already-open \ccode{ESL\_MSAFILE} so that all subsequent alignments
are read in digital mode. For example:

%%\input{cexcerpts/msa_example2}
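
To give a feel for what guessing an alphabet from text-mode data
involves, here is a toy composition heuristic. It is explicitly not
the algorithm Easel uses and it does not call any Easel function: the
names \ccode{toy\_alphabet} and \ccode{toy\_guess\_alphabet()} and the
90\% threshold are all assumptions made for this sketch.

```c
#include <ctype.h>
#include <string.h>

enum toy_alphabet { TOY_UNKNOWN = 0, TOY_NUCLEIC = 1, TOY_AMINO = 2 };

/* Toy composition heuristic; NOT the algorithm Easel uses.
 * Counts alphabetic symbols in nseq text-mode aligned sequences and
 * calls the alignment nucleic if at least 90% of them are
 * A, C, G, T, U, or N (case-insensitive). */
enum toy_alphabet
toy_guess_alphabet(char **aseq, int nseq)
{
  long        total = 0;   /* alphabetic symbols seen     */
  long        nuc   = 0;   /* of those, nucleic-looking   */
  const char *p;
  int         i;

  for (i = 0; i < nseq; i++)
    for (p = aseq[i]; *p != '\0'; p++) {
      if (!isalpha((unsigned char) *p)) continue;   /* skip gap symbols */
      total++;
      if (strchr("ACGTUN", toupper((unsigned char) *p)) != NULL) nuc++;
    }

  if (total == 0)            return TOY_UNKNOWN;  /* nothing to judge from */
  if (nuc * 10 >= total * 9) return TOY_NUCLEIC;
  return TOY_AMINO;
}
```

Any heuristic of this kind can be fooled by short or unusual
alignments, which is one reason an application may still want to let
the user state the alphabet explicitly.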
\subsection{Reading from stdin or gzip-compressed files}

The module can read compressed alignment files.  If the
\ccode{filename} passed to \ccode{esl\_msafile\_Open()} ends in
\ccode{.gz}, the file is assumed to be compressed with gzip. Instead
of opening it normally, \ccode{esl\_msafile\_Open()} opens it as a pipe
through \ccode{gzip -dc}. Obviously this only works on a POSIX
system -- pipes have to work, specifically the \ccode{popen()}
call -- and \ccode{gzip} must be installed and in the PATH.

The module can also read from a standard input pipe. If the
\ccode{filename} passed to \ccode{esl\_msafile\_Open()} is \ccode{-},
the alignment is read from \ccode{STDIN} rather than from a file.

Because of the way format autodetection works, you cannot use it when
reading from a pipe or compressed file. The application must know the
appropriate format and pass that code when it calls
\ccode{esl\_msafile\_Open()}.
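
The filename rules in this section amount to a small dispatch on the
name itself. The sketch below restates them as code; it is not taken
from Easel, and \ccode{toy\_source}, \ccode{toy\_classify\_source()},
and \ccode{toy\_can\_autodetect\_format()} are names invented here.
Requiring at least one character before the \ccode{.gz} suffix is also
an assumption of the sketch.

```c
#include <string.h>

enum toy_source { TOY_SRC_FILE, TOY_SRC_STDIN, TOY_SRC_GZIP_PIPE };

/* Classify a filename using the rules in the text:
 *   "-"                  read from standard input
 *   something + ".gz"    read through a `gzip -dc` pipe
 *   anything else        open as a normal file
 */
enum toy_source
toy_classify_source(const char *filename)
{
  size_t n = strlen(filename);

  if (strcmp(filename, "-") == 0)                    return TOY_SRC_STDIN;
  if (n > 3 && strcmp(filename + n - 3, ".gz") == 0) return TOY_SRC_GZIP_PIPE;
  return TOY_SRC_FILE;
}

/* Format autodetection is only available for normal files,
 * not for stdin or for a decompression pipe. */
int
toy_can_autodetect_format(enum toy_source src)
{
  return src == TOY_SRC_FILE;
}
```

An application could use a check like the second function to reject an
unknown-format request early, with a clear message, when the input is
\ccode{-} or a \ccode{.gz} file.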