The \eslmod{msa} module reads and writes multiple sequence alignment
files. The API is summarized in Table~\ref{tbl:msa_api}.

The module uses two objects. An \ccode{ESL\_MSA} holds a multiple
sequence alignment. An \ccode{ESL\_MSAFILE} is an alignment file,
opened for input. No object is needed for output of an alignment file;
a normal C \ccode{FILE} stream is used for output.

MSAs can be handled in ``text mode'' or ``digital mode'', and
converted back and forth between modes.

Large MSA database files like Pfam or Rfam can be indexed with SSI,
allowing fast random access. The \ccode{esl\_msafile\_Open()} and
\ccode{esl\_msafile\_OpenDigital()} functions automatically open an
accompanying SSI index, if it is present.

% Table generated by autodoc -t esl_msa.c (so don't edit here, edit esl_msa.c:)
\begin{table}[hbp]
\begin{center}
{\small
\begin{tabular}{|ll|}\hline
\apisubhead{The ESL\_MSA object }\\
\hyperlink{func:esl_msa_Create()}{\ccode{esl\_msa\_Create()}} & Creates an \ccode{ESL\_MSA} object.\\
\hyperlink{func:esl_msa_CreateFromString()}{\ccode{esl\_msa\_CreateFromString()}} & Creates a small \ccode{ESL\_MSA} from a test case string.\\
\hyperlink{func:esl_msa_Expand()}{\ccode{esl\_msa\_Expand()}} & Reallocate for more sequences.\\
\hyperlink{func:esl_msa_Destroy()}{\ccode{esl\_msa\_Destroy()}} & Frees an \ccode{ESL\_MSA}.\\
\apisubhead{The ESL\_MSAFILE object }\\
\hyperlink{func:esl_msafile_Open()}{\ccode{eslx\_msafile\_Open()}} & Open an MSA file for input.\\
\hyperlink{func:esl_msafile_Close()}{\ccode{eslx\_msafile\_Close()}} & Closes an open MSA file.\\
\apisubhead{Digital mode MSAs }\\
\hyperlink{func:esl_msa_GuessAlphabet()}{\ccode{esl\_msa\_GuessAlphabet()}} & Guess alphabet of MSA.\\
\hyperlink{func:esl_msa_CreateDigital()}{\ccode{esl\_msa\_CreateDigital()}} & Create a digital \ccode{ESL\_MSA}.\\
\hyperlink{func:esl_msa_Digitize()}{\ccode{esl\_msa\_Digitize()}} & Digitizes an msa, converting it
from text mode.\\
\hyperlink{func:esl_msa_Textize()}{\ccode{esl\_msa\_Textize()}} & Convert a digital msa to text mode.\\
\hyperlink{func:esl_msafile_GuessAlphabet()}{\ccode{eslx\_msafile\_GuessAlphabet()}} & Guess what kind of sequences the alignment file contains.\\
\hyperlink{func:esl_msafile_OpenDigital()}{\ccode{eslx\_msafile\_OpenDigital()}} & Open an msa file for digital input.\\
\hyperlink{func:esl_msafile_SetDigital()}{\ccode{eslx\_msafile\_SetDigital()}} & Set an open \ccode{ESL\_MSAFILE} to read in digital mode.\\
\apisubhead{Random MSA database access}\\
\hyperlink{func:esl_msafile_PositionByKey()}{\ccode{eslx\_msafile\_PositionByKey()}} & Use SSI to reposition file to start of named MSA.\\
\apisubhead{General i/o API, all alignment formats }\\
%\hyperlink{func:esl_msa_Read()}{\ccode{esl\_msa\_Read()}} & Read next MSA from a file.\\
%\hyperlink{func:esl_msa_Write()}{\ccode{esl\_msa\_Write()}} & Write an MSA to a file.\\
%\hyperlink{func:esl_msa_GuessFileFormat()}{\ccode{esl\_msa\_GuessFileFormat()}} & Determine the format of an open MSA file.\\
\apisubhead{Miscellaneous functions for manipulating MSAs}\\
\hyperlink{func:esl_msa_SequenceSubset()}{\ccode{esl\_msa\_SequenceSubset()}} & Select subset of sequences into a smaller MSA.\\
\hyperlink{func:esl_msa_ColumnSubset()}{\ccode{esl\_msa\_ColumnSubset()}} & Remove a selected subset of columns from the MSA.\\
\hyperlink{func:esl_msa_MinimGaps()}{\ccode{esl\_msa\_MinimGaps()}} & Remove columns containing all gap symbols.\\
\hyperlink{func:esl_msa_NoGaps()}{\ccode{esl\_msa\_NoGaps()}} & Remove columns containing any gap symbol.\\
\hyperlink{func:esl_msa_SymConvert()}{\ccode{esl\_msa\_SymConvert()}} & Global search/replace of symbols in an MSA.\\
\hyperlink{func:esl_msa_AddComment()}{\ccode{esl\_msa\_AddComment()}} & Description.\\
\hyperlink{func:esl_msa_AddGF()}{\ccode{esl\_msa\_AddGF()}} & Description.\\
\hyperlink{func:esl_msa_AddGS()}{\ccode{esl\_msa\_AddGS()}}
& Description.\\
\hyperlink{func:esl_msa_AppendGC()}{\ccode{esl\_msa\_AppendGC()}} & Description.\\
\hyperlink{func:esl_msa_AppendGR()}{\ccode{esl\_msa\_AppendGR()}} & Description.\\
\hyperlink{func:esl_msa_Compare()}{\ccode{esl\_msa\_Compare()}} & Compare two MSAs for equality.\\
\hyperlink{func:esl_msa_CompareMandatory()}{\ccode{esl\_msa\_CompareMandatory()}} & Compare mandatory subset of MSA contents.\\
\hyperlink{func:esl_msa_CompareOptional()}{\ccode{esl\_msa\_CompareOptional()}} & Compare optional subset of MSA contents.\\
\hline
\end{tabular}
}
\end{center}
\caption{The \eslmod{msa} API.}
\label{tbl:msa_api}
\end{table}

\subsection{Example of using msa}

Here's an example of opening an MSA file and reading one or more
alignments from it:

%%\input{cexcerpts/msa_example}

Some things about the use of the API in the example are worth noting:

\begin{enumerate}
\item The format of the alignment file can either be automatically
      detected, or set by the caller when the file is opened.
      Autodetection is invoked when the caller passes a format code
      (here, \ccode{fmt}) of
      \ccode{eslMSAFILE\_UNKNOWN}. Autodetection is a ``best effort''
      guess, but it is not 100\% reliable, especially if the input
      file isn't an alignment file at all. So autodetection is a
      convenient default, but the caller will probably want to provide
      a way for the user to specify the input file format and override
      autodetection, just in case.

\item Errors can occur when opening or reading the file, and you
      must check for them. This error checking could be as simple as
      making sure that \ccode{esl\_msafile\_Open()} and
      \ccode{esl\_msa\_Read()} returned \ccode{eslOK}, but the example
      shows how to catch all the normal errors returned by these
      calls, and how to format some reasonably informative error
      messages for the user.
      For example, when parsing the file fails
      and \ccode{esl\_msa\_Read()} returns an \ccode{eslEFORMAT}
      error, information about the problem is stored in \ccode{afp}:
      the caller can use \ccode{afp->linenumber}, \ccode{afp->buf},
      and \ccode{afp->errbuf} to get the line number in the file at
      which the error occurred, the text that was on that line, and a
      short error message about what was wrong with it, respectively.

\item To output (write) an alignment, open a normal C \ccode{FILE}
      stream, write the alignment(s) with \ccode{esl\_msa\_Write()},
      and close the stream with C's \ccode{fclose()}. Here, the
      example is regurgitating the alignments it reads to
      \ccode{stdout}.

\end{enumerate}

\subsection{Accessing alignment data}

The information in the \ccode{ESL\_MSA} object is meant to be accessed
directly, so you need to know what it contains. This object is defined
and documented in \ccode{esl\_msa.h}. It contains various information,
as follows:

\subsubsection{Important/mandatory information}

The following information is always available in an MSA (except
digital-mode alignments, which replace \ccode{aseq[][]} with
\ccode{ax[][]}, as described later):

\input{cexcerpts/msa_mandatory}

The alignment contains \ccode{nseq} sequences, each of which contains
\ccode{alen} characters.

\ccode{aseq[i]} is the i'th aligned sequence, numbered
\ccode{0..nseq-1}.

\ccode{aseq[i][j]} is the j'th character in aligned sequence i,
numbered \ccode{0..alen-1}.

\ccode{sqname[i]} is the name of the i'th sequence.

\ccode{wgt[i]} is a non-negative real-valued weight for sequence
i. This defaults to 1.0 if the alignment file did not provide weight
data. You can determine whether weight data was parsed by checking
\ccode{(flags \& eslMSA\_HASWGTS)}.



\subsubsection{Optional information}

The following information is optional.
It is usually only provided by
annotated Stockholm alignments (for instance, Pfam and Rfam database
alignments):

\input{cexcerpts/msa_optional}

These should be self-explanatory; but for more information, see the
Stockholm format documentation. Each of these fields corresponds to
Stockholm markup.

These pointers will be NULL for any optional annotation that was not
present in the alignment file. This is true at any level; for
instance, \ccode{ss} will be NULL if no secondary structures are
available for any sequence, and \ccode{ss[i]} will be NULL if some
secondary structures are available, but not for sequence i.

The \ccode{cutoff} array contains Pfam/Rfam curated trusted, gathering
and noise score cutoffs. They are indexed as follows:

\input{cexcerpts/msa_cutoffs}



\subsubsection{Unparsed information}

The MSA object also stores additional ``unparsed'' information from
Stockholm files; that is, tags that are present but not recognized by
the MSA module. This information is stored so that it may be
regurgitated if the application needs to faithfully output the entire
alignment file, even the bits that it didn't understand. If you need
to access unparsed Stockholm tags, see the comments in
\ccode{esl\_msa.h}.



\subsubsection{Off-by-one issues in indexing alignment columns}

With one exception, all arrays over alignment columns are normal C
string arrays, indexed \ccode{0..alen-1}. This includes optional
information such as \ccode{msa->rf[]} (the reference annotation line)
and \ccode{msa->cs[]} (the consensus structure annotation line).

The exception is a digitized sequence alignment, \ccode{msa->ax[][]}
(see below), where columns are indexed \ccode{1..alen}, with sentinel
bytes at positions \ccode{0} and \ccode{alen+1}, following Easel's
convention for digitized sequences.

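The two indexing conventions can be put side by side in a minimal
sketch. It is self-contained and does not call Easel: the
\ccode{DSQ} type, the sentinel value 255, the gap code 4, the use of
\ccode{'.'} for an unmarked annotation column, and the helper
\ccode{count\_residues\_in\_rf\_columns()} are all assumptions made
for illustration only. The loop walks one digitized row over columns
\ccode{1..alen} and looks up the matching annotation character at
index \ccode{j-1}.

```c
#include <assert.h>

/* Illustration only: these names and values are assumptions made for
 * this sketch, not definitions taken from Easel. */
typedef unsigned char DSQ;
#define SKETCH_SENTINEL 255   /* marks positions 0 and alen+1       */
#define SKETCH_GAP        4   /* hypothetical digital code for a gap */

/* Count the residues of one digitized row that fall in annotated
 * columns.
 *   ax   : digitized row; columns 1..alen, sentinels at 0 and alen+1
 *   rf   : text annotation line; columns 0..alen-1, '.' = unmarked
 *   alen : number of alignment columns
 */
int
count_residues_in_rf_columns(const DSQ *ax, const char *rf, int alen)
{
  int n = 0;
  int j;

  /* The digital row must be framed by sentinel bytes. */
  assert(ax[0] == SKETCH_SENTINEL && ax[alen+1] == SKETCH_SENTINEL);

  for (j = 1; j <= alen; j++)       /* digital index runs 1..alen  */
    {
      char mark = rf[j-1];          /* text index runs 0..alen-1   */
      if (mark != '.' && ax[j] != SKETCH_GAP) n++;
    }
  return n;
}
```

Keeping the loop variable in the digital coordinate system and
subtracting one only at the point where the text array is read is one
way to confine the offset to a single expression.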

Thus, when your code is manipulating a digitized alignment and using
optional information like the reference annotation line or the
consensus structure line, you must be careful of the off-by-one
difference in how the two types of data are indexed.

\subsection{Accepted formats}

Currently, the MSA module only parses Stockholm format.

Stockholm format and other alignment formats are documented in a later
chapter.

\subsection{Digital versus text representation}

The multiple alignment can be stored either in text or digital mode.

A text-mode MSA stores ASCII text symbols in a 2D array \ccode{char **
  aseq[0..nseq-1][0..alen-1]}. These strings are stored exactly as
they appeared in the original file; they aren't converted to upper or
lower case, for example.

A digital-mode MSA is digitized in the Easel internal alphabet. This
enables more consistent, robust, and speedy handling of the sequence
data.

Text mode is the default behavior. An \ccode{ESL\_MSA} is in digital
mode if its \ccode{eslMSA\_DIGITAL} flag is set (\ccode{msa->flags \&
  eslMSA\_DIGITAL} is \ccode{TRUE}). When the alignment data are in
digital mode, they are stored internally as a 2D digital sequence
array \ccode{ESL\_DSQ ** ax[0..nseq-1][1..alen]}, and the \ccode{aseq}
field is \ccode{NULL}.

To use a digital internal representation, it is most efficient to read
directly as digital data, using an \ccode{esl\_msafile\_OpenDigital()}
call in place of \ccode{esl\_msafile\_Open()}. You can also change the
mode of an MSA from text to digital using
\ccode{esl\_msa\_Digitize()}, and digital to text using
\ccode{esl\_msa\_Textize()}.

Suppose you want to open an alignment file and read its alignments in
digital mode, but you don't know whether the file contains DNA or
protein alignments.
You can't use \ccode{esl\_msafile\_OpenDigital()}
unless you have an alphabet; but you can't see the alphabet until
you've read an alignment. \Easel provides
\ccode{esl\_msafile\_GuessAlphabet()} to peek at the first alignment
and guess its alphabet\footnote{Because the stream that alignments are
being read from may be non-rewindable, the implementation of
\ccode{esl\_msafile\_GuessAlphabet()} reads and caches the first
alignment.}, and \ccode{esl\_msafile\_SetDigital()} to set an
already-open \ccode{ESL\_MSAFILE} so that all subsequent alignments
are read in digital mode. For example:

%%\input{cexcerpts/msa_example2}


\subsection{Reading from stdin or gzip-compressed files}

The module can read compressed alignment files. If the
\ccode{filename} passed to \ccode{esl\_msafile\_Open()} ends in
\ccode{.gz}, the file is assumed to be compressed with gzip. Instead
of opening it normally, \ccode{esl\_msafile\_Open()} opens it as a pipe
through \ccode{gzip -dc}. This only works on a POSIX
system -- pipes have to work, specifically the \ccode{popen()}
call -- and \ccode{gzip} must be installed and in the PATH.

The module can also read from a standard input pipe. If the
\ccode{filename} passed to \ccode{esl\_msafile\_Open()} is \ccode{-},
the alignment is read from \ccode{stdin} rather than from a file.

Because of the way format autodetection works, you cannot use it when
reading from a pipe or compressed file. The application must know the
appropriate format and pass that code when it calls
\ccode{esl\_msafile\_Open()}.
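
An application can apply these rules before it tries to open
anything, for instance to reject a command line that asks for
autodetection on a pipe. The following is a minimal sketch of such a
check, not Easel's implementation: the \ccode{input\_kind} enum, the
\ccode{FORMAT\_UNKNOWN} stand-in for the autodetect code, and both
helper functions are hypothetical names introduced here, and the exact
suffix test is an assumption.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names for this sketch; they are not Easel identifiers. */
enum input_kind { INPUT_FILE, INPUT_STDIN, INPUT_GZIP_PIPE };

#define FORMAT_UNKNOWN 0   /* stand-in for "please autodetect" */

/* Decide how a filename would be opened under the rules in the text:
 * "-" means standard input, a ".gz" suffix means a pipe through
 * "gzip -dc", and anything else is a regular file. */
enum input_kind
classify_input(const char *filename)
{
  size_t n;

  assert(filename != NULL);
  if (strcmp(filename, "-") == 0) return INPUT_STDIN;

  n = strlen(filename);
  if (n > 3 && strcmp(filename + n - 3, ".gz") == 0) return INPUT_GZIP_PIPE;

  return INPUT_FILE;
}

/* Autodetection is only usable on regular files, so standard input
 * or a compressed file needs an explicit format code.
 * Returns 1 if the combination is acceptable, 0 if the caller must
 * supply a format. */
int
format_choice_is_valid(const char *filename, int format)
{
  if (format != FORMAT_UNKNOWN) return 1;
  return classify_input(filename) == INPUT_FILE;
}
```

A caller would run \ccode{format\_choice\_is\_valid()} on its
arguments first, and print a usage message asking for an explicit
format option when it returns 0.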