1\chapter{Input files and formats}
2\label{chapter:formats}
3\setcounter{footnote}{0}
4
5\section{Reading from files, compressed files, and pipes}
6
7Generally, HMMER programs read their sequence and/or profile input
8from files. Unix power users often find it convenient to string an
9incantation of commands together with pipes.\sidenote{Indeed, such
10  wizardly incantations are a point of pride.} For example, you might
11extract a subset of query sequences from a larger file using a
12one-liner combination of scripting commands (python, perl, awk,
13whatever). To facilitate the use of HMMER programs in such
14incantations, you can almost always use an argument of '-' (dash) in
15place of a filename, and the program will take its input from a
16standard input pipe instead of opening a file.
17
18For example, the following three commands are equivalent, and give
19essentially identical output:
20
21   \vspace{1ex}
22   \user{\% hmmsearch globins4.hmm uniprot\_sprot.fasta}  \\
23   \user{\% cat globins4.hmm | hmmsearch - uniprot\_sprot.fasta} \\
24   \user{\% cat uniprot\_sprot.fasta | hmmsearch globins4.hmm - } \\
25   \vspace{1ex}
26
27Most Easel ``miniapp'' programs share the same pipe-reading ability.
28
29Because the programs for profile HMM fetching (\mono{hmmfetch}) and
30sequence fetching (\mono{esl-sfetch}) can fetch any number of profiles
31or sequences by names/accessions given in a list, \emph{and} these
32programs can also read these lists from a stdin pipe, you can craft
33incantations that generate subsets of queries or targets on the
34fly. For example:
35
36   \vspace{1ex}
37   \begin{fullwidth}
38   \user{\indent \% esl-sfetch --index uniprot\_sprot.fasta} \\
39   \user{\% cat mytargs.list | esl-sfetch -f uniprot\_sprot.fasta - | hmmsearch globins4.hmm -} \\
40   \end{fullwidth}
41   \vspace{1ex}
42
43This takes a list of sequence names/accessions in
44\mono{mytargets.list}, fetches them one by one from UniProt (note that
45we index the UniProt file first, for fast retrieval; and note that
46\mono{esl-sfetch} is reading its \mono{<namefile>} list of
47names/accessions through a pipe using the '-' argument), and pipes
48them to an \mono{hmmsearch}. It should be obvious from this that we
49can replace the \mono{cat mytargets.list} with \emph{any} incantation
50that generates a list of sequence names/accessions (including SQL
51database queries).
52
53Ditto for piping subsets of profiles. Supposing you have a copy of Pfam in Pfam-A.hmm:
54
55   \vspace{1ex}
56   \begin{fullwidth}
57   \user{\indent \% hmmfetch --index Pfam-A.hmm} \\
58   \user{\% cat myqueries.list | hmmfetch -f Pfam.hmm - | hmmsearch - uniprot\_sprot.fasta}\\
59   \end{fullwidth}
60   \vspace{1ex}
61
62This takes a list of query profile names/accessions in
63\mono{myqueries.list}, fetches them one by one from Pfam, and does an
64hmmsearch with each of them against UniProt. As above, the \mono{cat
65  myqueries.list} part can be replaced by any suitable incantation
66that generates a list of profile names/accessions.
67
68There are three kinds of cases where using '-' is restricted or
69doesn't work. A fairly obvious restriction is that you can only use
70one '-' per command; you can't do a \mono{hmmsearch - -} that tries to
71read both profile queries and sequence targets through the same stdin
72pipe. Second, another case is when an input file must be obligately
73associated with additional, separately generated auxiliary files, so
74reading data from a single stream using '-' doesn't work because the
75auxiliary files aren't present (in this case, using '-' will be
76prohibited by the program). An example is \mono{hmmscan}, which needs
77its \mono{<hmmfile>} argument to be associated with four auxiliary
78files named \mono{<hmmfile>.h3\{mifp\}} that \mono{hmmpress} creates,
79so \mono{hmmscan} does not permit a '-' for its \mono{<hmmfile>}
80argument. Finally, when a command would require multiple passes over
81an input file, the command will generally abort after the first pass
82if you are trying to read that file through a standard input pipe
83(pipes are nonrewindable in general; a few HMMER or Easel programs
84will buffer input streams to make multiple passes possible, but this
85is not usually the case). An example would be trying to search a file
86containing multiple profile queries against a streamed target
87database:
88
89   \vspace{1ex}
90   \begin{fullwidth}
91   \user{\indent \% cat myqueries.list | hmmfetch -f Pfam.hmm > many.hmms}\\
92   \user{\% cat mytargets.list | esl-sfetch -f uniprot\_sprot.fasta - | hmmsearch many.hmms -}\\
93   \end{fullwidth}
94   \vspace{1ex}
95
96This will fail. Unfortunately the above business about how it will
97``generally abort after the first pass'' means it fails weirdly. The
98first query profile search will succeed, and its output will appear;
99then an error message will be generated when \mono{hmmsearch} sees the
100\emph{second} profile query and oops, suddenly realizes it is unable
101to rewind the target sequence database stream. This is inherent in how
102it reads the profile HMM query file sequentially as a stream (which is
103what's allowing it to read input from stdin pipes in the first place),
104one model at a time: it doesn't see there's more than one query model
105in the file until it gets to the second model.
106
107This case isn't too restricting because the same end goal can be
108achieved by reordering the commands. In cases where you want to do
109multiple queries against multiple targets, you always want to be
110reading the \emph{queries} from a stdin pipe, not the targets:
111
112   \vspace{1ex}
113   \user{\% cat mytargets.list | esl-sfetch -f uniprot\_sprot.fasta > mytarget.seqs}\\
114   \user{\% cat myqueries.list | hmmfetch -f Pfam.hmm - |  hmmsearch - mytarget.seqs}\\
115   \vspace{1ex}
116
117So in this multiple queries/multiple targets case of using stdin
118pipes, you just have to know, for any given program, which file it
119considers to be queries and which it considers to be targets. (That
120is, the logic in searching many queries against many targets is ``For
121each query: search the target database; then rewind the target
122database to the beginning.'') For \mono{hmmsearch}, the profiles are
123queries and sequences are targets. For \mono{hmmscan}, the reverse.
124
125In general, HMMER and Easel programs document in their man page
126whether (and which) command line arguments can be replaced by '-'.
127You can always check by trial and error, too. The worst that can
128happen is a ``Failed to open file -'' error message, if the program
129can't read from pipes.
130
131\subsection{.gz compressed files}
132
133In general, HMMER programs and Easel miniapps can also read \mono{.gz}
134compressed files; they will uncompress them on the fly. You need to
135have \mono{gunzip} installed on your system for this to work.
136
137
138
139
140%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
141\newpage
142\section{HMMER profile HMM files}
143\label{section:savefiles}
144
145A HMMER profile file looks like this, with ...'s marking elisions made
146for clarity and space:\sidenote{This is the \mono{globins4.hmm}
147  profile from the tutorial.}
148
149   \xsreoutput{inclusions/hmmbuild-globins.out2}
150
151A profile file consists of one or more profiles.  Each profile starts
152with a format version identifier (here, \mono{HMMER3/\HMMERfmtversion{}}) and ends with
153\mono{//} on a line by itself.  The format version identifier allows
154backward compatibility as the HMMER software evolves: it tells the
155parser this file is from HMMER3's save file format version
156\HMMERfmtversion{}.\footnote{HMMER 3.0 used 3/b format. HMMER 3.1 and 3.2 use 3/f
157  format.  Some alpha test versions of 3.0 used 3/a format. Internal
158  development versions of 3.1 used 3/c, 3/d, and 3/e formats.}  The
159closing \mono{//} allows multiple profiles to be concatenated.
160
161The format is divided into two regions. The first region contains
162textual information and miscalleneous parameters in a roughly
163tag-value scheme.  This section ends with a line beginning with the
164keyword \mono{HMM}. The second region is a tabular, whitespace-limited
165format for the main model parameters.
166
167All probability parameters are all stored as negative natural log
168probabilities with five digits of precision to the right of the
169decimal point, rounded. For example, a probability of $0.25$ is stored
170as $-\log 0.25 = 1.38629$. The special case of a zero probability is
171stored as '*'.
172
173Spacing is arranged for human readability, but the parser only cares
174that fields are separated by at least one space character.
175
176A more detailed description of the format follows.
177
178\subsection{header section}
179
180The header section is parsed line by line in a tag/value format. Each
181line type is either \textbf{mandatory} or \textbf{optional} as
182indicated.
183
184\begin{sreitems}{\monob{header}}
185
186\item [\monob{HMMER3/\HMMERfmtversion{}}] Unique identifier for the save file format
187  version; the \mono{/\HMMERfmtversion{}} means that this is HMMER3 profile file format
188  version \HMMERfmtversion{}. When HMMER3 changes its save file format, the revision
189  code advances. This way, parsers can be backwards
190  compatible. The remainder of the line after the \mono{HMMER3/\HMMERfmtversion{}} tag
191  is free text that is ignored by the parser. HMMER currently writes
192  its version number and release date in brackets here,
193  e.g. \mono{\HMMERsavestamp{}} in this
194  example. \textbf{Mandatory.}
195
196\item [\monob{NAME <s>}] Model name; \mono{<s>} is a single word
197containing no spaces or tabs. The name is normally picked up from the
198\mono{\#=GF ID} line from a Stockholm alignment file.  If this is not
199present, the name is created from the name of the alignment file by
200removing any file type suffix. For example, an otherwise nameless HMM
201built from the alignment file \mono{rrm.slx} would be named
202\mono{rrm}.  \textbf{Mandatory.}
203
204\item [\monob{ACC <s>}] Accession number; \mono{<s>} is a one-word
205accession number. This is picked up from the \mono{\#=GF AC} line in a
206Stockholm format alignment. \textbf{Optional.}
207
208\item [\monob{DESC <s>}] Description line; \mono{<s>} is a one-line
209free text description. This is picked up from the \mono{\#=GF DE} line
210in a Stockholm alignment file. \textbf{Optional.}
211
212\item [\monob{LENG <d>}] Model length; \mono{<d>}, a positive nonzero
213integer, is the number of match states in the model.
214\textbf{Mandatory.}
215
216\item [\monob{MAXL <d>}] Max instance length; \mono{<d>}, a positive
217nonzero integer, is the upper bound on the length at which and instance
218of the model is expected to be found. Used only by nhmmer and nhmmscan.
219\textbf{Optional.}
220
221\item [\monob{ALPH <s>}] Symbol alphabet type. For biosequence
222analysis models, \mono{<s>} is \mono{amino}, \mono{DNA}, or \mono{RNA}
223(case insensitive). There are also other accepted alphabets for
224purposes beyond biosequence analysis, including \mono{coins},
225\mono{dice}, and \mono{custom}. This determines the symbol alphabet
226and the size of the symbol emission probability distributions.  If
227\mono{amino}, the alphabet size $K$ is set to 20 and the symbol
228alphabet to ``ACDEFGHIKLMNPQRSTVWY'' (alphabetic order); if
229\mono{DNA}, the alphabet size $K$ is set to 4 and the symbol alphabet
230to ``ACGT''; if \mono{RNA}, the alphabet size $K$ is set to 4 and the
231symbol alphabet to ``ACGU''. \textbf{Mandatory.}
232
233\item [\monob{RF <s>}] Reference annotation flag; \mono{<s>} is
234either \mono{no} or \mono{yes} (case insensitive). If \mono{yes}, the
235reference annotation character field for each match state in the main
236model (see below) is valid; if \mono{no}, these characters are
237ignored.  Reference column annotation is picked up from a Stockholm
238alignment file's \mono{\#=GC RF} line. It is propagated to alignment
239outputs, and also may optionally be used to define consensus match
240columns in profile HMM construction. \textbf{Optional}; assumed to be
241no if not present.
242
243\item [\monob{MM <s>}] Model masked flag; \mono{<s>} is
244either \mono{no} or \mono{yes} (case insensitive). If \mono{yes}, the
245model mask annotation character field for each match state in the main
246model (see below) is valid; if \mono{no}, these characters are
247ignored. Indicates that the profile model was created such that
248emission probabilities at masked positions are set to match the
249background frequency, rather than being set based on observed frequencies
250in the alignment. Position-specific insertion and deletion rates are not
251altered, even in masked regions. \textbf{Optional}; assumed to be
252no if not present.
253
254\item [\monob{CONS <s>}] Consensus residue annotation flag;
255  \mono{<s>} is either \mono{no} or \mono{yes} (case insensitive).  If
256  \mono{yes}, the consensus residue field for each match state in the
257  main model (see below) is valid. If \mono{no}, these characters are
258  ignored. Consensus residue annotation is determined when models are
259  built. For models of single sequences, the consensus is the same as
260  the query sequence. For models of multiple alignments, the consensus
261  is the maximum likelihood residue at each position. Upper case
262  indicates that the model's emission probability for the consensus
263  residue is $\geq$ an arbitrary threshold (0.5 for protein models,
264  0.9 for DNA/RNA models).
265
266\item [\monob{CS <s>}] Consensus structure annotation flag;
267\mono{<s>} is either \mono{no} or \mono{yes} (case insensitive). If
268\mono{yes}, the consensus structure character field for each match
269state in the main model (see below) is valid; if \mono{no} these
270characters are ignored. Consensus structure annotation is picked up
271from a Stockholm file's \mono{\#=GC SS\_cons} line, and propagated to
272alignment displays.  \textbf{Optional}; assumed to be no if not
273present.
274
275\item [\monob{MAP <s>}] Map annotation flag; \mono{<s>} is either
276\mono{no} or \mono{yes} (case insensitive).  If set to \mono{yes}, the
277map annotation field in the main model (see below) is valid; if
278\mono{no}, that field will be ignored.  The HMM/alignment map
279annotates each match state with the index of the alignment column from
280which it came. It can be used for quickly mapping any subsequent
281HMM alignment back to the original multiple alignment, via the model.
282\textbf{Optional}; assumed to be no if not present.
283
284\item [\monob{DATE <s>}] Date the model was constructed; \mono{<s>}
285is a free text date string.  This field is only used for logging
286purposes.\footnote{HMMER does not use dates for any purpose other than
287human-readable annotation, so it is no more prone than you are to Y2K,
288Y2038, or any other date-related eschatology.} \textbf{Optional.}
289
290\item [\monob{COM [<n>] <s>}] Command line log; \mono{<n>} counts
291command line numbers, and \mono{<s>} is a one-line command. There may
292be more than one \mono{COM} line per save file, each numbered starting
293from $n=1$. These lines record every HMMER command that modified the
294save file. This helps us reproducibly and automatically log how Pfam
295models have been constructed, for example. \textbf{Optional.}
296
297\item [\monob{NSEQ  <d>}] Sequence number; \mono{<d>} is a nonzero
298positive integer, the number of sequences that the HMM was trained on.
299This field is only used for logging purposes.
300\textbf{Optional.}
301
302\item [\monob{EFFN <f>}] Effective sequence number; \mono{<f>} is a
303nonzero positive real, the effective total number of sequences
304determined by \mono{hmmbuild} during sequence weighting, for combining
305observed counts with Dirichlet prior information in parameterizing the
306model. This field is only used for logging purposes.
307\textbf{Optional.}
308
309\item [\monob{CKSUM <d>}] Training alignment checksum; \mono{<d>} is
310  a nonnegative unsigned 32-bit integer. This number is calculated
311  from the training sequence data, and used in conjunction with the
312  alignment map information to verify that a given alignment is indeed
313  the alignment that the map is for. \textbf{Optional.}
314
315\item [\monob{GA    <f> <f>}] Pfam gathering thresholds GA1 and GA2.
316See Pfam documentation of GA lines. \textbf{Optional.}
317
318\item [\monob{TC <f> <f>}] Pfam trusted cutoffs TC1 and TC2.  See
319Pfam documentation of TC lines. \textbf{Optional.}
320
321\item [\monob{NC <f> <f>}] Pfam noise cutoffs NC1 and NC2.  See Pfam
322documentation of NC lines. \textbf{Optional.}
323
324\item [\monob{STATS <s1> <s2> <f1> <f2>}] Statistical parameters
325  needed for E-value calculations. \mono{<s1>} is the model's
326  alignment mode configuration: currently only \mono{LOCAL} is
327  recognized. \mono{<s2>} is the name of the score distribution:
328  currently \mono{MSV}, \mono{VITERBI}, and \mono{FORWARD} are
329  recognized.  \mono{<f1>} and \mono{<f2>} are two real-valued
330  parameters controlling location and slope of each distribution,
331  respectively; $\mu$ and $\lambda$ for Gumbel distributions for MSV
332  and Viterbi scores, and $\tau$ and $\lambda$ for exponential tails
333  for Forward scores.  $\lambda$ values must be positive.  All three
334  lines or none of them must be present: when all three are present,
335  the model is considered to be calibrated for E-value
336  statistics. \textbf{Optional.}
337
338\item [\monob{HMM }] Flags the start of the main model
339section. Solely for human readability of the tabular model data, the
340symbol alphabet is shown on the \mono{HMM} line, aligned to the fields
341of the match and insert symbol emission distributions in the main
342model below. The next line is also for human readability, providing
343column headers for the state transition probability fields in the main
344model section that follows. Though unparsed after the \mono{HMM} tag,
345the presence of two header lines is \textbf{mandatory:} the parser
346always skips the line after the \mono{HMM} tag line.
347
348\item [\monob{COMPO <f>*K}] The first line in the main model section
349may be an optional line starting with \monob{COMPO}: these are the
350model's overall average match state emission probabilities, which are
351used as a background residue composition in the ``filter null''
352model. The $K$ fields on this line are log probabilities for each
353residue in the appropriate biosequence alphabet's
354order. \textbf{Optional.}
355
356\end{sreitems}
357
358\subsection{main model section}
359
360All the remaining fields are \textbf{mandatory}.
361
362The first two lines in the main model section are
363atypical.\footnote{That is, the first two lines after the optional
364  \mono{COMPO} line. Don't be confused by the presence of an optional \mono{COMPO}
365  line here. The \mono{COMPO} line is placed in the model section, below the
366  residue column headers, because it's an array of numbers much like
367  residue scores, but it's not really part of the model.}  They
368contain information for the core model's BEGIN node. This is stored as
369model node 0, and match state 0 is treated as the BEGIN state.  The
370begin state is mute, so there are no match emission probabilities. The
371first line is the insert 0 emissions. The second line contains the
372transitions from the begin state and insert state 0.  These seven
373numbers are: $B \rightarrow M_1$, $B \rightarrow I_0$, $B \rightarrow
374D_1$; $I_0 \rightarrow M_1$, $I_0 \rightarrow I_0$; then a 0.0 and a
375'*', because by convention, nonexistent transitions from the
376nonexistent delete state 0 are set to $\log 1 = 0$ and $\log 0 =
377-\infty = $ `*'.
378
379The remainder of the model has three lines per node, for $M$ nodes
380(where $M$ is the number of match states, as given by the \mono{LENG}
381line). These three lines are ($K$ is the alphabet size in residues):
382
383\begin{sreitems}{\textbf{State transition line}}
384
385\item [\textbf{Match emission line}] The first field is the node
386number ($1 \ldots M$).  The parser verifies this number as a
387consistency check (it expects the nodes to come in order). The next
388$K$ numbers for match emissions, one per symbol, in alphabetic order.
389
390The next field is the \mono{MAP} annotation for this node.  If
391\mono{MAP} was \mono{yes} in the header, then this is an integer,
392representing the alignment column index for this match state
393(1..alen); otherwise, this field is `-'.
394
395The next field is the \mono{CONS} consensus residue for this node.  If
396\mono{CONS} was \mono{yes} in the header, then this is a single
397character, representing the consensus residue annotation for this
398match state; otherwise, this field is `-'.
399
400The next field is the \mono{RF} annotation for this node.  If
401\mono{RF} was \mono{yes} in the header, then this is a single
402character, representing the reference annotation for this match state;
403otherwise, this field is `-'.
404
405The next field is the \mono{MM} mask value for this node.  If
406\mono{MM} was \mono{yes} in the header, then this is a single 'm'
407character, indicating that the position was identified as a masked
408position during model construction; otherwise, this field is `-'.
409
410The next field is the \mono{CS} annotation for this node.  If
411\mono{CS} was \mono{yes}, then this is a single character,
412representing the consensus structure at this match state; otherwise
413this field is `-'.
414
415\item [\textbf{Insert emission line}] The $K$ fields on this line are
416the insert emission scores, one per symbol, in alphabetic order.
417
418\item [\textbf{State transition line}] The seven fields on this line
419are the transitions for node $k$, in the order shown by the transition
420header line: $M_k \rightarrow M_{k+1}, I_{k}, D_{k+1}$; $ I_k
421\rightarrow M_{k+1}, I_k$; $D_{k} \rightarrow M_{k+1}, D_{k+1}$.
422
423For transitions from the final node $M$, match state $M+1$ is
424interpreted as the END state $E$, and there is no delete state $M+1$;
425therefore the final $M_k \rightarrow D_{k+1}$ and $D_k \rightarrow
426D_{k+1}$ transitions are always * (zero probability), and the final
427$D_k \rightarrow M_{k+1}$ transition is always 0.0 (probability 1.0).
428\end{sreitems}
429
430Finally, the last line of the format is the ``//'' record separator.
431
432%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
433\newpage
434\section{Stockholm, the recommended multiple sequence alignment format}
435\label{section:stockholm}
436
437The Pfam and Rfam Consortiums have developed a multiple sequence
438alignment format called ``Stockholm format'' that allows rich and
439extensible annotation.
440
441Most popular multiple alignment file formats can be changed into a
442minimal Stockholm format file just by adding a Stockholm header line
443and a trailing \mono{//} terminator:
444
445    \xsreoutput{inclusions/globins4.sto}
446
447The first line in the file must be \mono{\# STOCKHOLM 1.x}, where
448\mono{x} is a minor version number for the format specification (and
449which currently has no effect on my parsers). This line allows a
450parser to instantly identify the file format.
451
452In the alignment, each line contains a name, followed by the aligned
453sequence. A dash, period, underscore, or tilde (but not whitespace)
454denotes a gap. If the alignment is too long to fit on one line, the
455alignment may be split into multiple blocks, with blocks separated by
456blank lines. The number of sequences, their order, and their names
457must be the same in every block. Within a given block, each
458(sub)sequence (and any associated \mono{\#=GR} and \mono{\#=GC} markup,
459see below) is of equal length, called the \emph{block length}. Block
460lengths may differ from block to block. The block length must be at
461least one residue, and there is no maximum.
462
463Other blank lines are ignored. You can add comments anywhere to the
464file (even within a block) on lines starting with a \mono{\#}.
465
466All other annotation is added using a tag/value comment style. The
467tag/value format is inherently extensible, and readily made
468backwards-compatible; unrecognized tags will simply be ignored. Extra
469annotation includes consensus and individual RNA or protein secondary
470structure, sequence weights, a reference coordinate system for the
471columns, and database source information including name, accession
472number, and coordinates (for subsequences extracted from a longer
473source sequence) See below for details.
474
475\subsection{syntax of Stockholm markup}
476
477There are four types of Stockholm markup annotation, for per-file,
478per-sequence, per-column, and per-residue annotation:
479
480\begin{description}%{\monob{\#=GR <seqname> <tag> <..s..>}}
481\item [\monob{\#=GF <tag> <s>}]
482        Per-file annotation. \mono{<s>} is a free format text line
483        of annotation type \mono{<tag>}. For example, \mono{\#=GF DATE
484        April 1, 2000}. Can occur anywhere in the file, but usually
485        all the \mono{\#=GF} markups occur in a header.
486
487\item [\monob{\#=GS <seqname> <tag> <s>}]
488        Per-sequence annotation. \mono{<s>} is a free format text line
489        of annotation type \mono{tag} associated with the sequence
490        named \mono{<seqname>}. For example, \mono{\#=GS seq1
491        SPECIES\_SOURCE Caenorhabditis elegans}. Can occur anywhere
492        in the file, but in single-block formats (e.g. the Pfam
493        distribution) will typically follow on the line after the
494        sequence itself, and in multi-block formats (e.g. HMMER
495        output), will typically occur in the header preceding the
496        alignment but following the \mono{\#=GF} annotation.
497
498\item [\monob{\#=GC <tag> <..s..>}]
499        Per-column annotation. \mono{<..s..>} is an aligned text line
500        of annotation type \mono{<tag>}.
501        \mono{\#=GC} lines are
502        associated with a sequence alignment block; \mono{<..s..>}
503        is aligned to the residues in the alignment block, and has
504        the same length as the rest of the block.
505        Typically \mono{\#=GC} lines are placed at the end of each block.
506
507\item [\monob{\#=GR <seqname> <tag> <..s..>}]
508        Per-residue annotation. \mono{<..s..>} is an aligned text line
509        of annotation type \mono{<tag>}, associated with the sequence
510        named \mono{<seqname>}.
511        \mono{\#=GR} lines are
512        associated with one sequence in a sequence alignment block;
513        \mono{<..s..>}
514        is aligned to the residues in that sequence, and has
515        the same length as the rest of the block.
516        Typically
517        \mono{\#=GR} lines are placed immediately following the
518        aligned sequence they annotate.
519\end{description}
520
521
522\subsection{semantics of Stockholm markup}
523
524Any Stockholm parser will accept syntactically correct files, but is
525not obligated to do anything with the markup lines. It is up to the
526application whether it will attempt to interpret the meaning (the
527semantics) of the markup in a useful way. At the two extremes are the
528Belvu alignment viewer and the HMMER profile hidden Markov model
529software package.
530
531Belvu simply reads Stockholm markup and displays it, without trying to
532interpret it at all. The tag types (\mono{\#=GF}, etc.) are sufficient
533to tell Belvu how to display the markup: whether it is attached to the
534whole file, sequences, columns, or residues.
535
536HMMER uses Stockholm markup to pick up a variety of information from
537the Pfam multiple alignment database. The Pfam consortium therefore
538agrees on additional syntax for certain tag types, so HMMER can parse
539some markups for useful information. This additional syntax is imposed
540by Pfam, HMMER, and other software of mine, not by Stockholm format
541per se. You can think of Stockholm as akin to XML, and what my
542software reads as akin to an XML DTD, if you're into that sort of
543structured data format lingo.
544
545The Stockholm markup tags that are parsed semantically by my software
546are as follows:
547
548\subsection{recognized \#=GF annotations}
549\begin{sreitems}{\monob{TC  <f> <f>}}
550\item [\monob{ID  <s>}]
551        Identifier. \monob{<s>} is a name for the alignment;
552        e.g. ``rrm''. One word. Unique in file.
553
554\item [\monob{AC  <s>}]
555        Accession. \monob{<s>} is a unique accession number for the
556        alignment; e.g.
557        ``PF00001''. Used by the Pfam database, for instance.
558        Often a alphabetical prefix indicating the database
559        (e.g. ``PF'') followed by a unique numerical accession.
560        One word. Unique in file.
561
562\item [\monob{DE  <s>}]
563        Description. \monob{<s>} is a free format line giving
564        a description of the alignment; e.g.
565        ``RNA recognition motif proteins''. One line. Unique in file.
566
567\item [\monob{AU  <s>}]
568        Author. \monob{<s>} is a free format line listing the
569        authors responsible for an alignment; e.g.
570        ``Bateman A''. One line. Unique in file.
571
572\item [\monob{GA  <f> <f>}]
573        Gathering thresholds. Two real numbers giving HMMER bit score
574        per-sequence and per-domain cutoffs used in gathering the
575        members of Pfam full alignments.
576
577\item [\monob{NC  <f> <f>}]
578        Noise cutoffs. Two real numbers giving HMMER bit score
579        per-sequence and per-domain cutoffs, set according to the
580        highest scores seen for unrelated sequences when gathering
581        members of Pfam full alignments.
582
583\item [\monob{TC  <f> <f>}]
584        Trusted cutoffs. Two real numbers giving HMMER bit score
585        per-sequence and per-domain cutoffs, set according to the
586        lowest scores seen for true homologous sequences that
587        were above the GA gathering thresholds, when gathering
588        members of Pfam full alignments.
589\end{sreitems}
590
591\subsection{recognized \#=GS annotations}
592
593\begin{sreitems}{\monob{WT  <f>}}
594\item [\monob{WT  <f>}]
595        Weight. \monob{<f>} is a positive real number giving the
596        relative weight for a sequence, usually used to compensate
597        for biased representation by downweighting similar sequences.
598        Usually the weights average 1.0 (e.g. the weights sum to
599        the number of sequences in the alignment) but this is not
600        required. Either every sequence must have a weight annotated,
601        or none of them can.
602
603\item [\monob{AC  <s>}]
604        Accession. \monob{<s>} is a database accession number for
605        this sequence. (Compare the \mono{\#=GF AC} markup, which gives
606        an accession for the whole alignment.) One word.
607
608\item [\monob{DE  <s>}]
609        Description. \monob{<s>} is one line giving a description for
610        this sequence. (Compare the \mono{\#=GF DE} markup, which gives
611        a description for the whole alignment.)
612\end{sreitems}
613
614
615\subsection{recognized \#=GC annotations}
616
617\begin{sreitems}{\monob{SA\_cons}}
618
619\item [\monob{RF}]
620        Reference line. Any character is accepted as a markup for a
621        column. The intent is to allow labeling the columns with some
622        sort of mark.
623
624\item [\monob{MM}]
625        Model mask line. An 'm' indicates that the column lies within a
626        masked range, so that \mono{hmmbuild} should produce emissions matching
627        the background for a match state corresponding to that alignment column.
628        Otherwise, a '.' is used.
629
630\item [\monob{SS\_cons}]
631        Secondary structure consensus. For protein alignments,
632        DSSP codes or gaps are accepted as markup: [HGIEBTSCX.-\_], where
633        H is alpha helix, G is 3/10-helix, I is p-helix, E is extended
634        strand, B is a residue in an isolated b-bridge, T is a turn,
635        S is a bend, C is a random coil or loop, and X is unknown
636        (for instance, a residue that was not resolved in a crystal
637        structure).
638
639\item [\monob{SA\_cons}]
640        Surface accessibility consensus. 0-9, gap symbols, or X are
641        accepted as markup. 0 means $<$10\% accessible residue surface
642        area, 1 means $<$20\%, 9 means $<$100\%, etc. X means unknown
643        structure.
644\end{sreitems}
645
646\subsection{recognized \#=GR annotations}
647\begin{sreitems}{\monob{SA}}
648\item [\monob{SS}]
649        Secondary structure consensus. See \mono{\#=GC SS\_cons} above.
650\item [\monob{SA}]
651        Surface accessibility consensus. See \mono{\#=GC SA\_cons} above.
652\item [\monob{PP}] Posterior probability for an aligned residue. This
653  represents the probability that this residue is assigned to the HMM
654  state corresponding to this alignment column, as opposed to some
655  other state. (Note that a residue can be confidently
656  \emph{unaligned}: a residue in an insert state or flanking N or C
657  state may have high posterior probability.) The posterior
658  probability is encoded as 11 possible characters \mono{0-9*+}: $0.0
659  \leq p < 0.05$ is coded as 0, $0.05 \leq p < 0.15$ is coded as 1,
660  (... and so on ...), $0.85 \leq p < 0.95$ is coded as 9, and $0.95
661  \leq p \leq 1.0$ is coded as '*'. Gap characters appear in the PP
662  line where no residue has been assigned.
663\end{sreitems}
664
665
666
667%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
668\newpage
669\section{A2M multiple alignment format}
670
671% Adapted from Easel documentation's format_a2m.tex by:
672%  - including format_a2m.tex
673%  - add the section header below
674%  - revise first pp to be about HMMER.
675
676HMMER's Easel library routines are capable of writing alignments in UC
677Santa Cruz ``A2M'' (alignment to model) format, the native input
678format for the UCSC SAM profile HMM software package.
679
680To select A2M format, use the format code \mono{a2m}: for example,
681to reformat a Stockholm alignment to A2M:
682
683\user{esl-reformat a2m myali.sto}
684
685Easel currently does not read A2M format, and it currently only writes
686in what UCSC calls ``dotless'' A2M format.
687
688The most official documentation for A2M format appears to be at
689\url{http://compbio.soe.ucsc.edu/a2m-desc.html}. You may refer to that
690document if anything in the brief description below is unclear.
691
692\subsection{An example A2M file}
693
694This alignment:
695
696\begin{sreoutput}
697seq1  ACDEF...GHIKLMNPQTVWY
698seq2  ACDEF...GHIKLMNPQTVWY
699seq3  ---EFmnrGHIKLMNPQT---
700\end{sreoutput}
701
702\noindent
703is encoded in A2M format as:
704
705\begin{sreoutput}
706>seq1  Sequence 1 description
707ACDEFGHIKLMNPQTVWY
708>seq2  Sequence 2 description
709ACDEFGHIKLMNPQTVWY
710>seq3  Sequence 3 description
711---EFmnrGHIKLMNPQT---
712\end{sreoutput}
713
714A2M format looks a lot like aligned FASTA format. A crucial difference
715is that the aligned sequences in a ``dotless'' A2M file do not
716necessarily all have the same number of characters. The format
717distinguishes between ``consensus columns'' (where residues are in
718upper case and gaps are a dash, `-') and ``insert columns'' (where
719residues are in lower case and gaps are dots, `.', that aren't
720explicitly shown in the format -- hence ``dotless'' A2M). The position
721and number of gaps in insert columns (dots) is implicit in this
722representation.  An advantage of this format is its compactness.
723
724This representation only works if all insertions relative to consensus
725are considered to be unaligned characters. That is how insertions are
726handled by profile HMM implementations like SAM and HMMER, and profile
727SCFG implementations like Infernal.
728
729Thus every sequence must have the same number of consensus columns
730(upper case letters plus `-' characters), and may have additional lower
731case letters for insertions.
732
733\subsection{Legal characters}
734
735A2M (and SAM) do not support some special characters such as `*'
736(not-a-residue) or `\mono{\~}' (missing data). Easel outputs these
737characters as gaps: either `-' in a consensus column, or nothing in an
738insert column.
739
740The SAM software parses only a subset of legal ambiguity codes for
741amino acids and nucleotides. For amino acids, it only reads \{BXZ\} in
742addition to the 20 standard one-letter codes. For nucleotides, it only
743reads \{NRY\} in addition to \{ACGTU\}. With one crucial exception, it
744treats all other letters as X or N.
745
746The crucial exception is `O'. SAM reads an `O' as the position of a
747``free insertion module'' (FIM), a concept specific to SAM-style
748profile HMMs. This has no impact on nucleic acid sequences, where `O'
749is not a legal character. But in amino acid sequences, `O' means
750pyrrolysine, one of the unusual genetically-encoded amino acids.  This
751means that A2M format alignments must not contain pyrrolysine
752residues, lest they be read as FIMs. For this reason, Easel converts
753`O' residues to `X' when it writes an amino acid alignment in A2M
754format.
755
756\subsection{Determining consensus columns}
757
758Writing A2M format requires knowing which alignment columns are
759supposed to be considered consensus and which are considered
760inserts. If the alignment was produced by HMMER or Infernal, then the
761alignment has so-called ``reference annotation'' (what appears as a
762\mono{\#=GC RF} line in Stockholm format) marking the consensus
763columns.
764
765Often, an alignment has no reference annotation; for example, if it
766has been read from an alignment format that has no reference
767annotation line (only Stockholm and SELEX formats support reference
768annotation). In this case, Easel internally generates a ``reasonable''
769guess at consensus columns, using essentially the same procedure that
770HMMER's \mono{hmmbuild} program uses by default: sequence fragments
771(sequences $<$50\% of the mean sequence length in the alignment
772overall) are ignored, and for the remaining sequences, any column
773containing $\geq$ 50\% residues is considered to be a consensus
774column.
775
776
777
778
779%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
780\newpage
781\section{hmmpgmd sequence database format}
782
783The hmmpgmd sequence database format closely resembles the
784FASTA format, with slight modification to support use within HMMER's
785\mono{hmmpgmd} daemon.
786
787
788The \mono{hmmpgmd} program enables search of one or more sequence
789databases (e.g. NR, SwissProt, UniProt) within a single instance,
790having loaded a single sequence file into memory. Because the set of
791sequences found within the various databases may overlap, the hmmpgmd
792format allows each sequence to be stored once, and includes a small piece of
793metadata that indicates, for each sequence, the list of source databases in
794which the sequence is found. When a search is performed in \mono{hmmpgmd}, a
795single target database is specified, and search is restricted to sequences
796belonging to that database.
797
798Furthermore, because a single sequence might be found multiple times
799within a single sequence database, \mono{hmmpgmd} is designed to compute
800E-values not just on the total number of non-redundant sequences
801within a database, but on the total number of sequences in the original
802(possibly redundant) database, provided those redundant counts are
803given in the hmmpgmd-formatted file.
804
805The \mono{hmmpgmd} file begins with a single line containing various counts
806describing the contents of the file, of the form
807
808\monob{\#res\_cnt seq\_cnt db\_cnt cnt\_1 fullcnt\_1 cnt\_2 fullcnt\_2 $\ldots$ date\_stamp}
809
810\subsection{Fields in header line}
811
812\begin{sreitems}{\monob{header}}
813
814\item [\monob{res\_cnt}] Number of residues in the sequence file.
815
816\item [\monob{seq\_cnt}] Number of sequences in the sequence file.
817
818\item [\monob{db\_cnt}] Number of databases in the sequence file.
819
820\item [\monob{cnt\_i}] Of the sequnces in the file, the number that belong to
821database \mono{i}. Note that if the file contains only a single
822database, this will match \mono{seq\_cnt}.
823
824\item [\monob{fullcnt\_i}] For database \mono{i}, the number of sequences
825that should be used in computing E-values. If redundant
826sequences were collapsed out of the original database, this may be
827larger than \mono{cnt\_i}.
828
829\end{sreitems}
830
831
832\subsection{FASTA-like sequence format}
833
834In the main body of the sequence file, database sequences are stored
835sequentially, with each entry consisting of a one-line FASTA-like
836header followed by one or more lines containing the amino acid sequence,
837like
838
839\begin{sreoutput}
840>1 100
841ACDEFGHIKLMNPQTVWY
842>2 010
843ACDKLMNPQTVWYEFGHI
844>3 111
845EFMNRGHIKLMNPQT
846\end{sreoutput}
847
848Note that the per-entry header line is made up of two parts. The first part
849is a numeric, incrementing sequence identifier (the i'th entry has the
850identifier \mono{i}). The second part is a bit string indicating database
851membership. In this example, sequence 1 is found in database 1, sequence 2 is
852found in database 2, and sequence 3 found in databases 1, 2, and 3. The number
853of bits in each bit string should match \mono{db\_cnt}.
854
855Because \mono{hmmpgmd} accepts only numeric sequence identifiers, it is
856necessary to keep track of the mapping between each numeric sequence identifier
857and the corresponding meta data (e.g. name, accession, description) external to
858the hmmpgmd-format file, and post-process \mono{hmmpgmd} seach results to
859apply those fields to the target sequence information.
860Depending on the size of the map list, this might be easily acheived with a
861simple perl array, or might require a more heavy-duty mapping backend such as
862mongodb (\url{http://www.mongodb.org}).
863
864
865\subsection{Creating a file in hmmpgmd format}
866
867The HMMER-Easel tool \mono{esl-reformat} is able to convert a file in unaligned
868fasta format into an hmmpgmd format file, such as
869
870\user{esl-reformat --id\_map mydb.hmmpgmd.map hmmpgmd mydb.fasta > mydb.hmmpgmd}
871
872The optional \mono{--id\_map} flag captures sequence name and description
873information into a simple tabular file, to be used for mapping those values
874back to \mono{hmmpgmd} query hits.
875
876
877
878%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
879\newpage
880\section{Score matrix files}
881
882Profile HMMs can be built from single sequences, not just from
883multiple sequence alignments. For example, the \mono{phmmer} and
884\mono{jackhmmer} search programs take a single sequence query as an
885input, and \mono{nhmmer} can as well. For single sequence queries, the
886probability parameters for residue alignments are derived from a
887substitution score matrix, such as BLOSUM62. Scores are converted to
888probabilities as described by Altschul (1991).\cite{Altschul91}. The
889scores can be arbitrary, but they must satisfy a couple of conditions
890so they can be interpreted as implicit log-odds probabilities: there
891must be at least one positive score, and the expected score (on
892nonhomologous alignments) must be negative.
893
894The default score matrix for protein alignments is BLOSUM62; for DNA,
895a matrix we call DNA1. Using the \mono{--mx} option (for programs that
896can use score matrices), you can choose instead one of several
897alternative protein score matrices: PAM30, PAM70, PAM120, PAM240,
898BLOSUM45, BLOSUM50, BLOSUM80, and BLOSUM90. For example, you could use
899\mono{--mx BLOSUM50}.
900
901The \mono{--mxfile} option allows you to provide a score matrix as a
902file. You can download many standard score matrice files from NCBI at
903\url{ftp://ftp.ncbi.nlm.nih.gov/blast/matrices/}. HMMER can read
904\emph{almost} any of these files, with one exception: because it
905requires the score matrix to be symmetrical (same number of residues
906in rows and columns), the NCBI \mono{NUC.4.4} matrix for DNA works,
907but the alternative short-form format \mono{NUC.4.2} does not
908work. The scores in these two files are identical, so just use
909\mono{NUC.4.4}, and files like it.
910
911Here's a simple example file:
912
913\begin{sreoutput}
914# My score matrix
915#
916   A  T  G  C
917A  1 -3 -3 -3
918T -3  1 -3 -3
919G -3 -3  1 -3
920C -3 -3 -3  1
921\end{sreoutput}
922
923In more detail, the rules for the format are:
924
925\begin{itemize}
926
927\item Blank lines are ignored.
928
929\item A \mono{\#} indicates a comment. Anything after it is
930  ignored. Lines that start with \mono{\#} are ignored like blank
931  lines.
932
933\item The first data line is a header line, labeling each column with
934  $n$ single residue characters (case-insensitive). A nucleic matrix
935  has $4 \leq n \leq 16$: at least the four canonical residues
936  \mono{ACGT} (or \mono{U}, for you RNA zealots), and it may also
937  contain any or all ambiguity codes \mono{RYMKSWHBVDN*}. A protein
938  matrix has $20 \leq n 27$; it must contain at least the 20 canonical
939  residues \mono{ACDEFGHIKLMNPQRSTVWY}, and may contain any or all
940  ambiguity codes \mono{BJZOU}. These residues can be in any order,
941  but the rows must be in the same order as the columns.
942
943\item The next $n$ data lines are the rows of a square $n \times n$
944  score matrix:
945
946\begin{itemize}
947\item  Fields in the row are whitespace-delimited (tabs or
948       spaces).
949
950\item Optionally, each row can start with its single residue label.
951  Therefore each row has either $n+1$ fields if there is a leading
952  label, or $n$ fields if not. Rows and columns are in the same order.
953
954\item  Each score is an integer. Plus signs are optional.
955\end{itemize}
956
957\item The file may not contain any additional lines (other than
958  comments or blank lines).
959\end{itemize}
960
961HMMER only uses the scores for canonical residues, and uses them to
962calculate probabilities. If scores for ambiguous residue codes are
963provided, HMMER ignores them; it has its own logic for dealing with
964the probability of ambiguous residues, given the probability of
965canonical residues.
966
967
968
969
970
971
972
973