1\chapter{Input files and formats} 2\label{chapter:formats} 3\setcounter{footnote}{0} 4 5\section{Reading from files, compressed files, and pipes} 6 7Generally, HMMER programs read their sequence and/or profile input 8from files. Unix power users often find it convenient to string an 9incantation of commands together with pipes.\sidenote{Indeed, such 10 wizardly incantations are a point of pride.} For example, you might 11extract a subset of query sequences from a larger file using a 12one-liner combination of scripting commands (python, perl, awk, 13whatever). To facilitate the use of HMMER programs in such 14incantations, you can almost always use an argument of '-' (dash) in 15place of a filename, and the program will take its input from a 16standard input pipe instead of opening a file. 17 18For example, the following three commands are equivalent, and give 19essentially identical output: 20 21 \vspace{1ex} 22 \user{\% hmmsearch globins4.hmm uniprot\_sprot.fasta} \\ 23 \user{\% cat globins4.hmm | hmmsearch - uniprot\_sprot.fasta} \\ 24 \user{\% cat uniprot\_sprot.fasta | hmmsearch globins4.hmm - } \\ 25 \vspace{1ex} 26 27Most Easel ``miniapp'' programs share the same pipe-reading ability. 28 29Because the programs for profile HMM fetching (\mono{hmmfetch}) and 30sequence fetching (\mono{esl-sfetch}) can fetch any number of profiles 31or sequences by names/accessions given in a list, \emph{and} these 32programs can also read these lists from a stdin pipe, you can craft 33incantations that generate subsets of queries or targets on the 34fly. For example: 35 36 \vspace{1ex} 37 \begin{fullwidth} 38 \user{\indent \% esl-sfetch --index uniprot\_sprot.fasta} \\ 39 \user{\% cat mytargs.list | esl-sfetch -f uniprot\_sprot.fasta - | hmmsearch globins4.hmm -} \\ 40 \end{fullwidth} 41 \vspace{1ex} 42 43This takes a list of sequence names/accessions in 44\mono{mytargets.list}, fetches them one by one from UniProt (note that 45we index the UniProt file first, for fast retrieval; and note that 46\mono{esl-sfetch} is reading its \mono{<namefile>} list of 47names/accessions through a pipe using the '-' argument), and pipes 48them to an \mono{hmmsearch}. It should be obvious from this that we 49can replace the \mono{cat mytargets.list} with \emph{any} incantation 50that generates a list of sequence names/accessions (including SQL 51database queries). 52 53Ditto for piping subsets of profiles. Supposing you have a copy of Pfam in Pfam-A.hmm: 54 55 \vspace{1ex} 56 \begin{fullwidth} 57 \user{\indent \% hmmfetch --index Pfam-A.hmm} \\ 58 \user{\% cat myqueries.list | hmmfetch -f Pfam.hmm - | hmmsearch - uniprot\_sprot.fasta}\\ 59 \end{fullwidth} 60 \vspace{1ex} 61 62This takes a list of query profile names/accessions in 63\mono{myqueries.list}, fetches them one by one from Pfam, and does an 64hmmsearch with each of them against UniProt. As above, the \mono{cat 65 myqueries.list} part can be replaced by any suitable incantation 66that generates a list of profile names/accessions. 67 68There are three kinds of cases where using '-' is restricted or 69doesn't work. A fairly obvious restriction is that you can only use 70one '-' per command; you can't do a \mono{hmmsearch - -} that tries to 71read both profile queries and sequence targets through the same stdin 72pipe. Second, another case is when an input file must be obligately 73associated with additional, separately generated auxiliary files, so 74reading data from a single stream using '-' doesn't work because the 75auxiliary files aren't present (in this case, using '-' will be 76prohibited by the program). An example is \mono{hmmscan}, which needs 77its \mono{<hmmfile>} argument to be associated with four auxiliary 78files named \mono{<hmmfile>.h3\{mifp\}} that \mono{hmmpress} creates, 79so \mono{hmmscan} does not permit a '-' for its \mono{<hmmfile>} 80argument. Finally, when a command would require multiple passes over 81an input file, the command will generally abort after the first pass 82if you are trying to read that file through a standard input pipe 83(pipes are nonrewindable in general; a few HMMER or Easel programs 84will buffer input streams to make multiple passes possible, but this 85is not usually the case). An example would be trying to search a file 86containing multiple profile queries against a streamed target 87database: 88 89 \vspace{1ex} 90 \begin{fullwidth} 91 \user{\indent \% cat myqueries.list | hmmfetch -f Pfam.hmm > many.hmms}\\ 92 \user{\% cat mytargets.list | esl-sfetch -f uniprot\_sprot.fasta - | hmmsearch many.hmms -}\\ 93 \end{fullwidth} 94 \vspace{1ex} 95 96This will fail. Unfortunately the above business about how it will 97``generally abort after the first pass'' means it fails weirdly. The 98first query profile search will succeed, and its output will appear; 99then an error message will be generated when \mono{hmmsearch} sees the 100\emph{second} profile query and oops, suddenly realizes it is unable 101to rewind the target sequence database stream. This is inherent in how 102it reads the profile HMM query file sequentially as a stream (which is 103what's allowing it to read input from stdin pipes in the first place), 104one model at a time: it doesn't see there's more than one query model 105in the file until it gets to the second model. 106 107This case isn't too restricting because the same end goal can be 108achieved by reordering the commands. In cases where you want to do 109multiple queries against multiple targets, you always want to be 110reading the \emph{queries} from a stdin pipe, not the targets: 111 112 \vspace{1ex} 113 \user{\% cat mytargets.list | esl-sfetch -f uniprot\_sprot.fasta > mytarget.seqs}\\ 114 \user{\% cat myqueries.list | hmmfetch -f Pfam.hmm - | hmmsearch - mytarget.seqs}\\ 115 \vspace{1ex} 116 117So in this multiple queries/multiple targets case of using stdin 118pipes, you just have to know, for any given program, which file it 119considers to be queries and which it considers to be targets. (That 120is, the logic in searching many queries against many targets is ``For 121each query: search the target database; then rewind the target 122database to the beginning.'') For \mono{hmmsearch}, the profiles are 123queries and sequences are targets. For \mono{hmmscan}, the reverse. 124 125In general, HMMER and Easel programs document in their man page 126whether (and which) command line arguments can be replaced by '-'. 127You can always check by trial and error, too. The worst that can 128happen is a ``Failed to open file -'' error message, if the program 129can't read from pipes. 130 131\subsection{.gz compressed files} 132 133In general, HMMER programs and Easel miniapps can also read \mono{.gz} 134compressed files; they will uncompress them on the fly. You need to 135have \mono{gunzip} installed on your system for this to work. 136 137 138 139 140%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 141\newpage 142\section{HMMER profile HMM files} 143\label{section:savefiles} 144 145A HMMER profile file looks like this, with ...'s marking elisions made 146for clarity and space:\sidenote{This is the \mono{globins4.hmm} 147 profile from the tutorial.} 148 149 \xsreoutput{inclusions/hmmbuild-globins.out2} 150 151A profile file consists of one or more profiles. Each profile starts 152with a format version identifier (here, \mono{HMMER3/\HMMERfmtversion{}}) and ends with 153\mono{//} on a line by itself. The format version identifier allows 154backward compatibility as the HMMER software evolves: it tells the 155parser this file is from HMMER3's save file format version 156\HMMERfmtversion{}.\footnote{HMMER 3.0 used 3/b format. HMMER 3.1 and 3.2 use 3/f 157 format. Some alpha test versions of 3.0 used 3/a format. Internal 158 development versions of 3.1 used 3/c, 3/d, and 3/e formats.} The 159closing \mono{//} allows multiple profiles to be concatenated. 160 161The format is divided into two regions. The first region contains 162textual information and miscalleneous parameters in a roughly 163tag-value scheme. This section ends with a line beginning with the 164keyword \mono{HMM}. The second region is a tabular, whitespace-limited 165format for the main model parameters. 166 167All probability parameters are all stored as negative natural log 168probabilities with five digits of precision to the right of the 169decimal point, rounded. For example, a probability of $0.25$ is stored 170as $-\log 0.25 = 1.38629$. The special case of a zero probability is 171stored as '*'. 172 173Spacing is arranged for human readability, but the parser only cares 174that fields are separated by at least one space character. 175 176A more detailed description of the format follows. 177 178\subsection{header section} 179 180The header section is parsed line by line in a tag/value format. Each 181line type is either \textbf{mandatory} or \textbf{optional} as 182indicated. 183 184\begin{sreitems}{\monob{header}} 185 186\item [\monob{HMMER3/\HMMERfmtversion{}}] Unique identifier for the save file format 187 version; the \mono{/\HMMERfmtversion{}} means that this is HMMER3 profile file format 188 version \HMMERfmtversion{}. When HMMER3 changes its save file format, the revision 189 code advances. This way, parsers can be backwards 190 compatible. The remainder of the line after the \mono{HMMER3/\HMMERfmtversion{}} tag 191 is free text that is ignored by the parser. HMMER currently writes 192 its version number and release date in brackets here, 193 e.g. \mono{\HMMERsavestamp{}} in this 194 example. \textbf{Mandatory.} 195 196\item [\monob{NAME <s>}] Model name; \mono{<s>} is a single word 197containing no spaces or tabs. The name is normally picked up from the 198\mono{\#=GF ID} line from a Stockholm alignment file. If this is not 199present, the name is created from the name of the alignment file by 200removing any file type suffix. For example, an otherwise nameless HMM 201built from the alignment file \mono{rrm.slx} would be named 202\mono{rrm}. \textbf{Mandatory.} 203 204\item [\monob{ACC <s>}] Accession number; \mono{<s>} is a one-word 205accession number. This is picked up from the \mono{\#=GF AC} line in a 206Stockholm format alignment. \textbf{Optional.} 207 208\item [\monob{DESC <s>}] Description line; \mono{<s>} is a one-line 209free text description. This is picked up from the \mono{\#=GF DE} line 210in a Stockholm alignment file. \textbf{Optional.} 211 212\item [\monob{LENG <d>}] Model length; \mono{<d>}, a positive nonzero 213integer, is the number of match states in the model. 214\textbf{Mandatory.} 215 216\item [\monob{MAXL <d>}] Max instance length; \mono{<d>}, a positive 217nonzero integer, is the upper bound on the length at which and instance 218of the model is expected to be found. Used only by nhmmer and nhmmscan. 219\textbf{Optional.} 220 221\item [\monob{ALPH <s>}] Symbol alphabet type. For biosequence 222analysis models, \mono{<s>} is \mono{amino}, \mono{DNA}, or \mono{RNA} 223(case insensitive). There are also other accepted alphabets for 224purposes beyond biosequence analysis, including \mono{coins}, 225\mono{dice}, and \mono{custom}. This determines the symbol alphabet 226and the size of the symbol emission probability distributions. If 227\mono{amino}, the alphabet size $K$ is set to 20 and the symbol 228alphabet to ``ACDEFGHIKLMNPQRSTVWY'' (alphabetic order); if 229\mono{DNA}, the alphabet size $K$ is set to 4 and the symbol alphabet 230to ``ACGT''; if \mono{RNA}, the alphabet size $K$ is set to 4 and the 231symbol alphabet to ``ACGU''. \textbf{Mandatory.} 232 233\item [\monob{RF <s>}] Reference annotation flag; \mono{<s>} is 234either \mono{no} or \mono{yes} (case insensitive). If \mono{yes}, the 235reference annotation character field for each match state in the main 236model (see below) is valid; if \mono{no}, these characters are 237ignored. Reference column annotation is picked up from a Stockholm 238alignment file's \mono{\#=GC RF} line. It is propagated to alignment 239outputs, and also may optionally be used to define consensus match 240columns in profile HMM construction. \textbf{Optional}; assumed to be 241no if not present. 242 243\item [\monob{MM <s>}] Model masked flag; \mono{<s>} is 244either \mono{no} or \mono{yes} (case insensitive). If \mono{yes}, the 245model mask annotation character field for each match state in the main 246model (see below) is valid; if \mono{no}, these characters are 247ignored. Indicates that the profile model was created such that 248emission probabilities at masked positions are set to match the 249background frequency, rather than being set based on observed frequencies 250in the alignment. Position-specific insertion and deletion rates are not 251altered, even in masked regions. \textbf{Optional}; assumed to be 252no if not present. 253 254\item [\monob{CONS <s>}] Consensus residue annotation flag; 255 \mono{<s>} is either \mono{no} or \mono{yes} (case insensitive). If 256 \mono{yes}, the consensus residue field for each match state in the 257 main model (see below) is valid. If \mono{no}, these characters are 258 ignored. Consensus residue annotation is determined when models are 259 built. For models of single sequences, the consensus is the same as 260 the query sequence. For models of multiple alignments, the consensus 261 is the maximum likelihood residue at each position. Upper case 262 indicates that the model's emission probability for the consensus 263 residue is $\geq$ an arbitrary threshold (0.5 for protein models, 264 0.9 for DNA/RNA models). 265 266\item [\monob{CS <s>}] Consensus structure annotation flag; 267\mono{<s>} is either \mono{no} or \mono{yes} (case insensitive). If 268\mono{yes}, the consensus structure character field for each match 269state in the main model (see below) is valid; if \mono{no} these 270characters are ignored. Consensus structure annotation is picked up 271from a Stockholm file's \mono{\#=GC SS\_cons} line, and propagated to 272alignment displays. \textbf{Optional}; assumed to be no if not 273present. 274 275\item [\monob{MAP <s>}] Map annotation flag; \mono{<s>} is either 276\mono{no} or \mono{yes} (case insensitive). If set to \mono{yes}, the 277map annotation field in the main model (see below) is valid; if 278\mono{no}, that field will be ignored. The HMM/alignment map 279annotates each match state with the index of the alignment column from 280which it came. It can be used for quickly mapping any subsequent 281HMM alignment back to the original multiple alignment, via the model. 282\textbf{Optional}; assumed to be no if not present. 283 284\item [\monob{DATE <s>}] Date the model was constructed; \mono{<s>} 285is a free text date string. This field is only used for logging 286purposes.\footnote{HMMER does not use dates for any purpose other than 287human-readable annotation, so it is no more prone than you are to Y2K, 288Y2038, or any other date-related eschatology.} \textbf{Optional.} 289 290\item [\monob{COM [<n>] <s>}] Command line log; \mono{<n>} counts 291command line numbers, and \mono{<s>} is a one-line command. There may 292be more than one \mono{COM} line per save file, each numbered starting 293from $n=1$. These lines record every HMMER command that modified the 294save file. This helps us reproducibly and automatically log how Pfam 295models have been constructed, for example. \textbf{Optional.} 296 297\item [\monob{NSEQ <d>}] Sequence number; \mono{<d>} is a nonzero 298positive integer, the number of sequences that the HMM was trained on. 299This field is only used for logging purposes. 300\textbf{Optional.} 301 302\item [\monob{EFFN <f>}] Effective sequence number; \mono{<f>} is a 303nonzero positive real, the effective total number of sequences 304determined by \mono{hmmbuild} during sequence weighting, for combining 305observed counts with Dirichlet prior information in parameterizing the 306model. This field is only used for logging purposes. 307\textbf{Optional.} 308 309\item [\monob{CKSUM <d>}] Training alignment checksum; \mono{<d>} is 310 a nonnegative unsigned 32-bit integer. This number is calculated 311 from the training sequence data, and used in conjunction with the 312 alignment map information to verify that a given alignment is indeed 313 the alignment that the map is for. \textbf{Optional.} 314 315\item [\monob{GA <f> <f>}] Pfam gathering thresholds GA1 and GA2. 316See Pfam documentation of GA lines. \textbf{Optional.} 317 318\item [\monob{TC <f> <f>}] Pfam trusted cutoffs TC1 and TC2. See 319Pfam documentation of TC lines. \textbf{Optional.} 320 321\item [\monob{NC <f> <f>}] Pfam noise cutoffs NC1 and NC2. See Pfam 322documentation of NC lines. \textbf{Optional.} 323 324\item [\monob{STATS <s1> <s2> <f1> <f2>}] Statistical parameters 325 needed for E-value calculations. \mono{<s1>} is the model's 326 alignment mode configuration: currently only \mono{LOCAL} is 327 recognized. \mono{<s2>} is the name of the score distribution: 328 currently \mono{MSV}, \mono{VITERBI}, and \mono{FORWARD} are 329 recognized. \mono{<f1>} and \mono{<f2>} are two real-valued 330 parameters controlling location and slope of each distribution, 331 respectively; $\mu$ and $\lambda$ for Gumbel distributions for MSV 332 and Viterbi scores, and $\tau$ and $\lambda$ for exponential tails 333 for Forward scores. $\lambda$ values must be positive. All three 334 lines or none of them must be present: when all three are present, 335 the model is considered to be calibrated for E-value 336 statistics. \textbf{Optional.} 337 338\item [\monob{HMM }] Flags the start of the main model 339section. Solely for human readability of the tabular model data, the 340symbol alphabet is shown on the \mono{HMM} line, aligned to the fields 341of the match and insert symbol emission distributions in the main 342model below. The next line is also for human readability, providing 343column headers for the state transition probability fields in the main 344model section that follows. Though unparsed after the \mono{HMM} tag, 345the presence of two header lines is \textbf{mandatory:} the parser 346always skips the line after the \mono{HMM} tag line. 347 348\item [\monob{COMPO <f>*K}] The first line in the main model section 349may be an optional line starting with \monob{COMPO}: these are the 350model's overall average match state emission probabilities, which are 351used as a background residue composition in the ``filter null'' 352model. The $K$ fields on this line are log probabilities for each 353residue in the appropriate biosequence alphabet's 354order. \textbf{Optional.} 355 356\end{sreitems} 357 358\subsection{main model section} 359 360All the remaining fields are \textbf{mandatory}. 361 362The first two lines in the main model section are 363atypical.\footnote{That is, the first two lines after the optional 364 \mono{COMPO} line. Don't be confused by the presence of an optional \mono{COMPO} 365 line here. The \mono{COMPO} line is placed in the model section, below the 366 residue column headers, because it's an array of numbers much like 367 residue scores, but it's not really part of the model.} They 368contain information for the core model's BEGIN node. This is stored as 369model node 0, and match state 0 is treated as the BEGIN state. The 370begin state is mute, so there are no match emission probabilities. The 371first line is the insert 0 emissions. The second line contains the 372transitions from the begin state and insert state 0. These seven 373numbers are: $B \rightarrow M_1$, $B \rightarrow I_0$, $B \rightarrow 374D_1$; $I_0 \rightarrow M_1$, $I_0 \rightarrow I_0$; then a 0.0 and a 375'*', because by convention, nonexistent transitions from the 376nonexistent delete state 0 are set to $\log 1 = 0$ and $\log 0 = 377-\infty = $ `*'. 378 379The remainder of the model has three lines per node, for $M$ nodes 380(where $M$ is the number of match states, as given by the \mono{LENG} 381line). These three lines are ($K$ is the alphabet size in residues): 382 383\begin{sreitems}{\textbf{State transition line}} 384 385\item [\textbf{Match emission line}] The first field is the node 386number ($1 \ldots M$). The parser verifies this number as a 387consistency check (it expects the nodes to come in order). The next 388$K$ numbers for match emissions, one per symbol, in alphabetic order. 389 390The next field is the \mono{MAP} annotation for this node. If 391\mono{MAP} was \mono{yes} in the header, then this is an integer, 392representing the alignment column index for this match state 393(1..alen); otherwise, this field is `-'. 394 395The next field is the \mono{CONS} consensus residue for this node. If 396\mono{CONS} was \mono{yes} in the header, then this is a single 397character, representing the consensus residue annotation for this 398match state; otherwise, this field is `-'. 399 400The next field is the \mono{RF} annotation for this node. If 401\mono{RF} was \mono{yes} in the header, then this is a single 402character, representing the reference annotation for this match state; 403otherwise, this field is `-'. 404 405The next field is the \mono{MM} mask value for this node. If 406\mono{MM} was \mono{yes} in the header, then this is a single 'm' 407character, indicating that the position was identified as a masked 408position during model construction; otherwise, this field is `-'. 409 410The next field is the \mono{CS} annotation for this node. If 411\mono{CS} was \mono{yes}, then this is a single character, 412representing the consensus structure at this match state; otherwise 413this field is `-'. 414 415\item [\textbf{Insert emission line}] The $K$ fields on this line are 416the insert emission scores, one per symbol, in alphabetic order. 417 418\item [\textbf{State transition line}] The seven fields on this line 419are the transitions for node $k$, in the order shown by the transition 420header line: $M_k \rightarrow M_{k+1}, I_{k}, D_{k+1}$; $ I_k 421\rightarrow M_{k+1}, I_k$; $D_{k} \rightarrow M_{k+1}, D_{k+1}$. 422 423For transitions from the final node $M$, match state $M+1$ is 424interpreted as the END state $E$, and there is no delete state $M+1$; 425therefore the final $M_k \rightarrow D_{k+1}$ and $D_k \rightarrow 426D_{k+1}$ transitions are always * (zero probability), and the final 427$D_k \rightarrow M_{k+1}$ transition is always 0.0 (probability 1.0). 428\end{sreitems} 429 430Finally, the last line of the format is the ``//'' record separator. 431 432%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 433\newpage 434\section{Stockholm, the recommended multiple sequence alignment format} 435\label{section:stockholm} 436 437The Pfam and Rfam Consortiums have developed a multiple sequence 438alignment format called ``Stockholm format'' that allows rich and 439extensible annotation. 440 441Most popular multiple alignment file formats can be changed into a 442minimal Stockholm format file just by adding a Stockholm header line 443and a trailing \mono{//} terminator: 444 445 \xsreoutput{inclusions/globins4.sto} 446 447The first line in the file must be \mono{\# STOCKHOLM 1.x}, where 448\mono{x} is a minor version number for the format specification (and 449which currently has no effect on my parsers). This line allows a 450parser to instantly identify the file format. 451 452In the alignment, each line contains a name, followed by the aligned 453sequence. A dash, period, underscore, or tilde (but not whitespace) 454denotes a gap. If the alignment is too long to fit on one line, the 455alignment may be split into multiple blocks, with blocks separated by 456blank lines. The number of sequences, their order, and their names 457must be the same in every block. Within a given block, each 458(sub)sequence (and any associated \mono{\#=GR} and \mono{\#=GC} markup, 459see below) is of equal length, called the \emph{block length}. Block 460lengths may differ from block to block. The block length must be at 461least one residue, and there is no maximum. 462 463Other blank lines are ignored. You can add comments anywhere to the 464file (even within a block) on lines starting with a \mono{\#}. 465 466All other annotation is added using a tag/value comment style. The 467tag/value format is inherently extensible, and readily made 468backwards-compatible; unrecognized tags will simply be ignored. Extra 469annotation includes consensus and individual RNA or protein secondary 470structure, sequence weights, a reference coordinate system for the 471columns, and database source information including name, accession 472number, and coordinates (for subsequences extracted from a longer 473source sequence) See below for details. 474 475\subsection{syntax of Stockholm markup} 476 477There are four types of Stockholm markup annotation, for per-file, 478per-sequence, per-column, and per-residue annotation: 479 480\begin{description}%{\monob{\#=GR <seqname> <tag> <..s..>}} 481\item [\monob{\#=GF <tag> <s>}] 482 Per-file annotation. \mono{<s>} is a free format text line 483 of annotation type \mono{<tag>}. For example, \mono{\#=GF DATE 484 April 1, 2000}. Can occur anywhere in the file, but usually 485 all the \mono{\#=GF} markups occur in a header. 486 487\item [\monob{\#=GS <seqname> <tag> <s>}] 488 Per-sequence annotation. \mono{<s>} is a free format text line 489 of annotation type \mono{tag} associated with the sequence 490 named \mono{<seqname>}. For example, \mono{\#=GS seq1 491 SPECIES\_SOURCE Caenorhabditis elegans}. Can occur anywhere 492 in the file, but in single-block formats (e.g. the Pfam 493 distribution) will typically follow on the line after the 494 sequence itself, and in multi-block formats (e.g. HMMER 495 output), will typically occur in the header preceding the 496 alignment but following the \mono{\#=GF} annotation. 497 498\item [\monob{\#=GC <tag> <..s..>}] 499 Per-column annotation. \mono{<..s..>} is an aligned text line 500 of annotation type \mono{<tag>}. 501 \mono{\#=GC} lines are 502 associated with a sequence alignment block; \mono{<..s..>} 503 is aligned to the residues in the alignment block, and has 504 the same length as the rest of the block. 505 Typically \mono{\#=GC} lines are placed at the end of each block. 506 507\item [\monob{\#=GR <seqname> <tag> <..s..>}] 508 Per-residue annotation. \mono{<..s..>} is an aligned text line 509 of annotation type \mono{<tag>}, associated with the sequence 510 named \mono{<seqname>}. 511 \mono{\#=GR} lines are 512 associated with one sequence in a sequence alignment block; 513 \mono{<..s..>} 514 is aligned to the residues in that sequence, and has 515 the same length as the rest of the block. 516 Typically 517 \mono{\#=GR} lines are placed immediately following the 518 aligned sequence they annotate. 519\end{description} 520 521 522\subsection{semantics of Stockholm markup} 523 524Any Stockholm parser will accept syntactically correct files, but is 525not obligated to do anything with the markup lines. It is up to the 526application whether it will attempt to interpret the meaning (the 527semantics) of the markup in a useful way. At the two extremes are the 528Belvu alignment viewer and the HMMER profile hidden Markov model 529software package. 530 531Belvu simply reads Stockholm markup and displays it, without trying to 532interpret it at all. The tag types (\mono{\#=GF}, etc.) are sufficient 533to tell Belvu how to display the markup: whether it is attached to the 534whole file, sequences, columns, or residues. 535 536HMMER uses Stockholm markup to pick up a variety of information from 537the Pfam multiple alignment database. The Pfam consortium therefore 538agrees on additional syntax for certain tag types, so HMMER can parse 539some markups for useful information. This additional syntax is imposed 540by Pfam, HMMER, and other software of mine, not by Stockholm format 541per se. You can think of Stockholm as akin to XML, and what my 542software reads as akin to an XML DTD, if you're into that sort of 543structured data format lingo. 544 545The Stockholm markup tags that are parsed semantically by my software 546are as follows: 547 548\subsection{recognized \#=GF annotations} 549\begin{sreitems}{\monob{TC <f> <f>}} 550\item [\monob{ID <s>}] 551 Identifier. \monob{<s>} is a name for the alignment; 552 e.g. ``rrm''. One word. Unique in file. 553 554\item [\monob{AC <s>}] 555 Accession. \monob{<s>} is a unique accession number for the 556 alignment; e.g. 557 ``PF00001''. Used by the Pfam database, for instance. 558 Often a alphabetical prefix indicating the database 559 (e.g. ``PF'') followed by a unique numerical accession. 560 One word. Unique in file. 561 562\item [\monob{DE <s>}] 563 Description. \monob{<s>} is a free format line giving 564 a description of the alignment; e.g. 565 ``RNA recognition motif proteins''. One line. Unique in file. 566 567\item [\monob{AU <s>}] 568 Author. \monob{<s>} is a free format line listing the 569 authors responsible for an alignment; e.g. 570 ``Bateman A''. One line. Unique in file. 571 572\item [\monob{GA <f> <f>}] 573 Gathering thresholds. Two real numbers giving HMMER bit score 574 per-sequence and per-domain cutoffs used in gathering the 575 members of Pfam full alignments. 576 577\item [\monob{NC <f> <f>}] 578 Noise cutoffs. Two real numbers giving HMMER bit score 579 per-sequence and per-domain cutoffs, set according to the 580 highest scores seen for unrelated sequences when gathering 581 members of Pfam full alignments. 582 583\item [\monob{TC <f> <f>}] 584 Trusted cutoffs. Two real numbers giving HMMER bit score 585 per-sequence and per-domain cutoffs, set according to the 586 lowest scores seen for true homologous sequences that 587 were above the GA gathering thresholds, when gathering 588 members of Pfam full alignments. 589\end{sreitems} 590 591\subsection{recognized \#=GS annotations} 592 593\begin{sreitems}{\monob{WT <f>}} 594\item [\monob{WT <f>}] 595 Weight. \monob{<f>} is a positive real number giving the 596 relative weight for a sequence, usually used to compensate 597 for biased representation by downweighting similar sequences. 598 Usually the weights average 1.0 (e.g. the weights sum to 599 the number of sequences in the alignment) but this is not 600 required. Either every sequence must have a weight annotated, 601 or none of them can. 602 603\item [\monob{AC <s>}] 604 Accession. \monob{<s>} is a database accession number for 605 this sequence. (Compare the \mono{\#=GF AC} markup, which gives 606 an accession for the whole alignment.) One word. 607 608\item [\monob{DE <s>}] 609 Description. \monob{<s>} is one line giving a description for 610 this sequence. (Compare the \mono{\#=GF DE} markup, which gives 611 a description for the whole alignment.) 612\end{sreitems} 613 614 615\subsection{recognized \#=GC annotations} 616 617\begin{sreitems}{\monob{SA\_cons}} 618 619\item [\monob{RF}] 620 Reference line. Any character is accepted as a markup for a 621 column. The intent is to allow labeling the columns with some 622 sort of mark. 623 624\item [\monob{MM}] 625 Model mask line. An 'm' indicates that the column lies within a 626 masked range, so that \mono{hmmbuild} should produce emissions matching 627 the background for a match state corresponding to that alignment column. 628 Otherwise, a '.' is used. 629 630\item [\monob{SS\_cons}] 631 Secondary structure consensus. For protein alignments, 632 DSSP codes or gaps are accepted as markup: [HGIEBTSCX.-\_], where 633 H is alpha helix, G is 3/10-helix, I is p-helix, E is extended 634 strand, B is a residue in an isolated b-bridge, T is a turn, 635 S is a bend, C is a random coil or loop, and X is unknown 636 (for instance, a residue that was not resolved in a crystal 637 structure). 638 639\item [\monob{SA\_cons}] 640 Surface accessibility consensus. 0-9, gap symbols, or X are 641 accepted as markup. 0 means $<$10\% accessible residue surface 642 area, 1 means $<$20\%, 9 means $<$100\%, etc. X means unknown 643 structure. 644\end{sreitems} 645 646\subsection{recognized \#=GR annotations} 647\begin{sreitems}{\monob{SA}} 648\item [\monob{SS}] 649 Secondary structure consensus. See \mono{\#=GC SS\_cons} above. 650\item [\monob{SA}] 651 Surface accessibility consensus. See \mono{\#=GC SA\_cons} above. 652\item [\monob{PP}] Posterior probability for an aligned residue. This 653 represents the probability that this residue is assigned to the HMM 654 state corresponding to this alignment column, as opposed to some 655 other state. (Note that a residue can be confidently 656 \emph{unaligned}: a residue in an insert state or flanking N or C 657 state may have high posterior probability.) The posterior 658 probability is encoded as 11 possible characters \mono{0-9*+}: $0.0 659 \leq p < 0.05$ is coded as 0, $0.05 \leq p < 0.15$ is coded as 1, 660 (... and so on ...), $0.85 \leq p < 0.95$ is coded as 9, and $0.95 661 \leq p \leq 1.0$ is coded as '*'. Gap characters appear in the PP 662 line where no residue has been assigned. 663\end{sreitems} 664 665 666 667%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 668\newpage 669\section{A2M multiple alignment format} 670 671% Adapted from Easel documentation's format_a2m.tex by: 672% - including format_a2m.tex 673% - add the section header below 674% - revise first pp to be about HMMER. 675 676HMMER's Easel library routines are capable of writing alignments in UC 677Santa Cruz ``A2M'' (alignment to model) format, the native input 678format for the UCSC SAM profile HMM software package. 679 680To select A2M format, use the format code \mono{a2m}: for example, 681to reformat a Stockholm alignment to A2M: 682 683\user{esl-reformat a2m myali.sto} 684 685Easel currently does not read A2M format, and it currently only writes 686in what UCSC calls ``dotless'' A2M format. 687 688The most official documentation for A2M format appears to be at 689\url{http://compbio.soe.ucsc.edu/a2m-desc.html}. You may refer to that 690document if anything in the brief description below is unclear. 691 692\subsection{An example A2M file} 693 694This alignment: 695 696\begin{sreoutput} 697seq1 ACDEF...GHIKLMNPQTVWY 698seq2 ACDEF...GHIKLMNPQTVWY 699seq3 ---EFmnrGHIKLMNPQT--- 700\end{sreoutput} 701 702\noindent 703is encoded in A2M format as: 704 705\begin{sreoutput} 706>seq1 Sequence 1 description 707ACDEFGHIKLMNPQTVWY 708>seq2 Sequence 2 description 709ACDEFGHIKLMNPQTVWY 710>seq3 Sequence 3 description 711---EFmnrGHIKLMNPQT--- 712\end{sreoutput} 713 714A2M format looks a lot like aligned FASTA format. A crucial difference 715is that the aligned sequences in a ``dotless'' A2M file do not 716necessarily all have the same number of characters. The format 717distinguishes between ``consensus columns'' (where residues are in 718upper case and gaps are a dash, `-') and ``insert columns'' (where 719residues are in lower case and gaps are dots, `.', that aren't 720explicitly shown in the format -- hence ``dotless'' A2M). The position 721and number of gaps in insert columns (dots) is implicit in this 722representation. An advantage of this format is its compactness. 723 724This representation only works if all insertions relative to consensus 725are considered to be unaligned characters. That is how insertions are 726handled by profile HMM implementations like SAM and HMMER, and profile 727SCFG implementations like Infernal. 728 729Thus every sequence must have the same number of consensus columns 730(upper case letters plus `-' characters), and may have additional lower 731case letters for insertions. 732 733\subsection{Legal characters} 734 735A2M (and SAM) do not support some special characters such as `*' 736(not-a-residue) or `\mono{\~}' (missing data). Easel outputs these 737characters as gaps: either `-' in a consensus column, or nothing in an 738insert column. 739 740The SAM software parses only a subset of legal ambiguity codes for 741amino acids and nucleotides. For amino acids, it only reads \{BXZ\} in 742addition to the 20 standard one-letter codes. For nucleotides, it only 743reads \{NRY\} in addition to \{ACGTU\}. With one crucial exception, it 744treats all other letters as X or N. 745 746The crucial exception is `O'. SAM reads an `O' as the position of a 747``free insertion module'' (FIM), a concept specific to SAM-style 748profile HMMs. This has no impact on nucleic acid sequences, where `O' 749is not a legal character. But in amino acid sequences, `O' means 750pyrrolysine, one of the unusual genetically-encoded amino acids. This 751means that A2M format alignments must not contain pyrrolysine 752residues, lest they be read as FIMs. For this reason, Easel converts 753`O' residues to `X' when it writes an amino acid alignment in A2M 754format. 755 756\subsection{Determining consensus columns} 757 758Writing A2M format requires knowing which alignment columns are 759supposed to be considered consensus and which are considered 760inserts. If the alignment was produced by HMMER or Infernal, then the 761alignment has so-called ``reference annotation'' (what appears as a 762\mono{\#=GC RF} line in Stockholm format) marking the consensus 763columns. 764 765Often, an alignment has no reference annotation; for example, if it 766has been read from an alignment format that has no reference 767annotation line (only Stockholm and SELEX formats support reference 768annotation). In this case, Easel internally generates a ``reasonable'' 769guess at consensus columns, using essentially the same procedure that 770HMMER's \mono{hmmbuild} program uses by default: sequence fragments 771(sequences $<$50\% of the mean sequence length in the alignment 772overall) are ignored, and for the remaining sequences, any column 773containing $\geq$ 50\% residues is considered to be a consensus 774column. 775 776 777 778 779%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 780\newpage 781\section{hmmpgmd sequence database format} 782 783The hmmpgmd sequence database format closely resembles the 784FASTA format, with slight modification to support use within HMMER's 785\mono{hmmpgmd} daemon. 786 787 788The \mono{hmmpgmd} program enables search of one or more sequence 789databases (e.g. NR, SwissProt, UniProt) within a single instance, 790having loaded a single sequence file into memory. Because the set of 791sequences found within the various databases may overlap, the hmmpgmd 792format allows each sequence to be stored once, and includes a small piece of 793metadata that indicates, for each sequence, the list of source databases in 794which the sequence is found. When a search is performed in \mono{hmmpgmd}, a 795single target database is specified, and search is restricted to sequences 796belonging to that database. 797 798Furthermore, because a single sequence might be found multiple times 799within a single sequence database, \mono{hmmpgmd} is designed to compute 800E-values not just on the total number of non-redundant sequences 801within a database, but on the total number of sequences in the original 802(possibly redundant) database, provided those redundant counts are 803given in the hmmpgmd-formatted file. 804 805The \mono{hmmpgmd} file begins with a single line containing various counts 806describing the contents of the file, of the form 807 808\monob{\#res\_cnt seq\_cnt db\_cnt cnt\_1 fullcnt\_1 cnt\_2 fullcnt\_2 $\ldots$ date\_stamp} 809 810\subsection{Fields in header line} 811 812\begin{sreitems}{\monob{header}} 813 814\item [\monob{res\_cnt}] Number of residues in the sequence file. 815 816\item [\monob{seq\_cnt}] Number of sequences in the sequence file. 817 818\item [\monob{db\_cnt}] Number of databases in the sequence file. 819 820\item [\monob{cnt\_i}] Of the sequnces in the file, the number that belong to 821database \mono{i}. Note that if the file contains only a single 822database, this will match \mono{seq\_cnt}. 823 824\item [\monob{fullcnt\_i}] For database \mono{i}, the number of sequences 825that should be used in computing E-values. If redundant 826sequences were collapsed out of the original database, this may be 827larger than \mono{cnt\_i}. 828 829\end{sreitems} 830 831 832\subsection{FASTA-like sequence format} 833 834In the main body of the sequence file, database sequences are stored 835sequentially, with each entry consisting of a one-line FASTA-like 836header followed by one or more lines containing the amino acid sequence, 837like 838 839\begin{sreoutput} 840>1 100 841ACDEFGHIKLMNPQTVWY 842>2 010 843ACDKLMNPQTVWYEFGHI 844>3 111 845EFMNRGHIKLMNPQT 846\end{sreoutput} 847 848Note that the per-entry header line is made up of two parts. The first part 849is a numeric, incrementing sequence identifier (the i'th entry has the 850identifier \mono{i}). The second part is a bit string indicating database 851membership. In this example, sequence 1 is found in database 1, sequence 2 is 852found in database 2, and sequence 3 found in databases 1, 2, and 3. The number 853of bits in each bit string should match \mono{db\_cnt}. 854 855Because \mono{hmmpgmd} accepts only numeric sequence identifiers, it is 856necessary to keep track of the mapping between each numeric sequence identifier 857and the corresponding meta data (e.g. name, accession, description) external to 858the hmmpgmd-format file, and post-process \mono{hmmpgmd} seach results to 859apply those fields to the target sequence information. 860Depending on the size of the map list, this might be easily acheived with a 861simple perl array, or might require a more heavy-duty mapping backend such as 862mongodb (\url{http://www.mongodb.org}). 863 864 865\subsection{Creating a file in hmmpgmd format} 866 867The HMMER-Easel tool \mono{esl-reformat} is able to convert a file in unaligned 868fasta format into an hmmpgmd format file, such as 869 870\user{esl-reformat --id\_map mydb.hmmpgmd.map hmmpgmd mydb.fasta > mydb.hmmpgmd} 871 872The optional \mono{--id\_map} flag captures sequence name and description 873information into a simple tabular file, to be used for mapping those values 874back to \mono{hmmpgmd} query hits. 875 876 877 878%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 879\newpage 880\section{Score matrix files} 881 882Profile HMMs can be built from single sequences, not just from 883multiple sequence alignments. For example, the \mono{phmmer} and 884\mono{jackhmmer} search programs take a single sequence query as an 885input, and \mono{nhmmer} can as well. For single sequence queries, the 886probability parameters for residue alignments are derived from a 887substitution score matrix, such as BLOSUM62. Scores are converted to 888probabilities as described by Altschul (1991).\cite{Altschul91}. The 889scores can be arbitrary, but they must satisfy a couple of conditions 890so they can be interpreted as implicit log-odds probabilities: there 891must be at least one positive score, and the expected score (on 892nonhomologous alignments) must be negative. 893 894The default score matrix for protein alignments is BLOSUM62; for DNA, 895a matrix we call DNA1. Using the \mono{--mx} option (for programs that 896can use score matrices), you can choose instead one of several 897alternative protein score matrices: PAM30, PAM70, PAM120, PAM240, 898BLOSUM45, BLOSUM50, BLOSUM80, and BLOSUM90. For example, you could use 899\mono{--mx BLOSUM50}. 900 901The \mono{--mxfile} option allows you to provide a score matrix as a 902file. You can download many standard score matrice files from NCBI at 903\url{ftp://ftp.ncbi.nlm.nih.gov/blast/matrices/}. HMMER can read 904\emph{almost} any of these files, with one exception: because it 905requires the score matrix to be symmetrical (same number of residues 906in rows and columns), the NCBI \mono{NUC.4.4} matrix for DNA works, 907but the alternative short-form format \mono{NUC.4.2} does not 908work. The scores in these two files are identical, so just use 909\mono{NUC.4.4}, and files like it. 910 911Here's a simple example file: 912 913\begin{sreoutput} 914# My score matrix 915# 916 A T G C 917A 1 -3 -3 -3 918T -3 1 -3 -3 919G -3 -3 1 -3 920C -3 -3 -3 1 921\end{sreoutput} 922 923In more detail, the rules for the format are: 924 925\begin{itemize} 926 927\item Blank lines are ignored. 928 929\item A \mono{\#} indicates a comment. Anything after it is 930 ignored. Lines that start with \mono{\#} are ignored like blank 931 lines. 932 933\item The first data line is a header line, labeling each column with 934 $n$ single residue characters (case-insensitive). A nucleic matrix 935 has $4 \leq n \leq 16$: at least the four canonical residues 936 \mono{ACGT} (or \mono{U}, for you RNA zealots), and it may also 937 contain any or all ambiguity codes \mono{RYMKSWHBVDN*}. A protein 938 matrix has $20 \leq n 27$; it must contain at least the 20 canonical 939 residues \mono{ACDEFGHIKLMNPQRSTVWY}, and may contain any or all 940 ambiguity codes \mono{BJZOU}. These residues can be in any order, 941 but the rows must be in the same order as the columns. 942 943\item The next $n$ data lines are the rows of a square $n \times n$ 944 score matrix: 945 946\begin{itemize} 947\item Fields in the row are whitespace-delimited (tabs or 948 spaces). 949 950\item Optionally, each row can start with its single residue label. 951 Therefore each row has either $n+1$ fields if there is a leading 952 label, or $n$ fields if not. Rows and columns are in the same order. 953 954\item Each score is an integer. Plus signs are optional. 955\end{itemize} 956 957\item The file may not contain any additional lines (other than 958 comments or blank lines). 959\end{itemize} 960 961HMMER only uses the scores for canonical residues, and uses them to 962calculate probabilities. If scores for ambiguous residue codes are 963provided, HMMER ignores them; it has its own logic for dealing with 964the probability of ambiguous residues, given the probability of 965canonical residues. 966 967 968 969 970 971 972 973