1
2\chapter{Tutorial}
3\label{chapter:tutorial}
4\setcounter{footnote}{0}
5
6
7
8First let's do something useful and see it work, then we'll do a
9complete walkthrough.
10
11%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
12\section{Tap, tap; is this thing on?}
13
14In the \mono{tutorial} subdirectory, \mono{globins4.sto} is an example
15of a basic Stockholm alignment file. \mono{hmmbuild} builds a profile
16from an alignment:\marginnote{\mono{hmmbuild} needs to be installed in
17  your \mono{PATH} to be able to just type the \mono{hmmbuild} command
18  like this. Otherwise you need to give a path to where
19  \mono{hmmbuild} is, which might be \mono{src/hmmbuild}, if you're in
20  the HMMER top level distribution directory.  If you're new to how
21  paths to programs and files work on the command line, skip
22  ahead to \hyperref[section:running]{running a HMMER program} for
23  some more detail.}
24
25  \vspace{1ex}
26  \user{\% cd tutorial} \\
27  \user{\% hmmbuild globins4.hmm globins4.sto}
28  \vspace{1ex}
29
30\mono{hmmsearch} searches a profile against a sequence database.  The
31file \mono{tutorial/globins45.fa} is a small example of a FASTA file
32containing 45 globin sequences:
33
34  \vspace{1ex}
35  \user{\% hmmsearch globins4.hmm globins45.fa}
36  \vspace{1ex}
37
38This will print an output of the search results, with a table of
39significant hits followed by their alignments.
40
41That's all you need to start using HMMER for work. You can build
42a profile of your favorite sequence alignment if you have one; you can
43also obtain alignments and profiles from
44Pfam.\sidenote{\href{http://pfam.xfam.org}{pfam.xfam.org}} You can
45obtain real sequence databases to search from
46NCBI\sidenote{\href{ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz}{ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz}}
47or
48UniProt\sidenote{\href{http://www.uniprot.org/downloads}{www.uniprot.org/downloads}}.
49You don't have to worry much about sequence file formats. HMMER can
50read most common alignment and sequence file formats
51automatically.
52
53
54
55
56%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
57\section {The programs in HMMER}
58
59In rough order of importance, the 18 HMMER programs are:
60
61\vspace{1ex}
62\begin{tabular}{rp{4in}}
63\monob{hmmbuild}    & build profile from input multiple alignment\\
64\monob{hmmalign}    & make multiple sequence alignment using a profile\\
65\monob{hmmsearch}   & search profile against sequence database\\
66\monob{hmmscan}     & search sequence against profile database\\
67\monob{hmmpress}    & prepare profile database for \mono{hmmscan}\\
68\monob{phmmer}      & search single sequence against sequence database\\
69\monob{jackhmmer}   & iteratively search single sequence against database\\
70\monob{nhmmer}      & search DNA query against DNA sequence database\\
71\monob{nhmmscan}    & search DNA sequence against a DNA profile database\\
72\monob{hmmfetch}    & retrieve profile(s) from a profile file \\
73\monob{hmmstat}     & show summary statistics for a profile file \\
74\monob{hmmemit}     & generate (sample) sequences from a profile \\
75\monob{hmmlogo}     & produce a conservation logo graphic from a profile\\
76\monob{hmmconvert}  & convert between different profile file formats \\
77\monob{hmmpgmd}     & search daemon for the \mono{hmmer.org} website \\
78\monob{hmmpgmd\_shard}     & sharded search daemon for the \mono{hmmer.org} website \\
79\monob{makehmmerdb} & prepare an \mono{nhmmer} binary database \\
80\monob{hmmsim}      & collect score distributions on random sequences\\
81\monob{alimask}     & add column mask to a multiple sequence alignment \\
82\end{tabular}
83\vspace{1ex}
84
85
86The programs \mono{hmmbuild}, \mono{hmmsearch}, \mono{hmmscan}, and
87\mono{hmmalign} are the core functionality for protein domain analysis
88and annotation pipelines, for instance using profile databases like
89Pfam.
90
91The \mono{phmmer} and \mono{jackhmmer} programs search a single
92protein sequence against a protein sequence database, akin to BLASTP
93and PSI-BLAST.  (Internally, they just produce a profile from the
94query sequence, then run profile searches.)
95
96\mono{nhmmer} is the equivalent of \mono{hmmsearch} and \mono{phmmer},
97but is capable of searching long, chromosome-size target DNA
98sequences.  \mono{nhmmscan} is the equivalent of \mono{hmmscan},
99capable of using chromosome-size DNA sequences as a query into a
100profile database.
101\marginnote{\mono{nhmmer} and \mono{nhmmscan} were added in HMMER3.1.}
102
103
104
105
106%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
107\section{Running a HMMER program}
108\label{section:running}
109
110After you compile HMMER, these programs are in the \mono{src/}
111subdirectory of the top-level HMMER directory. If you run them without
112arguments, they will give you a brief help message.\marginnote{If you
113  run a HMMER program with a \mono{-h} option and no arguments, it
114  will give you a brief summary of its usage and options.}  In this
115chapter, I will assume that you have installed them (with \mono{make
116  install}, perhaps) so they're in your \mono{PATH}. So if you type
117\mono{hmmbuild} at the command line, you see:
118
119  \vspace{1ex}
120  \user{\% hmmbuild}
121  \vspace{-1ex}
122  \xsreoutput{inclusions/hmmbuild-noargs.out}
123  \vspace{-1ex}
124
125If you haven't installed the HMMER programs, you need to specify both
126the program name and a path to it. This tutorial chapter assumes
127that you're in the \mono{tutorial} subdirectory, where the tutorial
128example data files are. From \mono{tutorial} , the relative
129path to the compiled programs is \mono{../src/}. So instead of just
130typing \mono{hmmbuild}, you could do: \marginnote{The \mono{\%}
131  represents your command prompt. It's not part of what you type.}
132
133  \vspace{1ex}
134  \user{\% ../src/hmmbuild}
135  \vspace{1ex}
136
137Make sure you can run a HMMER program like this before moving on.  If
138you are stuck getting something like \mono{hmmbuild: command not
139  found}, the unix shell isn't finding the program in your \mono{PATH}
140or you're not giving a correct explicit path. Consult your shell's
141documentation, or a friendly local unix guru.
142
143\section{Files used in the tutorial}
144
145The subdirectory called \mono{tutorial} in the HMMER distribution
146contains the files used in the tutorial. If you haven't already,
147change into that directory now.
148
149  \vspace{1ex}
150  \user{\% cd tutorial}
151  \vspace{1ex}
152
153If you do a \mono{ls}, you'll see there are several example files in
154the \mono{tutorial} directory:
155
156\begin{sreitems}{\monob{dna\_target.fa}}
157\item[\monob{globins4.sto}] An example alignment of four globin sequences, in
158  Stockholm format. This alignment is a subset of a famous old
159  published structural alignment from Don Bashford.\cite{Bashford87}
160%
161\item[\monob{globins45.fa}] 45 unaligned globin sequences, in FASTA
162  format.
163%
164\item[\monob{HBB\_HUMAN}] A FASTA file containing the sequence of
165  human $\beta-$hemoglobin.
166%
167\item[\monob{fn3.sto}] An example alignment of fibronectin type III
168  domains. This is a Pfam fn3 seed alignment, in Stockholm format.
169%
170\item[\monob{Pkinase.sto}] The Pfam Pkinase seed alignment of protein
171  kinase domains.
172%
173\item[\monob{7LESS\_DROME}] A FASTA file containing the sequence of
174  the \emph{Drosophila} Sevenless protein, a receptor tyrosine kinase
175  whose extracellular region consists of an array of several
176  fibronectin type III domains.
177%
178\item[\monob{MADE1.sto}] An example DNA alignment, a subset
179of the Dfam seed alignment for the MADE1 transposable element family.
180%
181\item[\monob{dna\_target.fa}] A 330Kb sequence from human chromosome
182  1 in FASTA format, containing four MADE1 elements.
183\end{sreitems}
184
185
186%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
187\section{On sequence file formats, briefly}
188
189Input files for HMMER include unaligned sequence files and multiple
190sequence alignment files. Sequence files and alignment files can come
191in many different poorly standardized formats.
192
193A commonly used format for unaligned sequence files and sequence
194databases is FASTA format. Several of the tutorial files give you
195examples (\mono{globins45.fa}, for example).
196
197HMMER's preferred alignment file format is Stockholm format, which is
198also the format that Pfam alignments are in.  Stockholm allows
199detailed annotation of columns, residues, and sequences, and HMMER is
200built to use this annotation.\marginnote{Stockholm format was
201  developed jointly with us by the Pfam curation team.} Stockholm also
202allows a file to contain many alignments for many families (a multiple
203multiple alignment file?). \mono{globins4.sto} is a simple example,
204and \mono{fn3.sto} is an example with a lot of annotation markup.
205
206HMMER can read several other sequence and alignment file formats. By
207default, it autodetects what format an input file is in.  Accepted
208unaligned sequence file formats include \mono{fasta}, \mono{uniprot},
209\mono{genbank}, \mono{ddbj}, and \mono{embl}. Accepted multiple
210alignment file formats include \mono{stockholm}, \mono{afa}
211(i.e.\ aligned FASTA), \mono{clustal}, \mono{clustallike} (MUSCLE,
212etc.), \mono{a2m}, \mono{phylip} (interleaved), \mono{phylips}
213(sequential), \mono{psiblast}, and \mono{selex}. These formats are
214described in detail in a later chapter. Where applicable, the programs
215have command line options for asserting an input format and skipping
216autodetection, when you don't want to depend on it.
217
218HMMER also automatically detects whether a sequence or alignment file
219contains nucleotide or protein sequence data. Like format
220autodetection, alphabet autodetection sometimes doesn't work on weird
221files. Where applicable, the programs have options (usually
222\mono{-{}-rna}, \mono{-{}-dna}, \mono{-{}-amino}) for asserting the
223input alphabet type.
224
225For more information in HMMER input files and formats, see the formats
226chapter on page \pageref{chapter:formats}.
227
228
229%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
230\section{Searching a sequence database with a profile}
231
232Now let's go through the \mono{hmmbuild}/\mono{hmmsearch} example in a
233bit more detail.
234
235\subsection{Step 1: build a profile with \mono{hmmbuild}}
236
237The file \mono{globins4.sto} looks like this:
238
239   \xsreoutput{inclusions/globins4.sto}
240
241Most popular alignment formats are similar block-based formats. They
242can be turned into Stockholm format with a little editing or
243scripting. Don't forget the \mono{\# STOCKHOLM 1.0} line at the start
244of the alignment, nor the \mono{//} at the end.
245
246Stockholm alignments can be concatenated to create an alignment
247database flatfile containing many alignments. The Pfam database, for
248example, distributes a single file containing representative
249alignments for thousands of sequence families.\marginnote{The Easel
250  miniapps provide tools for manipulating alignment files, such as
251  \mono{esl-afetch} for extracting one alignment by name or accession
252  from a Pfam file.}
253
254You ran \mono{hmmbuild} to build a profile from that
255alignment:
256
257   \vspace{1ex}
258   \user{\% hmmbuild globins4.hmm globins4.sto}
259   \vspace{1ex}
260
261and you got some output that looks like:
262
263  \xsreoutput{inclusions/hmmbuild-globins.out}
264
265If your input file had contained more than one alignment, you'd get
266one line of output for each profile. A single \mono{hmmbuild} command
267suffices to turn a Pfam seed alignment flatfile (such as
268\mono{Pfam-A.seed}) into a profile flatfile (such as
269\mono{Pfam.hmm}).
270
271The information on these lines is almost self-explanatory. The
272\mono{globins4} alignment consisted of \BGLnseq{} sequences with
273\BGLalen{} aligned columns (\mono{alen}). HMMER turned it into a profile
274of \BGLmlen{} consensus positions (\mono{mlen}), which means it
275defined \BGLgaps{} gap-containing alignment columns to be insertions
276relative to consensus. The \BGLnseq{} sequences were only counted as
277an ``effective'' total sequence number (\mono{eff\_nseq}) of
278\BGLeffn{}, because they're fairly similar to each
279other.\sidenote{It's not unusual for this number to be less than 1.
280 I'll explain why later.}
281The profile ended up with a relative entropy per position
282(\mono{re/pos}; average score, or information content) of \BGLre{}
283bits.
284
285The new profile was saved to \mono{globins4.hmm}. If you were to look at
286this file (and you don't have to -- it's intended for HMMER's
287consumption, not yours), you'd see something like:
288
289   \xsreoutput{inclusions/hmmbuild-globins.out2}
290
291The HMMER ASCII save file format is defined on
292page~\pageref{section:savefiles}.
293
294
295
296\subsection{Step 2: search the sequence database with hmmsearch}
297
298Presumably you have a sequence database to search. Here we'll use a
299UniProtKB/Swiss-Prot FASTA format flatfile (not provided in the
300tutorial, because of its large size), \mono{uniprot\_sprot.fasta}.  If
301you don't have a sequence database handy, run your example search
302against \mono{globins45.fa} instead, which is a FASTA format file
303containing 45 globin sequences.
304
305\mono{hmmsearch} accepts any FASTA file as target database input. It
306also accepts EMBL/UniProtKB text format, and Genbank format. It will
307automatically determine what format your file is in; you don't have to
308say. An example of searching a sequence database with our
309\mono{globins4.hmm} profile would look like:
310
311   \vspace{1ex}
312   \user{\% hmmsearch globins4.hmm uniprot\_sprot.fasta > globins4.out}
313   \vspace{1ex}
314
315Have a look at the output, \mono{globins4.out}.  The first section is
316the \emph{header} that tells you what program you ran, on what, and
317with what options:
318
319   \xsreoutput{inclusions/hmmsearch-globins.out}
320
321The second section is the \emph{sequence top hits} list. It is a list
322of ranked top hits (sorted by E-value, most significant hit first),
323formatted in a BLAST-like style:
324
325   \xsreoutput{inclusions/hmmsearch-globins.out2}
326
327The last two columns, obviously, are the name of each target sequence
328and optional description. The description line usually gets truncated
329just to keep line lengths down to something reasonable. If you want
330the full description line, and don't mind long output line lengths,
331use the \mono{-{}-notextw} option.
332
333The most important number here is the first one, the \emph{sequence
334  E-value}. The E-value is the expected number of false positives
335(nonhomologous sequences) that scored this well or better.  The
336E-value is a measure of statistical significance. The lower the
337E-value, the more significant the hit.  I typically consider
338sequences with E-values $< 10^{-3}$ or so to be significant hits.
339
340The E-value is based on the \emph{sequence bit score}, the second
341number. This is the log-odds score for the complete sequence.  Some
342people like to see a bit score instead of an E-value, because the bit
343score doesn't depend on the size of the sequence database, only on the
344profile and the target sequence. The E-value does depend on the
345size of the database you search: if you search a database ten times
346larger, you get ten times the number of false positives.
347
348The next number, the \emph{bias}, is a correction term for biased
349sequence composition that was applied to the sequence bit
350score.\marginnote{The method that HMMER uses to compensate for biased
351  composition remains unpublished, shamefully.}
352For instance, for the top hit \mono{\SGUseqname{}}
353that scored \SGUbitscore{} bits, the bias of \SGUbias{} bits means
354that this sequence originally scored \SGUorigscore{} bits, which was adjusted by
355the slight \SGUbias{} bit biased-composition correction. The only time you
356really need to pay attention to the bias value is when it's large, on
357the same order of magnitude as the sequence bit score. Sometimes
358(rarely) the bias correction isn't aggressive enough, and allows a
359non-homolog to retain too much score. Conversely, the bias correction
360can be too aggressive sometimes, causing you to miss homologs. You can
361turn the biased-composition score correction off with the
362\mono{-{}-nonull2} option.\sidenote{And if you're doing that, you may also want
363to set \mono{-{}-nobias}, to turn off another biased composition step
364called the bias filter, which affects which sequences get scored at
365all.}
366
367The next three numbers are again an E-value, score, and bias, but only
368for the single best-scoring domain in the sequence, rather than the
369sum of all its identified domains. The rationale for this isn't
370apparent in the globin example, because all the globins in this
371example consist of only a single globin domain. So let's set up a
372second example, using a profile of a single domain that's commonly
373found in multiple domains in a single sequence. Build a fibronectin
374type III domain profile using the \mono{fn3.sto}
375alignment.\sidenote{This happens to be a Pfam seed alignment. It's a
376  good example of an alignment with complex Stockholm annotation.}
377Then use that profile to analyze the sequence \mono{7LESS\_DROME}, the
378\emph{Drosophila} Sevenless receptor tyrosine kinase:
379
380   \vspace{1ex}
381   \user{\% hmmbuild fn3.hmm fn3.sto} \\
382   \user{\% hmmsearch fn3.hmm 7LESS\_DROME > fn3.out}
383   \vspace{1ex}
384
385In \mono{fn3.out}, the sequence top hits list says:
386
387   \xsreoutput{inclusions/hmmsearch-fn3-sevenless.out}
388
389OK, now let's pick up the explanation where we left off. The total
390sequence score of \SFSbitscore{} sums up \emph{all} the fibronectin III domains
391that were found in the \mono{7LESS\_DROME} sequence. The ``single best
392dom'' score and E-value are the bit score and E-value as if the target
393sequence only contained the single best-scoring domain, without this
394summation.
395
396The idea is that we might be able to detect that a sequence is a
397member of a multidomain family because it contains multiple
398weakly-scoring domains, even if no single domain is solidly
399significant on its own.  On the other hand, if the target sequence
400happened to be a piece of junk consisting of a set of identical
401internal repeats, and one of those repeats accidentially gives a weak
402hit to the query profile, all the repeats will sum up and the sequence
403score might look ``significant''.\sidenote{Mathematically, alas, the
404  correct answer: the null hypothesis we're testing against is that
405  the sequence is a \emph{random} sequence of some base composition,
406  and a repetitive sequence isn't random.}
407
408So operationally:
409\begin{itemize}
410\item if both E-values are significant ($<<1$), the sequence is likely
411      to be homologous to your query.
412\item if the full sequence E-value is significant but the single best domain
413      E-value is not, the target sequence is probably a multidomain remote
414      homolog; but be wary, and watch out for the case where it's just a repetitive
415      sequence.
416\end{itemize}
417
418OK, the sharp eyed reader asks, if that's so, then why in the globin4
419output (all of which have only a single domain) do the full sequence
420bit scores and best single domain bit scores not exactly agree? For
421example, the top ranked hit, \mono{\SGUseqname{}},
422has a full sequence score of \SGUbitscore{} and a single
423best domain score of \SGUdombitscore{}. What's going on? What's going on is that
424the position and alignment of that domain is uncertain -- in this
425case, only very slightly so, but nonetheless uncertain. The full
426sequence score is summed over all possible alignments of the globin
427profile to the \mono{\SGUseqname{}} sequence. When HMMER identifies
428domains, it identifies what it calls an \textbf{envelope} bounding
429where the domain's alignment most probably lies.\marginnote{The
430  difference between envelopes and alignments comes up again below
431when we discuss the reported coordinates of domains and alignments in
432the next section of the output.} The ``single best dom'' score is
433calculated after the domain envelope has been defined, and the
434summation is restricted only to the ensemble of possible alignments
435that lie within the envelope. The fact that the two scores are
436slightly different is therefore telling you that there's a small
437amount of probability (uncertainty) that the domain lies somewhat
438outside the envelope bounds that HMMER has selected.
439
440The two columns headed \mono{\#dom} are two different estimates of
441the number of distinct domains that the target sequence contains. The
442first, the column marked \mono{exp}, is the \emph{expected} number of
443domains according to HMMER's statistical model. It's an average,
444calculated as a weighted marginal sum over all possible
445alignments. Because it's an average, it isn't necessarily a round
446integer. The second, the column marked \mono{N}, is the number of
447domains that HMMER's domain postprocessing and annotation pipeline
448finally decided to identify, annotate, and align in the target
449sequence. This is the number of alignments that will show up in the
450domain report later in the output file.
451
452These two numbers should be about the same. Rarely, you might see that
453they're very different, and this would usually be a sign that the
454target sequence is so highly repetitive that it's confused the HMMER
455domain postprocessor.\sidenote{Such sequences aren't likely to show up as
456significant homologs to any sensible query in the first place.}
457
458The sequence top hits output continues until it runs out of sequences
459to report. By default, the report includes all sequences with an
460E-value of 10.0 or less. It's showing you the top of the noise, so you
461can decide for yourself what's interesting or not: the default output
462is expected to contain about 10 false positives with E-values in the
463range of about 1-10.
464
465Then comes the third output section, which starts with
466
467   \xsreoutput{inclusions/hmmsearch-fn3-sevenless.out2}
468
469Now for each sequence in the top hits list, there will be a section
470containing a table of where HMMER thinks all the domains are,
471followed by the alignment inferred for each domain. Let's use the
472\mono{fn3} vs. \mono{7LESS\_DROME} example, because it contains lots
473of domains, and is more interesting in this respect than the globin4
474output.  The domain table for \mono{7LESS\_DROME} looks like:
475
476   \xsreoutput{inclusions/hmmsearch-fn3-sevenless.out3}
477
478Domains are reported in the order they appear in the sequence, not in
479order of their significance.
480
481The \mono{!} or \mono{?} symbol indicates whether this domain does or
482does not satisfy both per-sequence and per-domain inclusion
483thresholds. Inclusion thresholds are used to determine what matches
484should be considered to be ``true'', as opposed to reporting
485thresholds that determine what matches will be reported.\sidenote{The
486  default reporting threshold of 10.0 is chosen so you get to see
487  about $\sim$10 insignificant hits at the top of the noise, so you
488  can see what interesting sequences might be getting tickled by your
489  search.} By default, inclusion thresholds usually require a
490per-sequence E value of 0.01 or less and a per-domain conditional
491E-value of 0.01 or less (except jackhmmer, which requires a more
492stringent 0.001 for both), and reporting E-value thresholds are set to
49310.0.
494
495The bit score and bias values are as described above for sequence
496scores, but are the score of just one domain's envelope.
497
498The first of the two E-values is the \textbf{conditional
499  E-value}.\marginnote{The conditional E-value is a weird statistic,
500  and it's not clear I'm going to keep it.} It is an attempt to
501measure the statistical significance of each domain, \emph{given that
502  we've already decided that the target sequence overall is a true
503  homolog}.  It is the expected number of \emph{additional} domains
504we'd find with a domain score this big in the set of sequences
505reported in the top hits list, if those sequences consisted only of
506random nonhomologous sequence outside the best region that sufficed to
507define them as homologs.
508
509The second number is the \textbf{independent E-value}: the
510significance of the sequence in the \emph{whole} database search, if
511this were the only domain we had identified. It's exactly the same as
512the ``best 1 domain'' E-value in the sequence top hits list.
513
514The different between the two E-values is not apparent in the
515\mono{7LESS\_DROME} example because in both cases, the size of the
516search space as 1 sequence. There's a single sequence in the target
517sequence database (that's the size of the search space that the
518independent/best single domain E-value depends on). There's one
519sequence reported as a putative homolog in the sequence top hits list
520(that's the size of the search space that the conditional E-value
521depends on). A better example is to see what happens when we search
522UniProt (I used release \UNIrelease{}, which contains \UNInseq{} sequences)
523with the \mono{fn3} profile:
524
525   \vspace{1ex}
526   \user{\% hmmsearch fn3.hmm uniprot\_sprot.fasta}
527   \vspace{1ex}
528
529(If you don't have UniProt and can't run a command like this, don't
530worry about it - I'll show the relevant bits here.) Now the domain
531report for \mono{7LESS\_DROME} looks like:
532
533   \xsreoutput{inclusions/hmmsearch-fn3-uniprot.out}
534
535Notice that \emph{almost} everything's the same (it's the same target
536sequence, after all) \emph{except} for what happens with E-values. The
537independent E-value is calculated assuming a search space of all
538\UNInseq{} sequences. For example, look at the highest scoring domain
539(domain \SFSmaxdomu{} here; domain \SFSmaxdom{} above). When we only looked at a single
540sequence, its score of \SFSmaxsc{} bits has an E-value of \SFSievalue{}. When we
541search a database of \UNInseq{} sequences, a hit scoring \SFSmaxsc{} bits would
542be expected to happen \UNInseq{} times as often: \SFSievalue{} $\times$ \UNInseq{}
543$=$ \SFSuievalue{}. In this UniProt
544search, \SFSdomZ{} sequences were reported in the top hits list (with
545E-values $\leq 10$). If we were to assume that all \SFSdomZ{} are true
546homologs, x out the domain(s) that made us think that, and then went
547looking for \emph{additional} domains in those \SFSdomZ{} sequences, we'd be
548searching a smaller database of \SFSdomZ{} sequences: the expected number of
549times we'd see a hit of \SFSmaxsc{} bits or better is now \SFSievalue{} $\times$
550\SFSdomZ{} $=$ \SFSucevalue.\marginnote{If you calculate this
551  yourself, you may see some small discrepancies
552due to floating point roundoff.} That's where the conditional E-value (c-Evalue) is
553coming from.
554
555Notice that a couple of domains disappeared in the UniProt search,
556because now, in this larger search space size, they're not
557significant. Domain \SFSaidx{} (the one with the score of \SFSascore{}
558bits) got a conditional E-value of \SFSaevalue{} $\times$ \SFSdomZ{} =
559\SFSauevalue{}, and domain \SFSbidx{} (with a bit score of
560\SFSbscore{}) got a c-Evalue of \SFSbevalue{} $\times$ \SFSdomZ =
561\SFSbuevalue{}. These fail the default reporting threshold of
56210.0. Also, the domains with scores of \SFSainsig{} and \SFSbinsig{}
563shifted from being above to below the default inclusion thresholds, so
564now they're marked with \mono{?} instead of \mono{!}.
565
566Operationally:
567
568\begin{itemize}
569\item If the independent E-value is significant ($<<1$), that means
570that even this single domain \emph{by itself} is such a strong hit
571that it suffices to identify the sequence as a significant homolog
572with respect to the size of the entire original database search. You
573can be confident that this is a homologous domain.
574
575\item Once there's one or more high-scoring domains in the sequence
576already, sufficient to decide that the sequence contains homologs of
577your query, you can look (with some caution) at the conditional
578E-value to decide the statistical significance of additional
579weak-scoring domains.
580\end{itemize}
581
582In the UniProt output, for example, we'd be pretty sure of four of the
583domains (1, 4, 5, and maybe 6), each of which has a strong enough
584independent E-value to declare \mono{7LESS\_DROME} to be an
585fnIII-domain-containing protein. Domain 2 wouldn't be significant if
586it was all we saw in the sequence, but once we decide that
587\mono{7LESS\_DROME} contains fn3 domains on the basis of the other
588hits, its conditional E-value indicates that it's probably an fn3
589domain too. Domains 3 and 7 (the ones marked by \mono{?}) are too weak
590to be sure of, from this search alone, but would be something to pay
591attention to.
592
593The next four columns give the endpoints of the reported local
594alignment with respect to both the query profile (\mono{hmmfrom} and
595\mono{hmm to}) and the target sequence (\mono{alifrom} and \mono{ali
596  to}).
597
598It's not immediately easy to tell from the ``to'' coordinate whether
599the alignment ended internally in the query or target, versus ran all
600the way (as in a full-length global alignment) to the end(s). To make
601this more readily apparent, with each pair of query and target
602endpoint coordinates, there's also a little symbology: \mono{..}
603means both ends of the alignment ended internally, \mono{[]}
604means both ends of the alignment were full-length flush to the ends of
605the query or target, and \mono{[.} and \mono{.]} mean only the left or
606right end was flush/full length.
607
608The next two columns (\mono{envfrom} and \mono{env to}) define the
609\emph{envelope} of the domain's location on the target sequence.  The
610envelope is almost always a little wider than what HMMER chooses to
611show as a reasonably confident alignment. As mentioned earlier, the
612envelope represents a subsequence that encompasses most of the
613posterior probability for a given homologous domain, even if precise
614endpoints are only fuzzily inferrable.\marginnote{You'll notice that for higher
615scoring domains, the coordinates of the envelope and the inferred
616alignment will tend to be in tighter agreement, corresponding to
617sharper posterior probability defining the location of the homologous
618region.}
619
620Operationally, I recommend using the envelope coordinates to annotate
621domain locations on target sequences, not the alignment
622coordinates. Be aware that when two weaker-scoring domains are close
623to each other, envelope coordinates (and, rarely, sometimes even
624alignment coordinates) can and will overlap, corresponding to the
625overlapping uncertainty of where one domain ends and another begins.
626
627The last column is the average posterior probability of the aligned
628target sequence residues; effectively, the expected accuracy per
629residue of the alignment.
630
631For comparison, current UniProt consensus annotation of Sevenless
632shows seven domains:
633
634   \xsreoutput{inclusions/sevenless_domains.out}
635
636These agree (modulo differences in start/endpoints) with the seven
637strongest domains identified by HMMER. The weaker partial domain hits
638(at \SFSacoords{} and \SFSbcoords{}) are also plausible homologies, given that
639the extracellular domain of Sevenless is pretty much just a big array
640of $\sim$100aa fibronectin repeats.
641
642Under the domain table, an ``optimal posterior accuracy''
643alignment\cite{Holmes98} is computed within each domain's envelope
644and displayed. For example, (skipping domain 1 because it's weak and
645unconvincing), fibronectin III domain 2 in your \mono{7LESS\_DROME}
646output is shown as:
647
648   \xsreoutput{inclusions/hmmsearch-fn3-sevenless.out4}
649
650The initial header line starts with a \mono{==} as a little handle for
651a parsing script to grab hold of. I may put more information on that line
652eventually.
653
654If the profile had any consensus structure or reference line annotation
655that it inherited from your multiple alignment (\mono{\#=GC SS\_cons},
656\mono{\#=GC RF} annotation in Stockholm files), that information is
657simply regurgitated as \mono{CS} or \mono{RF} annotation lines
658here. The \mono{fn3} profile had a consensus structure annotation line.
659
660The line starting with \mono{fn3} is the consensus of the query
661profile: the residue with the highest emission probability at each
662position.\sidenote{For a single sequence query in \mono{phmmer}, the
663  consensus is the sequence itself. Oddly, this is not necessarily the
664  highest probability sequence; for example, under a BLOSUM62 scoring
665  matrix, where the query has an M, you are more likely to see an
666  aligned L, not an M.} Capital letters represent particularly
667highly conserved positions.\sidenote{For protein models, $\geq$ 50\%
668  emission probability; for DNA/DNA, $\geq$ 90\%.}
669Dots (\mono{.}) in this line indicate insertions
670in the target sequence with respect to the profile.
671
672The midline indicates matches between the query profile and target
673sequence. A letter indicates an exact match to the profile consensus.
674A \mono{+} indicates that this residue has a positive log-odds
675emission score, a ``conservative substitution'' given what the profile
676expects at that position.\sidenote{That is, the emission probability
677  $e(a)$ for this aligned residue $a$ is $> f_a$, its background
678  frequency: it's a likely residue, just not the most likely one.}
679
680The line starting with \mono{7LESS\_DROME} is the target sequence.
681Dashes (\mono{-}) in this line indicate deletions in the target
682sequence with respect to the profile.
683
684The bottom line represents the posterior
685probability (essentially the expected accuracy) of each aligned
686residue. A 0 means 0-5\%, 1 means 5-15\%, and so on; 9 means 85-95\%,
687and a \mono{*} means 95-100\% posterior probability. You can use these
688posterior probabilities to decide which parts of the alignment are
689well-determined or not. You'll often observe, for example, that
690expected alignment accuracy degrades around locations of insertion and
691deletion, which you'd intuitively expect.
692
693You'll also see expected alignment accuracy degrade at the ends of an
694alignment -- this is because ``alignment accuracy'' posterior
695probabilities currently not only includes whether the residue is
696aligned to one profile position versus others, but also confounded with
697whether a residue should be considered to be homologous (aligned to
698the profile somewhere) versus not homologous at all.\marginnote{It may
699make more sense to condition the posterior probabilities on the
700assumption that the residue is indeed homologous: given that, how
701likely is it that we've got it correctly aligned.}
702
703
704These domain table and per-domain alignment reports for each sequence
705then continue, for each sequence that was in the per-sequence top hits
706list.
707
708Finally, at the bottom of the file, you'll see some summary
709statistics.  For example, at the bottom of the globins search output,
710you'll find something like:
711
712   \xsreoutput{inclusions/hmmsearch-globins.out3}
713
714This gives you some idea of what's going on in HMMER's acceleration
715pipeline. You've got one query profile, and the database has \UNInseq{}
716target sequences. Each sequence goes through a gauntlet of three
717scoring algorithms called MSV, Viterbi, and Forward, in order of
718increasing sensitivity and increasing computational requirement.
719
720MSV (the ``Multi ungapped Segment Viterbi'' algorithm) essentially
721calculates the HMMER equivalent of BLAST's sum score -- an optimal sum
722of ungapped high-scoring alignment segments. Unlike BLAST, it does
723this calculation directly, without BLAST's word hit or hit extension
724step, using a SIMD vector-parallel algorithm. By default, HMMER is
725configured to allow sequences with a P-value of $\leq 0.02$ through
726the MSV score filter.\sidenote{Thus, if the database contained no homologs and
727P-values were accurately calculated, the highest scoring 2\% of the
728sequences will pass the filter.} Here, for this globin search, about
729\SGUmsvpass{}\% of the database got through the MSV filter.
730
731A quick check is then done to see if the target sequence is
732``obviously'' so biased in its composition that it's unlikely to be a
733true homolog. This is called the ``bias filter''. If you don't like it
734(it can occasionally be overaggressive) you can shut it off with the
735\mono{-{}-nobias} option. Here, \SGUbiaspass{} sequences pass through the bias
736filter.
737
738The Viterbi filter then calculates a gapped optimal alignment score.
739This is more sensitive than the MSV score, but the Viterbi filter is
740about four-fold slower than MSV. By default, HMMER lets sequences
741with a P-value of $\leq 0.001$ through this stage. Here, because
742there's about a thousand true globin homologs in this database, more
743than that gets through: \SGUvitpass{} sequences.
744
745Then the full Forward score is calculated, which sums over all
746possible alignments of the profile to the target sequence. The default
747allows sequences with a P-value of $\leq 10^{-5}$ through: \SGUfwdpass{}
748sequences pass.
749
750All sequences that make it through the three filters are then
751subjected to a full probabilistic analysis using the HMM
752Forward/Backward algorithms, first to identify domains and assign
753domain envelopes; then within each individual domain envelope,
754Forward/Backward calculations are done to determine posterior
755probabilities for each aligned residue, followed by optimal accuracy
756alignment. The results of this step are what you finally see on the
757output.
758
759Recall the difference between conditional and independent E-values,
760with their two different search space sizes. These search space sizes
761are reported in the statistics summary.
762
763Finally, it reports the speed of the search in units of Mc/sec
764(million dynamic programming cells per second), the CPU time, and the
765elapsed time. This search took about \SGUelapsed{} seconds of elapsed
766(wall clock) time.
767
768
769
770%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
771\section{Single sequence protein queries using phmmer}
772
773The \mono{phmmer} program is for searching a single sequence query
774against a sequence database, much as BLASTP or FASTA would
775do. \mono{phmmer} works essentially just like \mono{hmmsearch} does,
776except you provide a query sequence instead of a query profile.
777
778Internally, HMMER builds a profile from your single query
779sequence, using a simple position-independent scoring system (BLOSUM62
780scores converted to probabilities, plus a gap-open and gap-extend
781probability).
782
783The file \mono{tutorial/HBB\_HUMAN} is a FASTA file containing the
784human $\beta-$globin sequence as an example query. If you have a
785sequence database such as \mono{uniprot\_sprot.fasta}, make that your
786target database; otherwise, use \mono{tutorial/globins45.fa} as a
787small example:
788
789   \vspace{1ex}
790   \user{\% phmmer HBB\_HUMAN uniprot\_sprot.fasta}
791   \vspace{1ex}
792
793or
794
795   \vspace{1ex}
796   \user{\% phmmer HBB\_HUMAN globins45.fa}
797   \vspace{1ex}
798
799Everything about the output is essentially as previously described for
800\mono{hmmsearch}.
801
802
803
804
805%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
806\section{Iterative protein searches using jackhmmer}
807
808The \mono{jackhmmer} program is for searching a single sequence query
809iteratively against a sequence database, much as PSI-BLAST would do.
810
811The first round is identical to a \mono{phmmer} search. All the
812matches that pass the inclusion thresholds are put in a multiple
813alignment. In the second (and subsequent) rounds, a profile is made
814from these results, and the database is searched again with the
815profile.
816
817Iterations continue either until no new sequences are detected or the
818maximum number of iterations is reached. By default, the maximum
819number of iterations is 5; you can change this with the \mono{-N}
820option.
821
822Your original query sequence is always included in the multiple
823alignments, whether or not it appears in the database.\marginnote{If it
824  \emph{is} in the database, it will almost certainly be included in
825  the internal multiple alignment twice, once because it's the query
826  and once because it's a significant database match to itself. This
827  redundancy won't screw anything up, because sequences are
828  downweighted for redundancy anyway.}
829The ``consensus'' columns assigned to each multiple alignment always
830correspond exactly to the residues of your query, so the coordinate
831system of every profile is always the same as the numbering of
832residues in your query sequence, 1..L for a sequence of length L.
833
834Assuming you have UniProt or something like it handy, here's an
835example command line for a jackhmmer search:
836
837   \vspace{1ex}
838   \user{\% jackhmmer HBB\_HUMAN uniprot\_sprot.fasta}
839   \vspace{1ex}
840
841One difference from \mono{phmmer} output you'll notice is that
842\mono{jackhmmer} marks ``new'' sequences with a \mono{+} and ``lost''
843sequences with a \mono{-}. New sequences are sequences that pass the
844inclusion threshold(s) in this round, but didn't in the round before.
845Lost sequences are the opposite: sequences that passed the inclusion
846threshold(s) in the previous round, but have now fallen beneath (yet
847are still in the reported hits -- it's possible, though rare, to lose
848sequences utterly, if they no longer even pass the reporting
849threshold(s)).  In the first round, everything above the inclusion
850thresholds is marked with a \mono{+}, and nothing is marked with a
851\mono{-}. For example, the top of this output looks like:
852
853   \xsreoutput{inclusions/jackhmmer-hbb-uniprot.out}
854
855That continues until the inclusion threshold is reached, at which
856point you see a tagline ``inclusion threshold'' indicating where the
857threshold was set:
858
859   \xsreoutput{inclusions/jackhmmer-hbb-uniprot.out2}
860
861The domain output and search statistics are then shown just as in
862\mono{phmmer}. At the end of this first iteration, you'll see some
863output that starts with \mono{@@} (this is a simple tag that lets you
864search through the file to find the end of one iteration and the
865beginning of another):
866
867   \xsreoutput{inclusions/jackhmmer-hbb-uniprot.out3}
868
869This (obviously) is telling you that the new alignment contains \JHUninc{}
870sequences, your query plus \JHUnsig{} significant matches. For round two,
871it's built a new profile from this alignment. Now for round two, it
872fires off what's essentially an \mono{hmmsearch} of the target
873database with this new profile:
874
875   \xsreoutput{inclusions/jackhmmer-hbb-uniprot.out4}
876
877If you skim down through this output, you'll start seeing newly
878included sequences marked with \mono{+}'s, such as:
879
880   \xsreoutput{inclusions/jackhmmer-hbb-uniprot.out5}
881
882It's less usual to see sequences get lost (and marked with \mono{-}),
883but it happens.
884
885After round 2, more distant globin sequences have been found:
886
887   \xsreoutput{inclusions/jackhmmer-hbb-uniprot.out6}
888
889Because new sequences were included, it keeps going to round three,
890and then again to round four. After round four, the search ends
891because it didn't find any new hits; it considers the search to be
892``converged'' and it stops. (It would also eventually stop at a
893certain maximum number of iterations; the default maximum is 5, and
894you can set a different maximum with the \mono{-N} option.)  The end
895of the output is:
896
897   \xsreoutput{inclusions/jackhmmer-hbb-uniprot.out7}
898
899The \mono{//} marks the end of the results for one query. You could
900search with more than one query in your input query sequence
901file.
902
903There is an \mono{[ok]} at the end of the search output as a signal
904that the search successfully completed. This might be useful if you're
905automating lots of searches and you want to be sure that they worked.
906
907
908
909%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
910\section{Searching a profile database with a query sequence}
911
912Rather than searching a single profile against a collection of
913sequences, you might want to wish to annotate a single sequence by
914searching it against a collection of profiles of different domains.
915\mono{hmmscan} takes as input a query file containing one or more
916sequences to annotate, and a profile database to search them against.
917The profile database might be Pfam, SMART, or TIGRFams, for example, or
918another collection of your choice. \marginnote{Either \mono{hmmsearch}
919  or \mono{hmmscan} can compare a set of profiles to a set of
920  sequences. Due to disk access patterns of the two tools, it is
921  usually more efficient to use \mono{hmmsearch}, unless the number of
922  profiles greatly exceeds the number of sequences.}
923
924\subsection{Step 1: create a profile database file}
925
926A profile ``database'' file is just a concatenation of individual
927profile files. To create a database file, you can either build
928individual profile files and concatenate them, or you can concatenate
929Stockholm alignments and use \mono{hmmbuild} to build a profile database
930from them in one command.
931
932Let's create a tiny database called \mono{minifam} containing profiles
933of globin, fn3, and Pkinase (protein kinase) domains by concatenating
934profile files:
935
936   \vspace{1ex}
937   \user{\% hmmbuild globins4.hmm globins4.sto}\\
938   \user{\% hmmbuild fn3.hmm fn3.sto}\\
939   \user{\% hmmbuild Pkinase.hmm Pkinase.sto}\\
940   \user{\% cat globins4.hmm fn3.hmm Pkinase.hmm > minifam}
941   \vspace{1ex}
942
943We'll use \mono{minifam} for our examples in just a bit, but first a
944few words on other ways to build profile databases, especially big ones.
945
946The other way to do it is to start with an \emph{alignment} database
947flatfile -- a concatenation of many Stockholm files -- and use
948\mono{hmmbuild} to build a profile database file from it.  For
949example, you could obtain the big \mono{Pfam-A.seed} and/or
950\mono{Pfam-A.full} Stockholm-format alignment flatfiles from Pfam.
951\mono{hmmbuild} names each profile according to a \mono{\#=GF ID}
952annotation line in each Stockholm alignment. Normally the \mono{ID}
953line is optional in Stockholm format, but \mono{hmmbuild} has to name
954your new profile(s) somehow. For a single alignment, it will use your
955filename, or you can use the \mono{hmmbuild -n <name>} option to
956provide a name yourself. For an alignment database, the only way
957\mono{hmmbuild} can get a name for each alignment is from alignment
958annotation.\marginnote{For example, it won't work if you concatenate
959  \mono{globins4.sto} with other Stockholm files, because the simple
960  little \mono{globins4.sto} alignment doesn't have an \mono{ID}
961  line.}  Of alignment file formats, only Stockholm format provides a
962way to concatenate many alignments in the same file, with a name for
963each alignment. For example, from a Pfam seed alignment flatfile
964\mono{Pfam-A.seed}, you can do:
965
966   \vspace{1ex}
967   \user{\% hmmbuild Pfam-A.hmm Pfam-A.seed}
968   \vspace{1ex}
969
970This would probably take a couple of hours to build all 20,000 profiles
971or so in Pfam. To speed the database construction process up,
972\mono{hmmbuild} supports MPI parallelization. Running MPI programs can
973be a little arcane, so skip this bit if you're not in the mood.
974
975As far as HMMER's concerned, all you have to do is add \mono{-{}-mpi}
976to the command line for \mono{hmmbuild} to tell it to run in MPI
977master/worker mode across many cores and/or machines, assuming you've
978compiled support for MPI into it (see the installation instructions).
979You'll also need to know how to invoke an MPI job in your particular
980cluster environment, with your job scheduler and your MPI
981distribution.\sidenote{I can't really help you with this. Different sites
982have different cluster, scheduler, and MPI environments. Consult a
983local guru, as they say.} In general, you will launch the parallel
984\mono{hmmbuild} by using a command like \mono{mpirun} or \mono{srun}
985that manages the MPI environment for a specified number of processes.
986With the SGE (Sun Grid Engine) scheduler and Intel MPI, an example
987incantation for building \mono{Pfam.hmm} from \mono{Pfam-A.seed} in
988parallel across 128 processes:
989
990   \vspace{1ex}
991   \user{\% qsub -N hmmbuild -j y -o errors.out -b y -cwd -V -pe impi 128 \textbackslash}\\
992   \user{   'mpirun -np 128 ./hmmbuild --mpi Pfam.hmm Pfam-A.seed > hmmbuild.out'}
993   \vspace{1ex}
994
995or, an example SLURM incantation (on the \mono{eddy} group partition
996on our Harvard cluster):
997
998   \vspace{1ex}
999   \begin{fullwidth}
1000   \user{\indent \% sbatch -J hmmbuild -e hmmbuild.err -o hmmbuild.out -p eddy -n 128 -t 6-00:00 --mem-per-cpu=4000 \textbackslash}\\
1001   \user{\indent   --wrap="srun -n 128 --mpi=pmi2 ./hmmbuild --mpi Pfam-A.hmm Pfam-A.seed"}
1002   \end{fullwidth}
1003   \vspace{1em}
1004
1005This reduces the time to build all of Pfam to about 40 seconds.
1006
1007
1008
1009\subsection{Step 2: compress and index the flatfile with hmmpress}
1010
1011\mono{hmmscan} has to read a lot of profiles in a hurry, and
1012HMMER's text flatfiles are bulky. To accelerate this, \mono{hmmscan}
1013depends on binary compression and indexing of the flatfiles.  First
1014you compress and index your profile database with the \mono{hmmpress}
1015program:
1016
1017   \vspace{1ex}
1018   \user{\% hmmpress minifam}
1019   \vspace{1ex}
1020
1021This will produce:
1022
1023   \xsreoutput{inclusions/hmmpress-minifam.out}
1024
1025and you'll see these four new binary files in the directory.\sidenote{
1026Their format is ``proprietary'', an open source term of art that means
1027both ``I haven't found time to document them yet'' and ``I still might
1028decide to change them arbitrarily without telling you''.}
1029
1030\subsection{Step 3: search the profile database with hmmscan}
1031
1032Now we can analyze sequences using our profile database and
1033\mono{hmmscan}.
1034
1035For example, the receptor tyrosine kinase \mono{7LESS\_DROME} not only
1036has all those fibronectin type III domains on its extracellular side,
1037it's got a protein kinase domain on its intracellular side. Our
1038\mono{minifam} database has profiles of both \mono{fn3} and
1039\mono{Pkinase}, as well as the unrelated \mono{globins4} profile. So
1040what happens when we scan the \mono{7LESS\_DROME} sequence:
1041
1042   \vspace{1ex}
1043   \user{\% hmmscan minifam 7LESS\_DROME}
1044   \vspace{1ex}
1045
1046The header and the first section of the output will look like:
1047
1048   \xsreoutput{inclusions/hmmscan-minifam-sevenless.out}
1049
1050The output fields are in the same order and have the same meaning as
1051in \mono{hmmsearch}'s output.
1052
1053The size of the search space for \mono{hmmscan} is the number of
1054profiles in the profile database (here, 3; for a Pfam search, on the order
1055of 20000). In \mono{hmmsearch}, the size of the search space is the
1056number of sequences in the sequence database. This means that E-values
1057may differ even for the same individual profile vs. sequence
1058comparison, depending on how you do the search.
1059
1060For domain, there then follows a domain table and alignment output,
1061just as in \mono{hmmsearch}. The \mono{fn3} annotation, for example,
1062looks like:
1063
1064   \xsreoutput{inclusions/hmmscan-minifam-sevenless.out2}
1065
1066and an example alignment (of that second domain again):
1067
1068   \xsreoutput{inclusions/hmmscan-minifam-sevenless.out3}
1069
1070You'd probably expect that except for the E-values (which depend on
1071database search space sizes), you should get exactly the same scores,
1072domain number, domain coordinates, and alignment every time you do a
1073comparison of the same profile against the same sequence. Which is
1074actually the case! But in fact, under the hood, it's actually not so
1075obvious this should be, and HMMER is actually going out of its way to
1076make it so. HMMER uses stochastic sampling algorithms to infer some
1077parameters, and also to infer the exact domain number and domain
1078boundaries in certain difficult cases. If HMMER ran its stochastic
1079samples ``properly'', it would obtain different samples every time you
1080ran a program, and all of you would complain to me that HMMER was
1081weird and buggy because it gave different answers on the same
1082problem. To suppress run-to-run variation, HMMER seeds its random
1083number generator(s) \emph{identically} every time you do a sequence
1084comparison. If you're a stats expert, and you really want to see the
1085proper stochastic variation that results from sampling algorithms, you
1086can pass a command-line argument of \mono{-{}-seed 0} to programs that
1087have this property (hmmbuild and the four search programs).
1088
1089
1090
1091\subsection{Summary statistics for a profile database: hmmstat}
1092
1093Our \mono{minifam} profile ``database'' example only contains three
1094profiles, but real profile databases like Pfam can contain many
1095thousands. The \mono{hmmstat} program is a utility that summarizes the
1096content of a profile database. If you do:
1097
1098   \vspace{1ex}
1099   \user{\% hmmstat minifam}
1100   \vspace{1ex}
1101
1102you'll get:
1103
1104   \xsreoutput{inclusions/hmmstat-minifam.out}
1105
1106The output is one line per profile, numbered. Some of the fields are
1107more meaningful to you than others; some are sort of cryptic relics
1108of development that we haven't cleaned up yet:
1109
1110\begin{sreitems}{\monob{accession}}
1111
1112\item [\monob{idx}]       Number, in order in the database.
1113\item [\monob{name}]      Name of the profile.
1114\item [\monob{accession}] Accession (if present; else '-')
1115\item [\monob{nseq}]      Number of sequences in the alignment this
1116  profile was built from.
1117
1118\item [\monob{eff\_nseq}]  Effective sequence number.
1119   This was the ``effective'' number of independent sequences that
1120   \mono{hmmbuild's} default ``entropy weighting'' step decided on,
1121   given the phylogenetic similarity of the \mono{nseq} sequences in
1122   the input alignment.
1123
1124\item [\monob{M}] Length of the profile in consensus residues (match states).
1125
1126\item [\monob{relent}] Mean relative entropy of the match state
1127  emission probabilities, relative to default null background
1128  frequencies, in bits. This is the average bit score per aligned
1129  consensus residue. This quantity is the target of \mono{hmmbuild}'s
1130  entropy weighting procedure for determining \monob{eff\_nseq}.
1131
1132\item [\monob{info}] Mean information content per match state emission
1133  probability vector, in bits. Probably not useful to you. Information
1134  content is just a slightly different calculation from
1135  \monob{relent}.
1136
1137\item [\monob{p relE}] Mean positional relative entropy, in bits.
1138  Also probably not useful to you. This is an average relative entropy
1139  per position that takes into account the transition
1140  (insertion/deletion) probabilities.  It should be a more accurate
1141  estimation of the average bit score contributed per aligned model
1142  consensus position.
1143
1144\item [\monob{compKL}] Kullback-Leibler (KL) divergence from the
1145  average composition of the profile's consensus match states to the
1146  default background frequency distribution, in bits.  The higher this
1147  number, the more biased the residue composition of the profile
1148  is. Highly biased profiles may produce more false positives in
1149  searches, and can also slow the HMMER3 acceleration pipeline, by
1150  causing too many nonhomologous sequences to pass the filters.
1151
1152\end{sreitems}
1153
1154
1155%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1156\section{Creating multiple alignments with hmmalign}
1157
1158The file \mono{globins45.fa} is a FASTA file containing 45
1159unaligned globin sequences. To align all of these to the
1160\mono{globins4} profile and make a multiple sequence alignment:
1161
1162   \vspace{1ex}
1163   \user{\% hmmalign globins4.hmm globins45.fa}
1164   \vspace{1ex}
1165
1166The output of this is a Stockholm format multiple alignment file. The
1167first few lines of it look like:
1168
1169   \xsreoutput{inclusions/hmmalign-globins.out}
1170
1171and so on. (I've truncated long lines.)
1172
1173First thing to notice here is that \mono{hmmalign} uses both lower
1174case and upper case residues, and it uses two different characters for
1175gaps.  This is because there are two different kinds of columns:
1176``match'' columns in which residues are assigned to match states and
1177gaps are treated as deletions relative to consensus, and ``insert''
1178columns where residues are assigned to insert states and gaps in other
1179sequences are just padding for the alignment to accomodate those
1180insertions. In a match column, residues are upper case, and a '-'
1181character means a deletion relative to the consensus. In an insert
1182column, residues are lower case, and a '.' is padding.  A '-' deletion
1183has a cost: transition probabilities were assessed, penalizing the
1184transition into and out of a deletion. A '.' pad has no cost per se;
1185instead, the sequence(s) with insertions are paying transition
1186probabilities into and out of their inserted residue.
1187
1188This notation is only for your convenience in output files. You can
1189see the structure of the profile reflected in the pattern of
1190residues and gap characters \marginnote{By default, \mono{hmmalign}
1191  removes any columns that are all deletion characters, so the number
1192  of apparent match columns in a displayed alignment is $\leq$ the
1193  actual number of match states in the profile. To prevent this
1194  trimming and see columns for all match states, use the
1195  \mono{-{}-allcol} option. This can be helpful if you're writing a
1196  postprocessor that's trying to keep track of what columns are
1197  assigned to what match states in the profile.}.  In input files, in
1198most alignment formats\sidenote[][1ex]{A2M format is an important exception!} HMMER is
1199case-insensitive, and it does not distinguish between different gap
1200characters: '-' (dash), '.' (period), or even '\_' (underscore) are
1201accepted as gap characters.
1202
1203Important: insertions relative to a profile are
1204\emph{unaligned}. Suppose one sequence has an insertion of length 10
1205and another has an insertion of length 2 in the same place in the
1206profile. The alignment will have ten insert columns, to accomodate the
1207longest insertion.  The residues of the shorter insertion are thrown
1208down in an arbitrary order.\marginnote[2ex]{By arbitrary HMMER convention,
1209  the insertion is divided in half; half is left-justified, and the
1210  other half is right-justified, leaving '.' characters in the
1211  middle.}  Notice that in the previous paragraph I oh-so-carefully
1212said residues are ``assigned'' to a state, not ``aligned'' to a
1213state. For match states, assigned and aligned are the same thing: a
1214one-to-one correspondence between a residue and a consensus match
1215state in the profile. But there may be one \emph{or more} residues
1216assigned to the same insert state.
1217
1218Don't be confused by the unaligned nature of profile
1219insertions. You're sure to see cases where lower-case inserted
1220residues are ``obviously misaligned''.  This is just because HMMER
1221isn't trying to ``align'' them in the first place. It's assigning
1222them to unaligned insertions.
1223
1224Enough about the sequences in the alignment. Now, notice all those
1225\mono{PP} annotation lines. That's posterior probability annotation,
1226as in the single sequence alignments that \mono{hmmscan} and
1227\mono{hmmsearch} showed. This represents the confidence that each
1228residue is assigned where it should be.
1229
1230Again, that's ``assigned'', not ``aligned''. The posterior probability
1231assigned to an inserted residue is the probability that it is assigned
1232to the insert state corresponding to that column. Because the same
1233insert state might correspond to more than one column, the probability
1234on an insert residue is \emph{not} the probability that it belongs in
1235that particular column; again, if there's a choice of column for
1236putting an inserted residue, that choice is arbitrary.
1237
1238\mono{hmmalign} currently has a, um, feature that you may dislike.
1239Recall that HMMER only does local alignments. Here, we know that we've
1240provided full length globin sequences, and \mono{globins4} is a full
1241length globin profile. We'd probably like \mono{hmmalign} to produce a
1242global alignment. It can't currently do that. If it doesn't quite
1243manage to extend its local alignment to the full length of a target
1244globin sequence, you'll get a weird-looking effect, as the nonmatching
1245termini are pulled out to the left or right. For example, look at the
1246N-terminal \mono{g} in \mono{MYG\_HORSE} above. HMMER is about 80\%
1247confident that this residue is nonhomologous, though any sensible
1248person would align it into the first globin consensus column.
1249
1250Look at the end of that first block of Stockholm alignment, where you'll
1251see:
1252
1253   \xsreoutput{inclusions/hmmalign-globins.out}
1254
1255The \mono{\#=GC PP\_cons} line is Stockholm-format \emph{consensus
1256  posterior probability} annotation for the entire column. It's the
1257arithmetic mean of the per-residue posterior probabilities in that
1258column. This should prove useful in phylogenetic inference
1259applications, for example, where it's common to mask away
1260nonconfidently aligned columns of a multiple alignment. The
1261\mono{PP\_cons} line provides an objective measure of the confidence
1262assigned to each column.
1263
1264The \mono{\#=GC RF} line is Stockholm-format \emph{reference
1265  coordinate annotation}, with an x marking each column that the
1266profile considered to be consensus.
1267
1268
1269
1270%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1271\section{Searching DNA sequences}
1272
1273HMMER was originally developed for protein sequence analysis. The
1274\mono{hmmsearch} and \mono{hmmscan} programs assume that it's sensible
1275to ask if the entire target sequence is homologous (or not) to a query
1276profile. It makes sense to say ``this sequence is a probable protein
1277kinase'' because we find a protein kinase domain in it.  What if you
1278want to use a DNA profile to search a very long (chromosome-sized)
1279piece of DNA for homologous regions?  We might want to identify Alu
1280and L1 elements in human chromosome sequences, for example. It's not
1281super useful to see the 24 chromosomes ranked by E-values in
1282\mono{hmmsearch} output -- we're only interested in the element
1283locations. Also, if we can avoid having to align the entire target
1284chromosome sequence at once, we can scan the profile along the target
1285sequence in a much more memory-efficient manner than
1286\mono{hmmsearch}/\mono{hmmscan} would do.
1287
1288The \mono{nhmmer} and \mono{nhmmscan} programs are designed for
1289memory-efficient DNA profile searches of long DNA sequences. They were
1290developed in concert with the Dfam database (\url{dfam.org}), which
1291provides of alignments and profiles of many common DNA repeat elements
1292for several important genomes.  The alignment
1293\mono{tutorial/MADE1.sto} is a representative alignment of 100 human
1294MADE1 transposable elements, a subset of the Dfam MADE1
1295alignment. We'll use the MADE1 alignment to show how
1296\mono{nhmmer}/\mono{nhmmscan} work, which are similar to
1297\mono{hmmsearch}/\mono{hmmscan}.
1298
1299\subsection{Step 1: build a profile with hmmbuild}
1300
1301\mono{hmmbuild} works for both protein and DNA profiles, so:
1302
1303   \vspace{1ex}
1304   \user{\% hmmbuild MADE1.hmm MADE1.sto}
1305   \vspace{1ex}
1306
1307and you'll see some output that looks like:
1308
1309   \xsreoutput{inclusions/hmmbuild-made1.out}
1310
1311
1312Notice the new output column with the header ``W'', which is only
1313present when the input sequence alignment is DNA/RNA. This represents
1314an upper bound on the length at which nhmmer expects to find an
1315instance of the family\marginnote{W is based on position-specific insert
1316  rates: only $10^{-7}$ of all sequences generated from the profile
1317  are expected to be longer than W.}.  It is always larger than mlen,
1318though the ratio of mlen to W depends on the observed insert rate in
1319the seed alignment. This length is used deep in the acceleration
1320pipeline, and modest changes are not expected to impact results, but
1321larger values of W do lead to longer run time. The value can be
1322overridden with the \mono{-{}-w\_length} or \mono{-{}-w\_beta} flags, at
1323the risk of possibly missing instances of the family that happen to be
1324longer than W due to plentiful insertions.
1325
1326
1327
1328\subsection{Step 2: search the DNA sequence database with nhmmer}
1329
1330We'll use \mono{dna\_target.fa} as the target sequence
1331database. It is a FASTA format file containing one 330Kb long DNA
1332sequence extracted from human chromosome 1.
1333
1334\mono{nhmmer} accepts a target DNA sequence database in the same
1335formats as hmmsearch (typically FASTA). For the query, it accepts
1336either a profile file as produced above by hmmbuild, or a file
1337containing either one DNA sequence or an alignment of multiple DNA
1338sequences.
1339
1340If a sequence or alignment is used as query input, \mono{nhmmer}
1341internally produces the profile for that alignment\marginnote{Using
1342  default hmmbuild parameters; if you want more control, explicitly
1343  built the profile with hmmbuild.}, then searches with that
1344profile. The profile produced in this way is automatically saved to
1345disk, and the default file name is chosen by appending ``.hmm'' to the
1346name of the sequence file name. This can be overridden with the
1347\mono{-{}-hmmout} flag.
1348
1349To search \mono{dna\_target.fa} with our \mono{MADE1.hmm} profile:
1350
1351   \vspace{1ex}
1352   \user{\% nhmmer MADE1.hmm dna\_target.fa}
1353   \vspace{1ex}
1354
1355This output is largely similar to that of hmmsearch. The key
1356difference is that each hit is not to a full sequence in the target
1357database, but one local alignment of the profile to a subsequence of a
1358target database sequence.
1359
1360The first section is the \emph{header} that tells you what program you
1361ran, on what, and with what options, as above.
1362
1363The second section is the \emph{top hits} list. It is a list
1364of ranked top hits (sorted by E-value, most significant hit first),
1365formatted much like the \mono{hmmsearch} output \emph{top hits} list:
1366
1367   \xsreoutput{inclusions/nhmmer-made1.out}
1368
1369For each hit, the table shows its \emph{E-value}, \emph{bit score},
1370\emph{bias}, \emph{target sequence name} and \emph{target sequence
1371  description}, much like \mono{hmmsearch}.
1372
1373The ``start'' and ``end'' columns are the coordinates in the target
1374sequence where the hit is found. When the ``end'' is smaller than
1375``start'', this means the hit found on the reverse complement of the
1376target database sequence.\marginnote{\mono{nhmmer} automatically
1377  searches both strands.}
1378
1379For example, note that the top hits here are coming in overlapping
1380pairs, corresponding to the forward and reverse strands, like the hit
1381to \NMHafrom{}..\NMHato on the forward strand and a hit to
1382\NMHbfrom{}..\NMHbto{} on the reverse strand. This is because the
1383MADE1 DNA element is a near-perfect palindrome.\marginnote{DNA
1384  elements that have a unique orientation will only hit on one strand.
1385  \mono{nhmmer} treats the two strands independently. Palindromic
1386  elements will hit the same region on both strands and \mono{nhmmer}
1387  will not filter the overlapping hits.}
1388
1389Then comes the third output section, which starts with
1390
1391   \xsreoutput{inclusions/nhmmer-made1.out2}
1392
1393For each hit in the top hits list, there is a tabular line
1394providing detailed information about the hit, followed by the alignment
1395inferred for the hit. The first \mono{MADE1} hit
1396looks like:
1397
1398   \xsreoutput{inclusions/nhmmer-made1.out3}
1399
1400All these pieces of information are as described for \mono{hmmsearch},
1401plus a column for ``sq len'' that indicates the full length of the
1402target sequence.
1403
1404Under each one-line hit table is displayed the alignment inferred
1405between the profile and the hit envelope. For example, the top hit
1406from above is shown as:
1407
1408   \xsreoutput{inclusions/nhmmer-made1.out4}
1409
1410The alignment format is the same as for \mono{hmmsearch}.
1411
1412At the end of the output, you'll see summary statistics:
1413
1414   \xsreoutput{inclusions/nhmmer-made1.out5}
1415
1416This gives you some idea of what's going on in nhmmer's acceleration
1417pipeline. You've got one query profile, and \NMHnres{} residues were
1418searched (there are \NMHntop{} bases in the single sequence found in
1419the file; the search includes the reverse complement, doubling the
1420search space). The sequences in the database go through a gauntlet of
1421three scoring algorithms called SSV, Viterbi, and Forward, in order of
1422increasing sensitivity and increasing computational requirement.
1423
1424SSV (the ``Single ungapped Segment Viterbi'' algorithm) as used in
1425nhmmer is closely related to the MSV algorithm used in
1426\mono{hmmsearch}, in that it depends on ungapped alignment
1427segments. The difference lies in how those alignments are used. Using
1428MSV, a sequence is either rejected or accepted in its entirety. In the
1429scanning-SSV filter of \mono{nhmmer}, each sequence in the database is
1430scanned for high-scoring ungapped alignment segments, and a window
1431around each such segment is extracted (merging overlapping windows),
1432and passed on to the next stage. By default, nhmmer is configured to
1433allow sequence segments with a P-value of $\leq 0.02$ through the SSV
1434score filter.\sidenote{Thus, if the database contained no homologs and P-values
1435were accurately calculated, the highest scoring 2\% of the sequence
1436will pass the filter.} Here, \NMHnssv{} bases,
1437or about \NMHfracssv{}\% of the database, got through the SSV filter.
1438
1439The ``bias filter'' is then applied, as in \mono{hmmsearch}. Here,
1440\NMHnbias{} bases, roughly \NMHfracbias{}\% of the database
1441pass through the bias filter.
1442
1443The Viterbi filter then calculates a gapped optimal alignment score
1444for each window that survived the earlier stages. This score is a
1445closer approximation than the SSV score of the final score that the
1446window will achieve if it survives to final processing, but the
1447Viterbi filter is about four-fold slower than SSV. By default, nhmmer
1448lets windows with a P-value of $\leq 0.001$ through this stage. Here,
1449\NMHnvit{} bases, about \NMHfracvit{}\% of the database gets through.
1450
1451Then the full Forward score is calculated, which sums over all
1452possible alignments of the profile to the window. The default allows
1453windows with a P-value of $\leq 10^{-5}$ through; \NMHnfwd{} bases
1454passed.
1455
1456All windows that make it through these filters are then subjected to a
1457full probabilistic analysis using the HMM Forward/Backward algorithms,
1458to identify hit envelopes, then determine posterior probabilities for
1459each aligned residue, followed by optimal accuracy alignment. The
1460results of this step are what you finally see on the output. The final
1461number of hits and fractional coverage of the database is shown
1462next. This is typically smaller than the fraction of the database
1463passing the Forward filter, as hit identification typically trims
1464windows down to a smaller envelope.
1465
1466Finally, nhmmer reports the speed of the search in units of Mc/sec
1467(million dynamic programming cells per second), the CPU time, and the
1468elapsed time.
1469
1470\mono{nhmmscan} is to \mono{hmmscan} as \mono{nhmmer} is to
1471\mono{hmmsearch}.  There is not currently a iterative DNA search
1472analog to \mono{jackhmmer}.
1473
1474
1475
1476
1477
1478
1479
1480