1 2\chapter{Tutorial} 3\label{chapter:tutorial} 4\setcounter{footnote}{0} 5 6 7 8First let's do something useful and see it work, then we'll do a 9complete walkthrough. 10 11%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 12\section{Tap, tap; is this thing on?} 13 14In the \mono{tutorial} subdirectory, \mono{globins4.sto} is an example 15of a basic Stockholm alignment file. \mono{hmmbuild} builds a profile 16from an alignment:\marginnote{\mono{hmmbuild} needs to be installed in 17 your \mono{PATH} to be able to just type the \mono{hmmbuild} command 18 like this. Otherwise you need to give a path to where 19 \mono{hmmbuild} is, which might be \mono{src/hmmbuild}, if you're in 20 the HMMER top level distribution directory. If you're new to how 21 paths to programs and files work on the command line, skip 22 ahead to \hyperref[section:running]{running a HMMER program} for 23 some more detail.} 24 25 \vspace{1ex} 26 \user{\% cd tutorial} \\ 27 \user{\% hmmbuild globins4.hmm globins4.sto} 28 \vspace{1ex} 29 30\mono{hmmsearch} searches a profile against a sequence database. The 31file \mono{tutorial/globins45.fa} is a small example of a FASTA file 32containing 45 globin sequences: 33 34 \vspace{1ex} 35 \user{\% hmmsearch globins4.hmm globins45.fa} 36 \vspace{1ex} 37 38This will print an output of the search results, with a table of 39significant hits followed by their alignments. 40 41That's all you need to start using HMMER for work. You can build 42a profile of your favorite sequence alignment if you have one; you can 43also obtain alignments and profiles from 44Pfam.\sidenote{\href{http://pfam.xfam.org}{pfam.xfam.org}} You can 45obtain real sequence databases to search from 46NCBI\sidenote{\href{ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz}{ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz}} 47or 48UniProt\sidenote{\href{http://www.uniprot.org/downloads}{www.uniprot.org/downloads}}. 49You don't have to worry much about sequence file formats. HMMER can 50read most common alignment and sequence file formats 51automatically. 52 53 54 55 56%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 57\section {The programs in HMMER} 58 59In rough order of importance, the 18 HMMER programs are: 60 61\vspace{1ex} 62\begin{tabular}{rp{4in}} 63\monob{hmmbuild} & build profile from input multiple alignment\\ 64\monob{hmmalign} & make multiple sequence alignment using a profile\\ 65\monob{hmmsearch} & search profile against sequence database\\ 66\monob{hmmscan} & search sequence against profile database\\ 67\monob{hmmpress} & prepare profile database for \mono{hmmscan}\\ 68\monob{phmmer} & search single sequence against sequence database\\ 69\monob{jackhmmer} & iteratively search single sequence against database\\ 70\monob{nhmmer} & search DNA query against DNA sequence database\\ 71\monob{nhmmscan} & search DNA sequence against a DNA profile database\\ 72\monob{hmmfetch} & retrieve profile(s) from a profile file \\ 73\monob{hmmstat} & show summary statistics for a profile file \\ 74\monob{hmmemit} & generate (sample) sequences from a profile \\ 75\monob{hmmlogo} & produce a conservation logo graphic from a profile\\ 76\monob{hmmconvert} & convert between different profile file formats \\ 77\monob{hmmpgmd} & search daemon for the \mono{hmmer.org} website \\ 78\monob{hmmpgmd\_shard} & sharded search daemon for the \mono{hmmer.org} website \\ 79\monob{makehmmerdb} & prepare an \mono{nhmmer} binary database \\ 80\monob{hmmsim} & collect score distributions on random sequences\\ 81\monob{alimask} & add column mask to a multiple sequence alignment \\ 82\end{tabular} 83\vspace{1ex} 84 85 86The programs \mono{hmmbuild}, \mono{hmmsearch}, \mono{hmmscan}, and 87\mono{hmmalign} are the core functionality for protein domain analysis 88and annotation pipelines, for instance using profile databases like 89Pfam. 90 91The \mono{phmmer} and \mono{jackhmmer} programs search a single 92protein sequence against a protein sequence database, akin to BLASTP 93and PSI-BLAST. (Internally, they just produce a profile from the 94query sequence, then run profile searches.) 95 96\mono{nhmmer} is the equivalent of \mono{hmmsearch} and \mono{phmmer}, 97but is capable of searching long, chromosome-size target DNA 98sequences. \mono{nhmmscan} is the equivalent of \mono{hmmscan}, 99capable of using chromosome-size DNA sequences as a query into a 100profile database. 101\marginnote{\mono{nhmmer} and \mono{nhmmscan} were added in HMMER3.1.} 102 103 104 105 106%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 107\section{Running a HMMER program} 108\label{section:running} 109 110After you compile HMMER, these programs are in the \mono{src/} 111subdirectory of the top-level HMMER directory. If you run them without 112arguments, they will give you a brief help message.\marginnote{If you 113 run a HMMER program with a \mono{-h} option and no arguments, it 114 will give you a brief summary of its usage and options.} In this 115chapter, I will assume that you have installed them (with \mono{make 116 install}, perhaps) so they're in your \mono{PATH}. So if you type 117\mono{hmmbuild} at the command line, you see: 118 119 \vspace{1ex} 120 \user{\% hmmbuild} 121 \vspace{-1ex} 122 \xsreoutput{inclusions/hmmbuild-noargs.out} 123 \vspace{-1ex} 124 125If you haven't installed the HMMER programs, you need to specify both 126the program name and a path to it. This tutorial chapter assumes 127that you're in the \mono{tutorial} subdirectory, where the tutorial 128example data files are. From \mono{tutorial} , the relative 129path to the compiled programs is \mono{../src/}. So instead of just 130typing \mono{hmmbuild}, you could do: \marginnote{The \mono{\%} 131 represents your command prompt. It's not part of what you type.} 132 133 \vspace{1ex} 134 \user{\% ../src/hmmbuild} 135 \vspace{1ex} 136 137Make sure you can run a HMMER program like this before moving on. If 138you are stuck getting something like \mono{hmmbuild: command not 139 found}, the unix shell isn't finding the program in your \mono{PATH} 140or you're not giving a correct explicit path. Consult your shell's 141documentation, or a friendly local unix guru. 142 143\section{Files used in the tutorial} 144 145The subdirectory called \mono{tutorial} in the HMMER distribution 146contains the files used in the tutorial. If you haven't already, 147change into that directory now. 148 149 \vspace{1ex} 150 \user{\% cd tutorial} 151 \vspace{1ex} 152 153If you do a \mono{ls}, you'll see there are several example files in 154the \mono{tutorial} directory: 155 156\begin{sreitems}{\monob{dna\_target.fa}} 157\item[\monob{globins4.sto}] An example alignment of four globin sequences, in 158 Stockholm format. This alignment is a subset of a famous old 159 published structural alignment from Don Bashford.\cite{Bashford87} 160% 161\item[\monob{globins45.fa}] 45 unaligned globin sequences, in FASTA 162 format. 163% 164\item[\monob{HBB\_HUMAN}] A FASTA file containing the sequence of 165 human $\beta-$hemoglobin. 166% 167\item[\monob{fn3.sto}] An example alignment of fibronectin type III 168 domains. This is a Pfam fn3 seed alignment, in Stockholm format. 169% 170\item[\monob{Pkinase.sto}] The Pfam Pkinase seed alignment of protein 171 kinase domains. 172% 173\item[\monob{7LESS\_DROME}] A FASTA file containing the sequence of 174 the \emph{Drosophila} Sevenless protein, a receptor tyrosine kinase 175 whose extracellular region consists of an array of several 176 fibronectin type III domains. 177% 178\item[\monob{MADE1.sto}] An example DNA alignment, a subset 179of the Dfam seed alignment for the MADE1 transposable element family. 180% 181\item[\monob{dna\_target.fa}] A 330Kb sequence from human chromosome 182 1 in FASTA format, containing four MADE1 elements. 183\end{sreitems} 184 185 186%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 187\section{On sequence file formats, briefly} 188 189Input files for HMMER include unaligned sequence files and multiple 190sequence alignment files. Sequence files and alignment files can come 191in many different poorly standardized formats. 192 193A commonly used format for unaligned sequence files and sequence 194databases is FASTA format. Several of the tutorial files give you 195examples (\mono{globins45.fa}, for example). 196 197HMMER's preferred alignment file format is Stockholm format, which is 198also the format that Pfam alignments are in. Stockholm allows 199detailed annotation of columns, residues, and sequences, and HMMER is 200built to use this annotation.\marginnote{Stockholm format was 201 developed jointly with us by the Pfam curation team.} Stockholm also 202allows a file to contain many alignments for many families (a multiple 203multiple alignment file?). \mono{globins4.sto} is a simple example, 204and \mono{fn3.sto} is an example with a lot of annotation markup. 205 206HMMER can read several other sequence and alignment file formats. By 207default, it autodetects what format an input file is in. Accepted 208unaligned sequence file formats include \mono{fasta}, \mono{uniprot}, 209\mono{genbank}, \mono{ddbj}, and \mono{embl}. Accepted multiple 210alignment file formats include \mono{stockholm}, \mono{afa} 211(i.e.\ aligned FASTA), \mono{clustal}, \mono{clustallike} (MUSCLE, 212etc.), \mono{a2m}, \mono{phylip} (interleaved), \mono{phylips} 213(sequential), \mono{psiblast}, and \mono{selex}. These formats are 214described in detail in a later chapter. Where applicable, the programs 215have command line options for asserting an input format and skipping 216autodetection, when you don't want to depend on it. 217 218HMMER also automatically detects whether a sequence or alignment file 219contains nucleotide or protein sequence data. Like format 220autodetection, alphabet autodetection sometimes doesn't work on weird 221files. Where applicable, the programs have options (usually 222\mono{-{}-rna}, \mono{-{}-dna}, \mono{-{}-amino}) for asserting the 223input alphabet type. 224 225For more information in HMMER input files and formats, see the formats 226chapter on page \pageref{chapter:formats}. 227 228 229%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 230\section{Searching a sequence database with a profile} 231 232Now let's go through the \mono{hmmbuild}/\mono{hmmsearch} example in a 233bit more detail. 234 235\subsection{Step 1: build a profile with \mono{hmmbuild}} 236 237The file \mono{globins4.sto} looks like this: 238 239 \xsreoutput{inclusions/globins4.sto} 240 241Most popular alignment formats are similar block-based formats. They 242can be turned into Stockholm format with a little editing or 243scripting. Don't forget the \mono{\# STOCKHOLM 1.0} line at the start 244of the alignment, nor the \mono{//} at the end. 245 246Stockholm alignments can be concatenated to create an alignment 247database flatfile containing many alignments. The Pfam database, for 248example, distributes a single file containing representative 249alignments for thousands of sequence families.\marginnote{The Easel 250 miniapps provide tools for manipulating alignment files, such as 251 \mono{esl-afetch} for extracting one alignment by name or accession 252 from a Pfam file.} 253 254You ran \mono{hmmbuild} to build a profile from that 255alignment: 256 257 \vspace{1ex} 258 \user{\% hmmbuild globins4.hmm globins4.sto} 259 \vspace{1ex} 260 261and you got some output that looks like: 262 263 \xsreoutput{inclusions/hmmbuild-globins.out} 264 265If your input file had contained more than one alignment, you'd get 266one line of output for each profile. A single \mono{hmmbuild} command 267suffices to turn a Pfam seed alignment flatfile (such as 268\mono{Pfam-A.seed}) into a profile flatfile (such as 269\mono{Pfam.hmm}). 270 271The information on these lines is almost self-explanatory. The 272\mono{globins4} alignment consisted of \BGLnseq{} sequences with 273\BGLalen{} aligned columns (\mono{alen}). HMMER turned it into a profile 274of \BGLmlen{} consensus positions (\mono{mlen}), which means it 275defined \BGLgaps{} gap-containing alignment columns to be insertions 276relative to consensus. The \BGLnseq{} sequences were only counted as 277an ``effective'' total sequence number (\mono{eff\_nseq}) of 278\BGLeffn{}, because they're fairly similar to each 279other.\sidenote{It's not unusual for this number to be less than 1. 280 I'll explain why later.} 281The profile ended up with a relative entropy per position 282(\mono{re/pos}; average score, or information content) of \BGLre{} 283bits. 284 285The new profile was saved to \mono{globins4.hmm}. If you were to look at 286this file (and you don't have to -- it's intended for HMMER's 287consumption, not yours), you'd see something like: 288 289 \xsreoutput{inclusions/hmmbuild-globins.out2} 290 291The HMMER ASCII save file format is defined on 292page~\pageref{section:savefiles}. 293 294 295 296\subsection{Step 2: search the sequence database with hmmsearch} 297 298Presumably you have a sequence database to search. Here we'll use a 299UniProtKB/Swiss-Prot FASTA format flatfile (not provided in the 300tutorial, because of its large size), \mono{uniprot\_sprot.fasta}. If 301you don't have a sequence database handy, run your example search 302against \mono{globins45.fa} instead, which is a FASTA format file 303containing 45 globin sequences. 304 305\mono{hmmsearch} accepts any FASTA file as target database input. It 306also accepts EMBL/UniProtKB text format, and Genbank format. It will 307automatically determine what format your file is in; you don't have to 308say. An example of searching a sequence database with our 309\mono{globins4.hmm} profile would look like: 310 311 \vspace{1ex} 312 \user{\% hmmsearch globins4.hmm uniprot\_sprot.fasta > globins4.out} 313 \vspace{1ex} 314 315Have a look at the output, \mono{globins4.out}. The first section is 316the \emph{header} that tells you what program you ran, on what, and 317with what options: 318 319 \xsreoutput{inclusions/hmmsearch-globins.out} 320 321The second section is the \emph{sequence top hits} list. It is a list 322of ranked top hits (sorted by E-value, most significant hit first), 323formatted in a BLAST-like style: 324 325 \xsreoutput{inclusions/hmmsearch-globins.out2} 326 327The last two columns, obviously, are the name of each target sequence 328and optional description. The description line usually gets truncated 329just to keep line lengths down to something reasonable. If you want 330the full description line, and don't mind long output line lengths, 331use the \mono{-{}-notextw} option. 332 333The most important number here is the first one, the \emph{sequence 334 E-value}. The E-value is the expected number of false positives 335(nonhomologous sequences) that scored this well or better. The 336E-value is a measure of statistical significance. The lower the 337E-value, the more significant the hit. I typically consider 338sequences with E-values $< 10^{-3}$ or so to be significant hits. 339 340The E-value is based on the \emph{sequence bit score}, the second 341number. This is the log-odds score for the complete sequence. Some 342people like to see a bit score instead of an E-value, because the bit 343score doesn't depend on the size of the sequence database, only on the 344profile and the target sequence. The E-value does depend on the 345size of the database you search: if you search a database ten times 346larger, you get ten times the number of false positives. 347 348The next number, the \emph{bias}, is a correction term for biased 349sequence composition that was applied to the sequence bit 350score.\marginnote{The method that HMMER uses to compensate for biased 351 composition remains unpublished, shamefully.} 352For instance, for the top hit \mono{\SGUseqname{}} 353that scored \SGUbitscore{} bits, the bias of \SGUbias{} bits means 354that this sequence originally scored \SGUorigscore{} bits, which was adjusted by 355the slight \SGUbias{} bit biased-composition correction. The only time you 356really need to pay attention to the bias value is when it's large, on 357the same order of magnitude as the sequence bit score. Sometimes 358(rarely) the bias correction isn't aggressive enough, and allows a 359non-homolog to retain too much score. Conversely, the bias correction 360can be too aggressive sometimes, causing you to miss homologs. You can 361turn the biased-composition score correction off with the 362\mono{-{}-nonull2} option.\sidenote{And if you're doing that, you may also want 363to set \mono{-{}-nobias}, to turn off another biased composition step 364called the bias filter, which affects which sequences get scored at 365all.} 366 367The next three numbers are again an E-value, score, and bias, but only 368for the single best-scoring domain in the sequence, rather than the 369sum of all its identified domains. The rationale for this isn't 370apparent in the globin example, because all the globins in this 371example consist of only a single globin domain. So let's set up a 372second example, using a profile of a single domain that's commonly 373found in multiple domains in a single sequence. Build a fibronectin 374type III domain profile using the \mono{fn3.sto} 375alignment.\sidenote{This happens to be a Pfam seed alignment. It's a 376 good example of an alignment with complex Stockholm annotation.} 377Then use that profile to analyze the sequence \mono{7LESS\_DROME}, the 378\emph{Drosophila} Sevenless receptor tyrosine kinase: 379 380 \vspace{1ex} 381 \user{\% hmmbuild fn3.hmm fn3.sto} \\ 382 \user{\% hmmsearch fn3.hmm 7LESS\_DROME > fn3.out} 383 \vspace{1ex} 384 385In \mono{fn3.out}, the sequence top hits list says: 386 387 \xsreoutput{inclusions/hmmsearch-fn3-sevenless.out} 388 389OK, now let's pick up the explanation where we left off. The total 390sequence score of \SFSbitscore{} sums up \emph{all} the fibronectin III domains 391that were found in the \mono{7LESS\_DROME} sequence. The ``single best 392dom'' score and E-value are the bit score and E-value as if the target 393sequence only contained the single best-scoring domain, without this 394summation. 395 396The idea is that we might be able to detect that a sequence is a 397member of a multidomain family because it contains multiple 398weakly-scoring domains, even if no single domain is solidly 399significant on its own. On the other hand, if the target sequence 400happened to be a piece of junk consisting of a set of identical 401internal repeats, and one of those repeats accidentially gives a weak 402hit to the query profile, all the repeats will sum up and the sequence 403score might look ``significant''.\sidenote{Mathematically, alas, the 404 correct answer: the null hypothesis we're testing against is that 405 the sequence is a \emph{random} sequence of some base composition, 406 and a repetitive sequence isn't random.} 407 408So operationally: 409\begin{itemize} 410\item if both E-values are significant ($<<1$), the sequence is likely 411 to be homologous to your query. 412\item if the full sequence E-value is significant but the single best domain 413 E-value is not, the target sequence is probably a multidomain remote 414 homolog; but be wary, and watch out for the case where it's just a repetitive 415 sequence. 416\end{itemize} 417 418OK, the sharp eyed reader asks, if that's so, then why in the globin4 419output (all of which have only a single domain) do the full sequence 420bit scores and best single domain bit scores not exactly agree? For 421example, the top ranked hit, \mono{\SGUseqname{}}, 422has a full sequence score of \SGUbitscore{} and a single 423best domain score of \SGUdombitscore{}. What's going on? What's going on is that 424the position and alignment of that domain is uncertain -- in this 425case, only very slightly so, but nonetheless uncertain. The full 426sequence score is summed over all possible alignments of the globin 427profile to the \mono{\SGUseqname{}} sequence. When HMMER identifies 428domains, it identifies what it calls an \textbf{envelope} bounding 429where the domain's alignment most probably lies.\marginnote{The 430 difference between envelopes and alignments comes up again below 431when we discuss the reported coordinates of domains and alignments in 432the next section of the output.} The ``single best dom'' score is 433calculated after the domain envelope has been defined, and the 434summation is restricted only to the ensemble of possible alignments 435that lie within the envelope. The fact that the two scores are 436slightly different is therefore telling you that there's a small 437amount of probability (uncertainty) that the domain lies somewhat 438outside the envelope bounds that HMMER has selected. 439 440The two columns headed \mono{\#dom} are two different estimates of 441the number of distinct domains that the target sequence contains. The 442first, the column marked \mono{exp}, is the \emph{expected} number of 443domains according to HMMER's statistical model. It's an average, 444calculated as a weighted marginal sum over all possible 445alignments. Because it's an average, it isn't necessarily a round 446integer. The second, the column marked \mono{N}, is the number of 447domains that HMMER's domain postprocessing and annotation pipeline 448finally decided to identify, annotate, and align in the target 449sequence. This is the number of alignments that will show up in the 450domain report later in the output file. 451 452These two numbers should be about the same. Rarely, you might see that 453they're very different, and this would usually be a sign that the 454target sequence is so highly repetitive that it's confused the HMMER 455domain postprocessor.\sidenote{Such sequences aren't likely to show up as 456significant homologs to any sensible query in the first place.} 457 458The sequence top hits output continues until it runs out of sequences 459to report. By default, the report includes all sequences with an 460E-value of 10.0 or less. It's showing you the top of the noise, so you 461can decide for yourself what's interesting or not: the default output 462is expected to contain about 10 false positives with E-values in the 463range of about 1-10. 464 465Then comes the third output section, which starts with 466 467 \xsreoutput{inclusions/hmmsearch-fn3-sevenless.out2} 468 469Now for each sequence in the top hits list, there will be a section 470containing a table of where HMMER thinks all the domains are, 471followed by the alignment inferred for each domain. Let's use the 472\mono{fn3} vs. \mono{7LESS\_DROME} example, because it contains lots 473of domains, and is more interesting in this respect than the globin4 474output. The domain table for \mono{7LESS\_DROME} looks like: 475 476 \xsreoutput{inclusions/hmmsearch-fn3-sevenless.out3} 477 478Domains are reported in the order they appear in the sequence, not in 479order of their significance. 480 481The \mono{!} or \mono{?} symbol indicates whether this domain does or 482does not satisfy both per-sequence and per-domain inclusion 483thresholds. Inclusion thresholds are used to determine what matches 484should be considered to be ``true'', as opposed to reporting 485thresholds that determine what matches will be reported.\sidenote{The 486 default reporting threshold of 10.0 is chosen so you get to see 487 about $\sim$10 insignificant hits at the top of the noise, so you 488 can see what interesting sequences might be getting tickled by your 489 search.} By default, inclusion thresholds usually require a 490per-sequence E value of 0.01 or less and a per-domain conditional 491E-value of 0.01 or less (except jackhmmer, which requires a more 492stringent 0.001 for both), and reporting E-value thresholds are set to 49310.0. 494 495The bit score and bias values are as described above for sequence 496scores, but are the score of just one domain's envelope. 497 498The first of the two E-values is the \textbf{conditional 499 E-value}.\marginnote{The conditional E-value is a weird statistic, 500 and it's not clear I'm going to keep it.} It is an attempt to 501measure the statistical significance of each domain, \emph{given that 502 we've already decided that the target sequence overall is a true 503 homolog}. It is the expected number of \emph{additional} domains 504we'd find with a domain score this big in the set of sequences 505reported in the top hits list, if those sequences consisted only of 506random nonhomologous sequence outside the best region that sufficed to 507define them as homologs. 508 509The second number is the \textbf{independent E-value}: the 510significance of the sequence in the \emph{whole} database search, if 511this were the only domain we had identified. It's exactly the same as 512the ``best 1 domain'' E-value in the sequence top hits list. 513 514The different between the two E-values is not apparent in the 515\mono{7LESS\_DROME} example because in both cases, the size of the 516search space as 1 sequence. There's a single sequence in the target 517sequence database (that's the size of the search space that the 518independent/best single domain E-value depends on). There's one 519sequence reported as a putative homolog in the sequence top hits list 520(that's the size of the search space that the conditional E-value 521depends on). A better example is to see what happens when we search 522UniProt (I used release \UNIrelease{}, which contains \UNInseq{} sequences) 523with the \mono{fn3} profile: 524 525 \vspace{1ex} 526 \user{\% hmmsearch fn3.hmm uniprot\_sprot.fasta} 527 \vspace{1ex} 528 529(If you don't have UniProt and can't run a command like this, don't 530worry about it - I'll show the relevant bits here.) Now the domain 531report for \mono{7LESS\_DROME} looks like: 532 533 \xsreoutput{inclusions/hmmsearch-fn3-uniprot.out} 534 535Notice that \emph{almost} everything's the same (it's the same target 536sequence, after all) \emph{except} for what happens with E-values. The 537independent E-value is calculated assuming a search space of all 538\UNInseq{} sequences. For example, look at the highest scoring domain 539(domain \SFSmaxdomu{} here; domain \SFSmaxdom{} above). When we only looked at a single 540sequence, its score of \SFSmaxsc{} bits has an E-value of \SFSievalue{}. When we 541search a database of \UNInseq{} sequences, a hit scoring \SFSmaxsc{} bits would 542be expected to happen \UNInseq{} times as often: \SFSievalue{} $\times$ \UNInseq{} 543$=$ \SFSuievalue{}. In this UniProt 544search, \SFSdomZ{} sequences were reported in the top hits list (with 545E-values $\leq 10$). If we were to assume that all \SFSdomZ{} are true 546homologs, x out the domain(s) that made us think that, and then went 547looking for \emph{additional} domains in those \SFSdomZ{} sequences, we'd be 548searching a smaller database of \SFSdomZ{} sequences: the expected number of 549times we'd see a hit of \SFSmaxsc{} bits or better is now \SFSievalue{} $\times$ 550\SFSdomZ{} $=$ \SFSucevalue.\marginnote{If you calculate this 551 yourself, you may see some small discrepancies 552due to floating point roundoff.} That's where the conditional E-value (c-Evalue) is 553coming from. 554 555Notice that a couple of domains disappeared in the UniProt search, 556because now, in this larger search space size, they're not 557significant. Domain \SFSaidx{} (the one with the score of \SFSascore{} 558bits) got a conditional E-value of \SFSaevalue{} $\times$ \SFSdomZ{} = 559\SFSauevalue{}, and domain \SFSbidx{} (with a bit score of 560\SFSbscore{}) got a c-Evalue of \SFSbevalue{} $\times$ \SFSdomZ = 561\SFSbuevalue{}. These fail the default reporting threshold of 56210.0. Also, the domains with scores of \SFSainsig{} and \SFSbinsig{} 563shifted from being above to below the default inclusion thresholds, so 564now they're marked with \mono{?} instead of \mono{!}. 565 566Operationally: 567 568\begin{itemize} 569\item If the independent E-value is significant ($<<1$), that means 570that even this single domain \emph{by itself} is such a strong hit 571that it suffices to identify the sequence as a significant homolog 572with respect to the size of the entire original database search. You 573can be confident that this is a homologous domain. 574 575\item Once there's one or more high-scoring domains in the sequence 576already, sufficient to decide that the sequence contains homologs of 577your query, you can look (with some caution) at the conditional 578E-value to decide the statistical significance of additional 579weak-scoring domains. 580\end{itemize} 581 582In the UniProt output, for example, we'd be pretty sure of four of the 583domains (1, 4, 5, and maybe 6), each of which has a strong enough 584independent E-value to declare \mono{7LESS\_DROME} to be an 585fnIII-domain-containing protein. Domain 2 wouldn't be significant if 586it was all we saw in the sequence, but once we decide that 587\mono{7LESS\_DROME} contains fn3 domains on the basis of the other 588hits, its conditional E-value indicates that it's probably an fn3 589domain too. Domains 3 and 7 (the ones marked by \mono{?}) are too weak 590to be sure of, from this search alone, but would be something to pay 591attention to. 592 593The next four columns give the endpoints of the reported local 594alignment with respect to both the query profile (\mono{hmmfrom} and 595\mono{hmm to}) and the target sequence (\mono{alifrom} and \mono{ali 596 to}). 597 598It's not immediately easy to tell from the ``to'' coordinate whether 599the alignment ended internally in the query or target, versus ran all 600the way (as in a full-length global alignment) to the end(s). To make 601this more readily apparent, with each pair of query and target 602endpoint coordinates, there's also a little symbology: \mono{..} 603means both ends of the alignment ended internally, \mono{[]} 604means both ends of the alignment were full-length flush to the ends of 605the query or target, and \mono{[.} and \mono{.]} mean only the left or 606right end was flush/full length. 607 608The next two columns (\mono{envfrom} and \mono{env to}) define the 609\emph{envelope} of the domain's location on the target sequence. The 610envelope is almost always a little wider than what HMMER chooses to 611show as a reasonably confident alignment. As mentioned earlier, the 612envelope represents a subsequence that encompasses most of the 613posterior probability for a given homologous domain, even if precise 614endpoints are only fuzzily inferrable.\marginnote{You'll notice that for higher 615scoring domains, the coordinates of the envelope and the inferred 616alignment will tend to be in tighter agreement, corresponding to 617sharper posterior probability defining the location of the homologous 618region.} 619 620Operationally, I recommend using the envelope coordinates to annotate 621domain locations on target sequences, not the alignment 622coordinates. Be aware that when two weaker-scoring domains are close 623to each other, envelope coordinates (and, rarely, sometimes even 624alignment coordinates) can and will overlap, corresponding to the 625overlapping uncertainty of where one domain ends and another begins. 626 627The last column is the average posterior probability of the aligned 628target sequence residues; effectively, the expected accuracy per 629residue of the alignment. 630 631For comparison, current UniProt consensus annotation of Sevenless 632shows seven domains: 633 634 \xsreoutput{inclusions/sevenless_domains.out} 635 636These agree (modulo differences in start/endpoints) with the seven 637strongest domains identified by HMMER. The weaker partial domain hits 638(at \SFSacoords{} and \SFSbcoords{}) are also plausible homologies, given that 639the extracellular domain of Sevenless is pretty much just a big array 640of $\sim$100aa fibronectin repeats. 641 642Under the domain table, an ``optimal posterior accuracy'' 643alignment\cite{Holmes98} is computed within each domain's envelope 644and displayed. For example, (skipping domain 1 because it's weak and 645unconvincing), fibronectin III domain 2 in your \mono{7LESS\_DROME} 646output is shown as: 647 648 \xsreoutput{inclusions/hmmsearch-fn3-sevenless.out4} 649 650The initial header line starts with a \mono{==} as a little handle for 651a parsing script to grab hold of. I may put more information on that line 652eventually. 653 654If the profile had any consensus structure or reference line annotation 655that it inherited from your multiple alignment (\mono{\#=GC SS\_cons}, 656\mono{\#=GC RF} annotation in Stockholm files), that information is 657simply regurgitated as \mono{CS} or \mono{RF} annotation lines 658here. The \mono{fn3} profile had a consensus structure annotation line. 659 660The line starting with \mono{fn3} is the consensus of the query 661profile: the residue with the highest emission probability at each 662position.\sidenote{For a single sequence query in \mono{phmmer}, the 663 consensus is the sequence itself. Oddly, this is not necessarily the 664 highest probability sequence; for example, under a BLOSUM62 scoring 665 matrix, where the query has an M, you are more likely to see an 666 aligned L, not an M.} Capital letters represent particularly 667highly conserved positions.\sidenote{For protein models, $\geq$ 50\% 668 emission probability; for DNA/DNA, $\geq$ 90\%.} 669Dots (\mono{.}) in this line indicate insertions 670in the target sequence with respect to the profile. 671 672The midline indicates matches between the query profile and target 673sequence. A letter indicates an exact match to the profile consensus. 674A \mono{+} indicates that this residue has a positive log-odds 675emission score, a ``conservative substitution'' given what the profile 676expects at that position.\sidenote{That is, the emission probability 677 $e(a)$ for this aligned residue $a$ is $> f_a$, its background 678 frequency: it's a likely residue, just not the most likely one.} 679 680The line starting with \mono{7LESS\_DROME} is the target sequence. 681Dashes (\mono{-}) in this line indicate deletions in the target 682sequence with respect to the profile. 683 684The bottom line represents the posterior 685probability (essentially the expected accuracy) of each aligned 686residue. A 0 means 0-5\%, 1 means 5-15\%, and so on; 9 means 85-95\%, 687and a \mono{*} means 95-100\% posterior probability. You can use these 688posterior probabilities to decide which parts of the alignment are 689well-determined or not. You'll often observe, for example, that 690expected alignment accuracy degrades around locations of insertion and 691deletion, which you'd intuitively expect. 692 693You'll also see expected alignment accuracy degrade at the ends of an 694alignment -- this is because ``alignment accuracy'' posterior 695probabilities currently not only includes whether the residue is 696aligned to one profile position versus others, but also confounded with 697whether a residue should be considered to be homologous (aligned to 698the profile somewhere) versus not homologous at all.\marginnote{It may 699make more sense to condition the posterior probabilities on the 700assumption that the residue is indeed homologous: given that, how 701likely is it that we've got it correctly aligned.} 702 703 704These domain table and per-domain alignment reports for each sequence 705then continue, for each sequence that was in the per-sequence top hits 706list. 707 708Finally, at the bottom of the file, you'll see some summary 709statistics. For example, at the bottom of the globins search output, 710you'll find something like: 711 712 \xsreoutput{inclusions/hmmsearch-globins.out3} 713 714This gives you some idea of what's going on in HMMER's acceleration 715pipeline. You've got one query profile, and the database has \UNInseq{} 716target sequences. Each sequence goes through a gauntlet of three 717scoring algorithms called MSV, Viterbi, and Forward, in order of 718increasing sensitivity and increasing computational requirement. 719 720MSV (the ``Multi ungapped Segment Viterbi'' algorithm) essentially 721calculates the HMMER equivalent of BLAST's sum score -- an optimal sum 722of ungapped high-scoring alignment segments. Unlike BLAST, it does 723this calculation directly, without BLAST's word hit or hit extension 724step, using a SIMD vector-parallel algorithm. By default, HMMER is 725configured to allow sequences with a P-value of $\leq 0.02$ through 726the MSV score filter.\sidenote{Thus, if the database contained no homologs and 727P-values were accurately calculated, the highest scoring 2\% of the 728sequences will pass the filter.} Here, for this globin search, about 729\SGUmsvpass{}\% of the database got through the MSV filter. 730 731A quick check is then done to see if the target sequence is 732``obviously'' so biased in its composition that it's unlikely to be a 733true homolog. This is called the ``bias filter''. If you don't like it 734(it can occasionally be overaggressive) you can shut it off with the 735\mono{-{}-nobias} option. Here, \SGUbiaspass{} sequences pass through the bias 736filter. 737 738The Viterbi filter then calculates a gapped optimal alignment score. 739This is more sensitive than the MSV score, but the Viterbi filter is 740about four-fold slower than MSV. By default, HMMER lets sequences 741with a P-value of $\leq 0.001$ through this stage. Here, because 742there's about a thousand true globin homologs in this database, more 743than that gets through: \SGUvitpass{} sequences. 744 745Then the full Forward score is calculated, which sums over all 746possible alignments of the profile to the target sequence. The default 747allows sequences with a P-value of $\leq 10^{-5}$ through: \SGUfwdpass{} 748sequences pass. 749 750All sequences that make it through the three filters are then 751subjected to a full probabilistic analysis using the HMM 752Forward/Backward algorithms, first to identify domains and assign 753domain envelopes; then within each individual domain envelope, 754Forward/Backward calculations are done to determine posterior 755probabilities for each aligned residue, followed by optimal accuracy 756alignment. The results of this step are what you finally see on the 757output. 758 759Recall the difference between conditional and independent E-values, 760with their two different search space sizes. These search space sizes 761are reported in the statistics summary. 762 763Finally, it reports the speed of the search in units of Mc/sec 764(million dynamic programming cells per second), the CPU time, and the 765elapsed time. This search took about \SGUelapsed{} seconds of elapsed 766(wall clock) time. 767 768 769 770%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 771\section{Single sequence protein queries using phmmer} 772 773The \mono{phmmer} program is for searching a single sequence query 774against a sequence database, much as BLASTP or FASTA would 775do. \mono{phmmer} works essentially just like \mono{hmmsearch} does, 776except you provide a query sequence instead of a query profile. 777 778Internally, HMMER builds a profile from your single query 779sequence, using a simple position-independent scoring system (BLOSUM62 780scores converted to probabilities, plus a gap-open and gap-extend 781probability). 782 783The file \mono{tutorial/HBB\_HUMAN} is a FASTA file containing the 784human $\beta-$globin sequence as an example query. If you have a 785sequence database such as \mono{uniprot\_sprot.fasta}, make that your 786target database; otherwise, use \mono{tutorial/globins45.fa} as a 787small example: 788 789 \vspace{1ex} 790 \user{\% phmmer HBB\_HUMAN uniprot\_sprot.fasta} 791 \vspace{1ex} 792 793or 794 795 \vspace{1ex} 796 \user{\% phmmer HBB\_HUMAN globins45.fa} 797 \vspace{1ex} 798 799Everything about the output is essentially as previously described for 800\mono{hmmsearch}. 801 802 803 804 805%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 806\section{Iterative protein searches using jackhmmer} 807 808The \mono{jackhmmer} program is for searching a single sequence query 809iteratively against a sequence database, much as PSI-BLAST would do. 810 811The first round is identical to a \mono{phmmer} search. All the 812matches that pass the inclusion thresholds are put in a multiple 813alignment. In the second (and subsequent) rounds, a profile is made 814from these results, and the database is searched again with the 815profile. 816 817Iterations continue either until no new sequences are detected or the 818maximum number of iterations is reached. By default, the maximum 819number of iterations is 5; you can change this with the \mono{-N} 820option. 821 822Your original query sequence is always included in the multiple 823alignments, whether or not it appears in the database.\marginnote{If it 824 \emph{is} in the database, it will almost certainly be included in 825 the internal multiple alignment twice, once because it's the query 826 and once because it's a significant database match to itself. This 827 redundancy won't screw anything up, because sequences are 828 downweighted for redundancy anyway.} 829The ``consensus'' columns assigned to each multiple alignment always 830correspond exactly to the residues of your query, so the coordinate 831system of every profile is always the same as the numbering of 832residues in your query sequence, 1..L for a sequence of length L. 833 834Assuming you have UniProt or something like it handy, here's an 835example command line for a jackhmmer search: 836 837 \vspace{1ex} 838 \user{\% jackhmmer HBB\_HUMAN uniprot\_sprot.fasta} 839 \vspace{1ex} 840 841One difference from \mono{phmmer} output you'll notice is that 842\mono{jackhmmer} marks ``new'' sequences with a \mono{+} and ``lost'' 843sequences with a \mono{-}. New sequences are sequences that pass the 844inclusion threshold(s) in this round, but didn't in the round before. 845Lost sequences are the opposite: sequences that passed the inclusion 846threshold(s) in the previous round, but have now fallen beneath (yet 847are still in the reported hits -- it's possible, though rare, to lose 848sequences utterly, if they no longer even pass the reporting 849threshold(s)). In the first round, everything above the inclusion 850thresholds is marked with a \mono{+}, and nothing is marked with a 851\mono{-}. For example, the top of this output looks like: 852 853 \xsreoutput{inclusions/jackhmmer-hbb-uniprot.out} 854 855That continues until the inclusion threshold is reached, at which 856point you see a tagline ``inclusion threshold'' indicating where the 857threshold was set: 858 859 \xsreoutput{inclusions/jackhmmer-hbb-uniprot.out2} 860 861The domain output and search statistics are then shown just as in 862\mono{phmmer}. At the end of this first iteration, you'll see some 863output that starts with \mono{@@} (this is a simple tag that lets you 864search through the file to find the end of one iteration and the 865beginning of another): 866 867 \xsreoutput{inclusions/jackhmmer-hbb-uniprot.out3} 868 869This (obviously) is telling you that the new alignment contains \JHUninc{} 870sequences, your query plus \JHUnsig{} significant matches. For round two, 871it's built a new profile from this alignment. Now for round two, it 872fires off what's essentially an \mono{hmmsearch} of the target 873database with this new profile: 874 875 \xsreoutput{inclusions/jackhmmer-hbb-uniprot.out4} 876 877If you skim down through this output, you'll start seeing newly 878included sequences marked with \mono{+}'s, such as: 879 880 \xsreoutput{inclusions/jackhmmer-hbb-uniprot.out5} 881 882It's less usual to see sequences get lost (and marked with \mono{-}), 883but it happens. 884 885After round 2, more distant globin sequences have been found: 886 887 \xsreoutput{inclusions/jackhmmer-hbb-uniprot.out6} 888 889Because new sequences were included, it keeps going to round three, 890and then again to round four. After round four, the search ends 891because it didn't find any new hits; it considers the search to be 892``converged'' and it stops. (It would also eventually stop at a 893certain maximum number of iterations; the default maximum is 5, and 894you can set a different maximum with the \mono{-N} option.) The end 895of the output is: 896 897 \xsreoutput{inclusions/jackhmmer-hbb-uniprot.out7} 898 899The \mono{//} marks the end of the results for one query. You could 900search with more than one query in your input query sequence 901file. 902 903There is an \mono{[ok]} at the end of the search output as a signal 904that the search successfully completed. This might be useful if you're 905automating lots of searches and you want to be sure that they worked. 906 907 908 909%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 910\section{Searching a profile database with a query sequence} 911 912Rather than searching a single profile against a collection of 913sequences, you might want to wish to annotate a single sequence by 914searching it against a collection of profiles of different domains. 915\mono{hmmscan} takes as input a query file containing one or more 916sequences to annotate, and a profile database to search them against. 917The profile database might be Pfam, SMART, or TIGRFams, for example, or 918another collection of your choice. \marginnote{Either \mono{hmmsearch} 919 or \mono{hmmscan} can compare a set of profiles to a set of 920 sequences. Due to disk access patterns of the two tools, it is 921 usually more efficient to use \mono{hmmsearch}, unless the number of 922 profiles greatly exceeds the number of sequences.} 923 924\subsection{Step 1: create a profile database file} 925 926A profile ``database'' file is just a concatenation of individual 927profile files. To create a database file, you can either build 928individual profile files and concatenate them, or you can concatenate 929Stockholm alignments and use \mono{hmmbuild} to build a profile database 930from them in one command. 931 932Let's create a tiny database called \mono{minifam} containing profiles 933of globin, fn3, and Pkinase (protein kinase) domains by concatenating 934profile files: 935 936 \vspace{1ex} 937 \user{\% hmmbuild globins4.hmm globins4.sto}\\ 938 \user{\% hmmbuild fn3.hmm fn3.sto}\\ 939 \user{\% hmmbuild Pkinase.hmm Pkinase.sto}\\ 940 \user{\% cat globins4.hmm fn3.hmm Pkinase.hmm > minifam} 941 \vspace{1ex} 942 943We'll use \mono{minifam} for our examples in just a bit, but first a 944few words on other ways to build profile databases, especially big ones. 945 946The other way to do it is to start with an \emph{alignment} database 947flatfile -- a concatenation of many Stockholm files -- and use 948\mono{hmmbuild} to build a profile database file from it. For 949example, you could obtain the big \mono{Pfam-A.seed} and/or 950\mono{Pfam-A.full} Stockholm-format alignment flatfiles from Pfam. 951\mono{hmmbuild} names each profile according to a \mono{\#=GF ID} 952annotation line in each Stockholm alignment. Normally the \mono{ID} 953line is optional in Stockholm format, but \mono{hmmbuild} has to name 954your new profile(s) somehow. For a single alignment, it will use your 955filename, or you can use the \mono{hmmbuild -n <name>} option to 956provide a name yourself. For an alignment database, the only way 957\mono{hmmbuild} can get a name for each alignment is from alignment 958annotation.\marginnote{For example, it won't work if you concatenate 959 \mono{globins4.sto} with other Stockholm files, because the simple 960 little \mono{globins4.sto} alignment doesn't have an \mono{ID} 961 line.} Of alignment file formats, only Stockholm format provides a 962way to concatenate many alignments in the same file, with a name for 963each alignment. For example, from a Pfam seed alignment flatfile 964\mono{Pfam-A.seed}, you can do: 965 966 \vspace{1ex} 967 \user{\% hmmbuild Pfam-A.hmm Pfam-A.seed} 968 \vspace{1ex} 969 970This would probably take a couple of hours to build all 20,000 profiles 971or so in Pfam. To speed the database construction process up, 972\mono{hmmbuild} supports MPI parallelization. Running MPI programs can 973be a little arcane, so skip this bit if you're not in the mood. 974 975As far as HMMER's concerned, all you have to do is add \mono{-{}-mpi} 976to the command line for \mono{hmmbuild} to tell it to run in MPI 977master/worker mode across many cores and/or machines, assuming you've 978compiled support for MPI into it (see the installation instructions). 979You'll also need to know how to invoke an MPI job in your particular 980cluster environment, with your job scheduler and your MPI 981distribution.\sidenote{I can't really help you with this. Different sites 982have different cluster, scheduler, and MPI environments. Consult a 983local guru, as they say.} In general, you will launch the parallel 984\mono{hmmbuild} by using a command like \mono{mpirun} or \mono{srun} 985that manages the MPI environment for a specified number of processes. 986With the SGE (Sun Grid Engine) scheduler and Intel MPI, an example 987incantation for building \mono{Pfam.hmm} from \mono{Pfam-A.seed} in 988parallel across 128 processes: 989 990 \vspace{1ex} 991 \user{\% qsub -N hmmbuild -j y -o errors.out -b y -cwd -V -pe impi 128 \textbackslash}\\ 992 \user{ 'mpirun -np 128 ./hmmbuild --mpi Pfam.hmm Pfam-A.seed > hmmbuild.out'} 993 \vspace{1ex} 994 995or, an example SLURM incantation (on the \mono{eddy} group partition 996on our Harvard cluster): 997 998 \vspace{1ex} 999 \begin{fullwidth} 1000 \user{\indent \% sbatch -J hmmbuild -e hmmbuild.err -o hmmbuild.out -p eddy -n 128 -t 6-00:00 --mem-per-cpu=4000 \textbackslash}\\ 1001 \user{\indent --wrap="srun -n 128 --mpi=pmi2 ./hmmbuild --mpi Pfam-A.hmm Pfam-A.seed"} 1002 \end{fullwidth} 1003 \vspace{1em} 1004 1005This reduces the time to build all of Pfam to about 40 seconds. 1006 1007 1008 1009\subsection{Step 2: compress and index the flatfile with hmmpress} 1010 1011\mono{hmmscan} has to read a lot of profiles in a hurry, and 1012HMMER's text flatfiles are bulky. To accelerate this, \mono{hmmscan} 1013depends on binary compression and indexing of the flatfiles. First 1014you compress and index your profile database with the \mono{hmmpress} 1015program: 1016 1017 \vspace{1ex} 1018 \user{\% hmmpress minifam} 1019 \vspace{1ex} 1020 1021This will produce: 1022 1023 \xsreoutput{inclusions/hmmpress-minifam.out} 1024 1025and you'll see these four new binary files in the directory.\sidenote{ 1026Their format is ``proprietary'', an open source term of art that means 1027both ``I haven't found time to document them yet'' and ``I still might 1028decide to change them arbitrarily without telling you''.} 1029 1030\subsection{Step 3: search the profile database with hmmscan} 1031 1032Now we can analyze sequences using our profile database and 1033\mono{hmmscan}. 1034 1035For example, the receptor tyrosine kinase \mono{7LESS\_DROME} not only 1036has all those fibronectin type III domains on its extracellular side, 1037it's got a protein kinase domain on its intracellular side. Our 1038\mono{minifam} database has profiles of both \mono{fn3} and 1039\mono{Pkinase}, as well as the unrelated \mono{globins4} profile. So 1040what happens when we scan the \mono{7LESS\_DROME} sequence: 1041 1042 \vspace{1ex} 1043 \user{\% hmmscan minifam 7LESS\_DROME} 1044 \vspace{1ex} 1045 1046The header and the first section of the output will look like: 1047 1048 \xsreoutput{inclusions/hmmscan-minifam-sevenless.out} 1049 1050The output fields are in the same order and have the same meaning as 1051in \mono{hmmsearch}'s output. 1052 1053The size of the search space for \mono{hmmscan} is the number of 1054profiles in the profile database (here, 3; for a Pfam search, on the order 1055of 20000). In \mono{hmmsearch}, the size of the search space is the 1056number of sequences in the sequence database. This means that E-values 1057may differ even for the same individual profile vs. sequence 1058comparison, depending on how you do the search. 1059 1060For domain, there then follows a domain table and alignment output, 1061just as in \mono{hmmsearch}. The \mono{fn3} annotation, for example, 1062looks like: 1063 1064 \xsreoutput{inclusions/hmmscan-minifam-sevenless.out2} 1065 1066and an example alignment (of that second domain again): 1067 1068 \xsreoutput{inclusions/hmmscan-minifam-sevenless.out3} 1069 1070You'd probably expect that except for the E-values (which depend on 1071database search space sizes), you should get exactly the same scores, 1072domain number, domain coordinates, and alignment every time you do a 1073comparison of the same profile against the same sequence. Which is 1074actually the case! But in fact, under the hood, it's actually not so 1075obvious this should be, and HMMER is actually going out of its way to 1076make it so. HMMER uses stochastic sampling algorithms to infer some 1077parameters, and also to infer the exact domain number and domain 1078boundaries in certain difficult cases. If HMMER ran its stochastic 1079samples ``properly'', it would obtain different samples every time you 1080ran a program, and all of you would complain to me that HMMER was 1081weird and buggy because it gave different answers on the same 1082problem. To suppress run-to-run variation, HMMER seeds its random 1083number generator(s) \emph{identically} every time you do a sequence 1084comparison. If you're a stats expert, and you really want to see the 1085proper stochastic variation that results from sampling algorithms, you 1086can pass a command-line argument of \mono{-{}-seed 0} to programs that 1087have this property (hmmbuild and the four search programs). 1088 1089 1090 1091\subsection{Summary statistics for a profile database: hmmstat} 1092 1093Our \mono{minifam} profile ``database'' example only contains three 1094profiles, but real profile databases like Pfam can contain many 1095thousands. The \mono{hmmstat} program is a utility that summarizes the 1096content of a profile database. If you do: 1097 1098 \vspace{1ex} 1099 \user{\% hmmstat minifam} 1100 \vspace{1ex} 1101 1102you'll get: 1103 1104 \xsreoutput{inclusions/hmmstat-minifam.out} 1105 1106The output is one line per profile, numbered. Some of the fields are 1107more meaningful to you than others; some are sort of cryptic relics 1108of development that we haven't cleaned up yet: 1109 1110\begin{sreitems}{\monob{accession}} 1111 1112\item [\monob{idx}] Number, in order in the database. 1113\item [\monob{name}] Name of the profile. 1114\item [\monob{accession}] Accession (if present; else '-') 1115\item [\monob{nseq}] Number of sequences in the alignment this 1116 profile was built from. 1117 1118\item [\monob{eff\_nseq}] Effective sequence number. 1119 This was the ``effective'' number of independent sequences that 1120 \mono{hmmbuild's} default ``entropy weighting'' step decided on, 1121 given the phylogenetic similarity of the \mono{nseq} sequences in 1122 the input alignment. 1123 1124\item [\monob{M}] Length of the profile in consensus residues (match states). 1125 1126\item [\monob{relent}] Mean relative entropy of the match state 1127 emission probabilities, relative to default null background 1128 frequencies, in bits. This is the average bit score per aligned 1129 consensus residue. This quantity is the target of \mono{hmmbuild}'s 1130 entropy weighting procedure for determining \monob{eff\_nseq}. 1131 1132\item [\monob{info}] Mean information content per match state emission 1133 probability vector, in bits. Probably not useful to you. Information 1134 content is just a slightly different calculation from 1135 \monob{relent}. 1136 1137\item [\monob{p relE}] Mean positional relative entropy, in bits. 1138 Also probably not useful to you. This is an average relative entropy 1139 per position that takes into account the transition 1140 (insertion/deletion) probabilities. It should be a more accurate 1141 estimation of the average bit score contributed per aligned model 1142 consensus position. 1143 1144\item [\monob{compKL}] Kullback-Leibler (KL) divergence from the 1145 average composition of the profile's consensus match states to the 1146 default background frequency distribution, in bits. The higher this 1147 number, the more biased the residue composition of the profile 1148 is. Highly biased profiles may produce more false positives in 1149 searches, and can also slow the HMMER3 acceleration pipeline, by 1150 causing too many nonhomologous sequences to pass the filters. 1151 1152\end{sreitems} 1153 1154 1155%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 1156\section{Creating multiple alignments with hmmalign} 1157 1158The file \mono{globins45.fa} is a FASTA file containing 45 1159unaligned globin sequences. To align all of these to the 1160\mono{globins4} profile and make a multiple sequence alignment: 1161 1162 \vspace{1ex} 1163 \user{\% hmmalign globins4.hmm globins45.fa} 1164 \vspace{1ex} 1165 1166The output of this is a Stockholm format multiple alignment file. The 1167first few lines of it look like: 1168 1169 \xsreoutput{inclusions/hmmalign-globins.out} 1170 1171and so on. (I've truncated long lines.) 1172 1173First thing to notice here is that \mono{hmmalign} uses both lower 1174case and upper case residues, and it uses two different characters for 1175gaps. This is because there are two different kinds of columns: 1176``match'' columns in which residues are assigned to match states and 1177gaps are treated as deletions relative to consensus, and ``insert'' 1178columns where residues are assigned to insert states and gaps in other 1179sequences are just padding for the alignment to accomodate those 1180insertions. In a match column, residues are upper case, and a '-' 1181character means a deletion relative to the consensus. In an insert 1182column, residues are lower case, and a '.' is padding. A '-' deletion 1183has a cost: transition probabilities were assessed, penalizing the 1184transition into and out of a deletion. A '.' pad has no cost per se; 1185instead, the sequence(s) with insertions are paying transition 1186probabilities into and out of their inserted residue. 1187 1188This notation is only for your convenience in output files. You can 1189see the structure of the profile reflected in the pattern of 1190residues and gap characters \marginnote{By default, \mono{hmmalign} 1191 removes any columns that are all deletion characters, so the number 1192 of apparent match columns in a displayed alignment is $\leq$ the 1193 actual number of match states in the profile. To prevent this 1194 trimming and see columns for all match states, use the 1195 \mono{-{}-allcol} option. This can be helpful if you're writing a 1196 postprocessor that's trying to keep track of what columns are 1197 assigned to what match states in the profile.}. In input files, in 1198most alignment formats\sidenote[][1ex]{A2M format is an important exception!} HMMER is 1199case-insensitive, and it does not distinguish between different gap 1200characters: '-' (dash), '.' (period), or even '\_' (underscore) are 1201accepted as gap characters. 1202 1203Important: insertions relative to a profile are 1204\emph{unaligned}. Suppose one sequence has an insertion of length 10 1205and another has an insertion of length 2 in the same place in the 1206profile. The alignment will have ten insert columns, to accomodate the 1207longest insertion. The residues of the shorter insertion are thrown 1208down in an arbitrary order.\marginnote[2ex]{By arbitrary HMMER convention, 1209 the insertion is divided in half; half is left-justified, and the 1210 other half is right-justified, leaving '.' characters in the 1211 middle.} Notice that in the previous paragraph I oh-so-carefully 1212said residues are ``assigned'' to a state, not ``aligned'' to a 1213state. For match states, assigned and aligned are the same thing: a 1214one-to-one correspondence between a residue and a consensus match 1215state in the profile. But there may be one \emph{or more} residues 1216assigned to the same insert state. 1217 1218Don't be confused by the unaligned nature of profile 1219insertions. You're sure to see cases where lower-case inserted 1220residues are ``obviously misaligned''. This is just because HMMER 1221isn't trying to ``align'' them in the first place. It's assigning 1222them to unaligned insertions. 1223 1224Enough about the sequences in the alignment. Now, notice all those 1225\mono{PP} annotation lines. That's posterior probability annotation, 1226as in the single sequence alignments that \mono{hmmscan} and 1227\mono{hmmsearch} showed. This represents the confidence that each 1228residue is assigned where it should be. 1229 1230Again, that's ``assigned'', not ``aligned''. The posterior probability 1231assigned to an inserted residue is the probability that it is assigned 1232to the insert state corresponding to that column. Because the same 1233insert state might correspond to more than one column, the probability 1234on an insert residue is \emph{not} the probability that it belongs in 1235that particular column; again, if there's a choice of column for 1236putting an inserted residue, that choice is arbitrary. 1237 1238\mono{hmmalign} currently has a, um, feature that you may dislike. 1239Recall that HMMER only does local alignments. Here, we know that we've 1240provided full length globin sequences, and \mono{globins4} is a full 1241length globin profile. We'd probably like \mono{hmmalign} to produce a 1242global alignment. It can't currently do that. If it doesn't quite 1243manage to extend its local alignment to the full length of a target 1244globin sequence, you'll get a weird-looking effect, as the nonmatching 1245termini are pulled out to the left or right. For example, look at the 1246N-terminal \mono{g} in \mono{MYG\_HORSE} above. HMMER is about 80\% 1247confident that this residue is nonhomologous, though any sensible 1248person would align it into the first globin consensus column. 1249 1250Look at the end of that first block of Stockholm alignment, where you'll 1251see: 1252 1253 \xsreoutput{inclusions/hmmalign-globins.out} 1254 1255The \mono{\#=GC PP\_cons} line is Stockholm-format \emph{consensus 1256 posterior probability} annotation for the entire column. It's the 1257arithmetic mean of the per-residue posterior probabilities in that 1258column. This should prove useful in phylogenetic inference 1259applications, for example, where it's common to mask away 1260nonconfidently aligned columns of a multiple alignment. The 1261\mono{PP\_cons} line provides an objective measure of the confidence 1262assigned to each column. 1263 1264The \mono{\#=GC RF} line is Stockholm-format \emph{reference 1265 coordinate annotation}, with an x marking each column that the 1266profile considered to be consensus. 1267 1268 1269 1270%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 1271\section{Searching DNA sequences} 1272 1273HMMER was originally developed for protein sequence analysis. The 1274\mono{hmmsearch} and \mono{hmmscan} programs assume that it's sensible 1275to ask if the entire target sequence is homologous (or not) to a query 1276profile. It makes sense to say ``this sequence is a probable protein 1277kinase'' because we find a protein kinase domain in it. What if you 1278want to use a DNA profile to search a very long (chromosome-sized) 1279piece of DNA for homologous regions? We might want to identify Alu 1280and L1 elements in human chromosome sequences, for example. It's not 1281super useful to see the 24 chromosomes ranked by E-values in 1282\mono{hmmsearch} output -- we're only interested in the element 1283locations. Also, if we can avoid having to align the entire target 1284chromosome sequence at once, we can scan the profile along the target 1285sequence in a much more memory-efficient manner than 1286\mono{hmmsearch}/\mono{hmmscan} would do. 1287 1288The \mono{nhmmer} and \mono{nhmmscan} programs are designed for 1289memory-efficient DNA profile searches of long DNA sequences. They were 1290developed in concert with the Dfam database (\url{dfam.org}), which 1291provides of alignments and profiles of many common DNA repeat elements 1292for several important genomes. The alignment 1293\mono{tutorial/MADE1.sto} is a representative alignment of 100 human 1294MADE1 transposable elements, a subset of the Dfam MADE1 1295alignment. We'll use the MADE1 alignment to show how 1296\mono{nhmmer}/\mono{nhmmscan} work, which are similar to 1297\mono{hmmsearch}/\mono{hmmscan}. 1298 1299\subsection{Step 1: build a profile with hmmbuild} 1300 1301\mono{hmmbuild} works for both protein and DNA profiles, so: 1302 1303 \vspace{1ex} 1304 \user{\% hmmbuild MADE1.hmm MADE1.sto} 1305 \vspace{1ex} 1306 1307and you'll see some output that looks like: 1308 1309 \xsreoutput{inclusions/hmmbuild-made1.out} 1310 1311 1312Notice the new output column with the header ``W'', which is only 1313present when the input sequence alignment is DNA/RNA. This represents 1314an upper bound on the length at which nhmmer expects to find an 1315instance of the family\marginnote{W is based on position-specific insert 1316 rates: only $10^{-7}$ of all sequences generated from the profile 1317 are expected to be longer than W.}. It is always larger than mlen, 1318though the ratio of mlen to W depends on the observed insert rate in 1319the seed alignment. This length is used deep in the acceleration 1320pipeline, and modest changes are not expected to impact results, but 1321larger values of W do lead to longer run time. The value can be 1322overridden with the \mono{-{}-w\_length} or \mono{-{}-w\_beta} flags, at 1323the risk of possibly missing instances of the family that happen to be 1324longer than W due to plentiful insertions. 1325 1326 1327 1328\subsection{Step 2: search the DNA sequence database with nhmmer} 1329 1330We'll use \mono{dna\_target.fa} as the target sequence 1331database. It is a FASTA format file containing one 330Kb long DNA 1332sequence extracted from human chromosome 1. 1333 1334\mono{nhmmer} accepts a target DNA sequence database in the same 1335formats as hmmsearch (typically FASTA). For the query, it accepts 1336either a profile file as produced above by hmmbuild, or a file 1337containing either one DNA sequence or an alignment of multiple DNA 1338sequences. 1339 1340If a sequence or alignment is used as query input, \mono{nhmmer} 1341internally produces the profile for that alignment\marginnote{Using 1342 default hmmbuild parameters; if you want more control, explicitly 1343 built the profile with hmmbuild.}, then searches with that 1344profile. The profile produced in this way is automatically saved to 1345disk, and the default file name is chosen by appending ``.hmm'' to the 1346name of the sequence file name. This can be overridden with the 1347\mono{-{}-hmmout} flag. 1348 1349To search \mono{dna\_target.fa} with our \mono{MADE1.hmm} profile: 1350 1351 \vspace{1ex} 1352 \user{\% nhmmer MADE1.hmm dna\_target.fa} 1353 \vspace{1ex} 1354 1355This output is largely similar to that of hmmsearch. The key 1356difference is that each hit is not to a full sequence in the target 1357database, but one local alignment of the profile to a subsequence of a 1358target database sequence. 1359 1360The first section is the \emph{header} that tells you what program you 1361ran, on what, and with what options, as above. 1362 1363The second section is the \emph{top hits} list. It is a list 1364of ranked top hits (sorted by E-value, most significant hit first), 1365formatted much like the \mono{hmmsearch} output \emph{top hits} list: 1366 1367 \xsreoutput{inclusions/nhmmer-made1.out} 1368 1369For each hit, the table shows its \emph{E-value}, \emph{bit score}, 1370\emph{bias}, \emph{target sequence name} and \emph{target sequence 1371 description}, much like \mono{hmmsearch}. 1372 1373The ``start'' and ``end'' columns are the coordinates in the target 1374sequence where the hit is found. When the ``end'' is smaller than 1375``start'', this means the hit found on the reverse complement of the 1376target database sequence.\marginnote{\mono{nhmmer} automatically 1377 searches both strands.} 1378 1379For example, note that the top hits here are coming in overlapping 1380pairs, corresponding to the forward and reverse strands, like the hit 1381to \NMHafrom{}..\NMHato on the forward strand and a hit to 1382\NMHbfrom{}..\NMHbto{} on the reverse strand. This is because the 1383MADE1 DNA element is a near-perfect palindrome.\marginnote{DNA 1384 elements that have a unique orientation will only hit on one strand. 1385 \mono{nhmmer} treats the two strands independently. Palindromic 1386 elements will hit the same region on both strands and \mono{nhmmer} 1387 will not filter the overlapping hits.} 1388 1389Then comes the third output section, which starts with 1390 1391 \xsreoutput{inclusions/nhmmer-made1.out2} 1392 1393For each hit in the top hits list, there is a tabular line 1394providing detailed information about the hit, followed by the alignment 1395inferred for the hit. The first \mono{MADE1} hit 1396looks like: 1397 1398 \xsreoutput{inclusions/nhmmer-made1.out3} 1399 1400All these pieces of information are as described for \mono{hmmsearch}, 1401plus a column for ``sq len'' that indicates the full length of the 1402target sequence. 1403 1404Under each one-line hit table is displayed the alignment inferred 1405between the profile and the hit envelope. For example, the top hit 1406from above is shown as: 1407 1408 \xsreoutput{inclusions/nhmmer-made1.out4} 1409 1410The alignment format is the same as for \mono{hmmsearch}. 1411 1412At the end of the output, you'll see summary statistics: 1413 1414 \xsreoutput{inclusions/nhmmer-made1.out5} 1415 1416This gives you some idea of what's going on in nhmmer's acceleration 1417pipeline. You've got one query profile, and \NMHnres{} residues were 1418searched (there are \NMHntop{} bases in the single sequence found in 1419the file; the search includes the reverse complement, doubling the 1420search space). The sequences in the database go through a gauntlet of 1421three scoring algorithms called SSV, Viterbi, and Forward, in order of 1422increasing sensitivity and increasing computational requirement. 1423 1424SSV (the ``Single ungapped Segment Viterbi'' algorithm) as used in 1425nhmmer is closely related to the MSV algorithm used in 1426\mono{hmmsearch}, in that it depends on ungapped alignment 1427segments. The difference lies in how those alignments are used. Using 1428MSV, a sequence is either rejected or accepted in its entirety. In the 1429scanning-SSV filter of \mono{nhmmer}, each sequence in the database is 1430scanned for high-scoring ungapped alignment segments, and a window 1431around each such segment is extracted (merging overlapping windows), 1432and passed on to the next stage. By default, nhmmer is configured to 1433allow sequence segments with a P-value of $\leq 0.02$ through the SSV 1434score filter.\sidenote{Thus, if the database contained no homologs and P-values 1435were accurately calculated, the highest scoring 2\% of the sequence 1436will pass the filter.} Here, \NMHnssv{} bases, 1437or about \NMHfracssv{}\% of the database, got through the SSV filter. 1438 1439The ``bias filter'' is then applied, as in \mono{hmmsearch}. Here, 1440\NMHnbias{} bases, roughly \NMHfracbias{}\% of the database 1441pass through the bias filter. 1442 1443The Viterbi filter then calculates a gapped optimal alignment score 1444for each window that survived the earlier stages. This score is a 1445closer approximation than the SSV score of the final score that the 1446window will achieve if it survives to final processing, but the 1447Viterbi filter is about four-fold slower than SSV. By default, nhmmer 1448lets windows with a P-value of $\leq 0.001$ through this stage. Here, 1449\NMHnvit{} bases, about \NMHfracvit{}\% of the database gets through. 1450 1451Then the full Forward score is calculated, which sums over all 1452possible alignments of the profile to the window. The default allows 1453windows with a P-value of $\leq 10^{-5}$ through; \NMHnfwd{} bases 1454passed. 1455 1456All windows that make it through these filters are then subjected to a 1457full probabilistic analysis using the HMM Forward/Backward algorithms, 1458to identify hit envelopes, then determine posterior probabilities for 1459each aligned residue, followed by optimal accuracy alignment. The 1460results of this step are what you finally see on the output. The final 1461number of hits and fractional coverage of the database is shown 1462next. This is typically smaller than the fraction of the database 1463passing the Forward filter, as hit identification typically trims 1464windows down to a smaller envelope. 1465 1466Finally, nhmmer reports the speed of the search in units of Mc/sec 1467(million dynamic programming cells per second), the CPU time, and the 1468elapsed time. 1469 1470\mono{nhmmscan} is to \mono{hmmscan} as \mono{nhmmer} is to 1471\mono{hmmsearch}. There is not currently a iterative DNA search 1472analog to \mono{jackhmmer}. 1473 1474 1475 1476 1477 1478 1479 1480