1\chapter{Tabular output formats}
2\label{chapter:tabular}
3\setcounter{footnote}{0}
4
5\section{The target hits table}
6
7The \mono{-{}-tblout} output option produces the \emph{target hits
8  table}.  The target hits table consists of one line for each
9different query/target comparison that met the reporting thresholds,
10ranked by decreasing statistical significance (increasing E-value).
11
12
13\paragraph{tblout fields for protein search programs}
14
15In the protein search programs, each line consists of \textbf{18
16space-delimited fields} followed by a free text target sequence description, as
17follows:\marginnote{The \mono{tblout} format is deliberately space-delimited
18(rather than tab-delimited) and justified into aligned columns, so these files
19  are suitable both for automated parsing and for human
20  examination. I feel that tab-delimited data files are difficult for humans to
21  examine and spot check. For this reason, I think tab-delimited
22  files are a minor evil in the world. Although I occasionally
23  receive shrieks of outrage about this, I still stubbornly feel that
24  space-delimited files are just as easily parsed as tab-delimited
25  files.}
26
27\begin{description}
28\item[\monob{(1) target name:}]
29  The name of the target sequence or profile.
30
31\item[\monob{(2) accession:}]
32  The accession of the target sequence or profile, or '-' if none.
33
34\item[\monob{(3) query name:}]
35  The name of the query sequence or profile.
36
37\item[\monob{(4) accession:}]
38  The accession of the query sequence or profile, or '-' if none.
39
40\item[\monob{(5) E-value (full sequence):}] The expectation value
41  (statistical significance) of the target.  This is a \emph{per
42  query} E-value; i.e.\ calculated as the expected number of false
43  positives achieving this comparison's score for a \emph{single}
44  query against the $Z$ sequences in the target dataset.  If you
45  search with multiple queries and if you want to control the
46  \emph{overall} false positive rate of that search rather than the
47  false positive rate per query, you will want to multiply this
  per-query E-value by the number of queries you are running.
49
50\item[\monob{(6) score (full sequence):}]
51  The score (in bits) for this target/query comparison. It includes
52  the biased-composition correction (the ``null2'' model).
53
54\item[\monob{(7) Bias (full sequence):}] The biased-composition
55  correction: the bit score difference contributed by the null2
56  model. High bias scores may be a red flag for a false positive,
  especially when the bias score is as large as or larger than the
  overall bit score. It is difficult to correct for all possible ways
  in which nonrandom but nonhomologous biological sequences can
60  appear to be similar, such as short-period tandem repeats, so there
61  are cases where the bias correction is not strong enough (creating
62  false positives).
63
64\item[\monob{(8) E-value (best 1 domain):}] The E-value if only the
65  single best-scoring domain envelope were found in the sequence, and
66  none of the others. If this E-value isn't good, but the full
67  sequence E-value is good, this is a potential red flag.  Weak hits,
68  none of which are good enough on their own, are summing up to lift
69  the sequence up to a high score. Whether this is Good or Bad is not
70  clear; the sequence may contain several weak homologous domains, or
71  it might contain a repetitive sequence that is hitting by chance
  (i.e.\ once one repeat hits, all the repeats hit).
73
74\item[\monob{(9) score (best 1 domain):}]  The bit score if only the
75  single best-scoring domain envelope were found in the sequence, and
  none of the others. (Inclusive of the null2 bias correction.)
77
78\item[\monob{(10) bias (best 1 domain):}] The null2 bias correction
79  that was applied to the bit score of the single best-scoring domain.
80
81\item[\monob{(11) exp:}] Expected number of domains, as calculated by
82  posterior decoding on the mean number of begin states used in the
83  alignment ensemble.
84
85\item[\monob{(12) reg:}] Number of discrete regions defined, as
86  calculated by heuristics applied to posterior decoding of begin/end
87  state positions in the alignment ensemble.  The number of regions
88  will generally be close to the expected number of domains. The more
89  different the two numbers are, the less discrete the regions appear
90  to be, in terms of probability mass. This usually means one of two
91  things. On the one hand, weak homologous domains may be difficult
92  for the heuristics to identify clearly. On the other hand,
93  repetitive sequence may appear to have a high expected domain number
94  (from lots of crappy possible alignments in the ensemble, no one of
95  which is very convincing on its own, so no one region is discretely
96  well-defined).
97
98\item[\monob{(13) clu:}] Number of regions that appeared to be
99  multidomain, and therefore were passed to stochastic traceback
100  clustering for further resolution down to one or more
101  envelopes. This number is often zero.
102
103\item[\monob{(14) ov:}] For envelopes that were defined by stochastic
104  traceback clustering, how many of them overlap other envelopes.
105
106\item[\monob{(15) env:}]
107  The total number of envelopes defined, both by single envelope
108  regions and by stochastic traceback clustering into one or more
109  envelopes per region.
110
111\item[\monob{(16) dom:}] Number of domains defined. In general, this
112  is the same as the number of envelopes: for each envelope, we find
113  an MEA (maximum expected accuracy) alignment, which defines the
114  endpoints of the alignable domain.
115
116\item[\monob{(17) rep:}]
117  Number of domains satisfying reporting thresholds. If you've also
118  saved a \mono{-{}-domtblout} file, there will be one line in it
119  for each reported domain.
120
121\item[\monob{(18) inc:}]
122  Number of domains satisfying inclusion thresholds.
123
124\item[\monob{(19) description of target:}]
125  The remainder of the line is the target's description line, as free text.
126\end{description}
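
The field layout above can be read with a few lines of code. The
following Python sketch is an illustration only, not part of HMMER. It
assumes that blank lines and lines starting with \mono{\#} carry no hit
data, and the names \mono{parse\_tblout\_line},
\mono{overall\_evalue}, and the dictionary keys are hypothetical labels
chosen for this example. The sample line is made up.

```python
# Illustrative sketch only. The key names are our own labels for the 18
# fields described above; they are not identifiers defined by HMMER.
PROTEIN_TBLOUT_KEYS = [
    "target_name", "target_accession", "query_name", "query_accession",
    "full_evalue", "full_score", "full_bias",
    "best_evalue", "best_score", "best_bias",
    "exp", "reg", "clu", "ov", "env", "dom", "rep", "inc",
]
FLOAT_KEYS = ("full_evalue", "full_score", "full_bias",
              "best_evalue", "best_score", "best_bias", "exp")
INT_KEYS = ("reg", "clu", "ov", "env", "dom", "rep", "inc")


def parse_tblout_line(line):
    """Return a dict for one protein tblout line, or None for non-data lines."""
    # Assumption: blank lines and lines starting with '#' carry no hit data.
    if not line.strip() or line.startswith("#"):
        return None
    # Split on runs of whitespace at most 18 times, so the free-text
    # description (which may itself contain spaces) stays in one piece.
    parts = line.rstrip().split(None, 18)
    if len(parts) < 18:
        raise ValueError("expected at least 18 fields, got %d" % len(parts))
    hit = dict(zip(PROTEIN_TBLOUT_KEYS, parts))  # zip stops after 18 keys
    for key in FLOAT_KEYS:
        hit[key] = float(hit[key])
    for key in INT_KEYS:
        hit[key] = int(hit[key])
    hit["description"] = parts[18] if len(parts) > 18 else ""
    return hit


def overall_evalue(per_query_evalue, n_queries):
    """Scale a per-query E-value by the number of queries searched."""
    return per_query_evalue * n_queries


# A made-up data line, for demonstration only.
example = ("seqA  -  profX  ACC0001  1.2e-30  105.3  0.1  "
           "2.0e-30  104.6  0.1  1.0  1  0  0  1  1  1  1  "
           "Example protein, made up")
hit = parse_tblout_line(example)
print(hit["target_name"], hit["full_evalue"], hit["description"])
```

Splitting with a maximum split count, rather than slicing at fixed
character offsets, keeps the parser independent of the column alignment.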
127
128
129
130\paragraph{tblout fields for DNA search programs}
131
132In the DNA search programs, there is less concentration on domains, and more
133focus on presenting the hit ranges. Each line consists of \textbf{15
134space-delimited fields} followed by a free text target sequence description, as follows:
135
136\begin{description}
137\item[\monob{(1) target name:}]
138  The name of the target sequence or profile.
139
140\item[\monob{(2) accession:}]
141  The accession of the target sequence or profile, or '-' if none.
142
143\item[\monob{(3) query name:}]
144  The name of the query sequence or profile.
145
146\item[\monob{(4) accession:}]
147  The accession of the query sequence or profile, or '-' if none.
148
149\item[\monob{(5) hmmfrom:}]
150  The position in the hmm at which the hit starts.
151
152\item[\monob{(6) hmm to:}]
153  The position in the hmm at which the hit ends.
154
155\item[\monob{(7) alifrom:}]
156  The position in the target sequence at which the hit starts.
157
158\item[\monob{(8) ali to:}]
159  The position in the target sequence at which the hit ends.
160
161\item[\monob{(9) envfrom:}]
162  The position in the target sequence at which the surrounding envelope starts.
163
164\item[\monob{(10) env to:}]
165  The position in the target sequence at which the surrounding envelope ends.
166
167\item[\monob{(11) sq len:}]
  The length of the target sequence.
169
170\item[\monob{(12) strand:}]
  The strand on which the hit was found (``-'' when alifrom $>$ ali to).
172
173\item[\monob{(13) E-value:}] The expectation value
174  (statistical significance) of the target, as above.
175
176\item[\monob{(14) score (full sequence):}]
177  The score (in bits) for this hit. It includes the biased-composition
178  correction.
179
180\item[\monob{(15) Bias (full sequence):}] The biased-composition
  correction, as above.
182
183\item[\monob{(16) description of target:}]
184  The remainder of the line is the target's description line, as free text.
185\end{description}
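
As a hedged illustration of the DNA layout, the Python sketch below
splits a line into the 15 fields plus description and recovers the hit
span on the target. It is not part of HMMER. The function names and
dictionary keys are hypothetical, the sample line is made up, and the
use of \mono{+} for the other strand is an assumption (the text above
only specifies when the strand is ``-'').

```python
# Illustrative sketch only. Key names are our own labels for the 15
# fields described above; they are not identifiers defined by HMMER.
DNA_TBLOUT_KEYS = [
    "target_name", "target_accession", "query_name", "query_accession",
    "hmm_from", "hmm_to", "ali_from", "ali_to", "env_from", "env_to",
    "sq_len", "strand", "evalue", "score", "bias",
]
INT_KEYS = ("hmm_from", "hmm_to", "ali_from", "ali_to",
            "env_from", "env_to", "sq_len")
FLOAT_KEYS = ("evalue", "score", "bias")


def parse_dna_tblout_line(line):
    """Return a dict for one DNA tblout line, or None for non-data lines."""
    # Assumption: blank lines and lines starting with '#' carry no hit data.
    if not line.strip() or line.startswith("#"):
        return None
    parts = line.rstrip().split(None, 15)  # 15 fields, then free text
    if len(parts) < 15:
        raise ValueError("expected at least 15 fields, got %d" % len(parts))
    hit = dict(zip(DNA_TBLOUT_KEYS, parts))
    for key in INT_KEYS:
        hit[key] = int(hit[key])
    for key in FLOAT_KEYS:
        hit[key] = float(hit[key])
    hit["description"] = parts[15] if len(parts) > 15 else ""
    return hit


def inferred_strand(hit):
    """'-' when alifrom > ali to; '+' otherwise (the '+' is an assumption)."""
    return "-" if hit["ali_from"] > hit["ali_to"] else "+"


def target_span(hit):
    """Aligned region on the target as (low, high), whatever the strand."""
    return (min(hit["ali_from"], hit["ali_to"]),
            max(hit["ali_from"], hit["ali_to"]))


# A made-up data line, for demonstration only.
example = ("chrA  -  profX  -  1  120  5000  4881  5010  4870  "
           "100000  -  3.0e-12  50.2  1.1  made-up description")
hit = parse_dna_tblout_line(example)
print(hit["strand"], inferred_strand(hit), target_span(hit))
```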
186
187
188These tables are columnated neatly for human readability, but do not
189write parsers that rely on this columnation; rely on space-delimited
190fields. The pretty columnation assumes fixed maximum widths for each
191field. If a field exceeds its allotted width, it will still be fully
192represented and space-delimited, but the columnation will be disrupted
193on the rest of the row.
194
195Note the use of target and query columns. A program like
196\mono{hmmsearch} searches a query profile against a target sequence
197database. In an \mono{hmmsearch} tblout file, the sequence (target)
198name is first, and the profile (query) name is second. A program like
199\mono{hmmscan}, on the other hand, searches a query sequence against a
target profile database. In an \mono{hmmscan} tblout file, the profile
201name is first, and the sequence name is second. You might say, hey,
202wouldn't it be more consistent to put the profile name first and the
203sequence name second (or vice versa), so \mono{hmmsearch} and
204\mono{hmmscan} tblout files were identical? Well, first of all, they
205still wouldn't be identical, because the target database size used for
206E-value calculations is different (number of target sequences for
\mono{hmmsearch}, number of target profiles for \mono{hmmscan}), and
  it's good not to forget this. Second, what about programs like
209  \mono{phmmer} where the query is a sequence and the targets are also
210  sequences?
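
One way to keep the target/query convention straight in downstream
code is to record, per program, what kind of object each column holds.
The Python sketch below encodes only the three cases described above;
the table \mono{COLUMN\_ROLES} and the function \mono{split\_by\_kind}
are hypothetical names for this illustration, not part of HMMER.

```python
# Illustrative sketch only: what the target and query columns hold for
# the three programs discussed in the text.
COLUMN_ROLES = {
    "hmmsearch": {"target": "sequence", "query": "profile"},
    "hmmscan":   {"target": "profile",  "query": "sequence"},
    "phmmer":    {"target": "sequence", "query": "sequence"},
}


def split_by_kind(program, target_name, query_name):
    """Group the two names on a line by whether they are profiles or sequences."""
    roles = COLUMN_ROLES[program]
    names = {"target": target_name, "query": query_name}
    grouped = {"profile": [], "sequence": []}
    for column in ("target", "query"):
        grouped[roles[column]].append(names[column])
    return grouped


# Made-up names, for demonstration only.
print(split_by_kind("hmmsearch", "seqA", "profX"))
print(split_by_kind("hmmscan", "profX", "seqA"))
print(split_by_kind("phmmer", "seqB", "seqA"))
```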
211
212If the ``domain number estimation'' section of the protein table (exp, reg,
213clu, ov, env, dom, rep, inc) makes no sense to you, it may help to
214read the previous section of the manual, which describes the HMMER
215processing pipeline, including the steps that probabilistically define
216domain locations in a sequence.
217
218\section{The domain hits table (protein search only)}
219
220In protein search programs, the \mono{-{}-domtblout} option produces the
221\emph{domain hits table}. There is one line for each domain. There may be more than
222one domain per sequence. The domain table has \textbf{22
223  whitespace-delimited fields} followed by a free text target sequence
224description, as follows:
225
226\begin{description}
227\item[\monob{(1) target name:}] The name of the target sequence or  profile.
228
229\item[\monob{(2) target accession:}] Accession of the target sequence
230  or profile, or '-' if none is available.
231
232\item[\monob{(3) tlen:}] Length of the target sequence or profile, in residues.
233  This (together with the query length) is useful for interpreting
234  where the domain coordinates (in subsequent columns) lie in the
235  sequence.
236
237\item[\monob{(4) query name:}] Name of the query sequence or profile.
238
\item[\monob{(5) accession:}] Accession of the query sequence or
  profile, or '-' if none is available.
241
242\item[\monob{(6) qlen:}]  Length of the query sequence or profile, in residues.
243
244\item[\monob{(7) E-value:}] E-value of the overall sequence/profile
245  comparison (including all domains).
246
247\item[\monob{(8) score:}] Bit score of the overall sequence/profile
  comparison (including all domains), inclusive of the null2
  biased-composition correction to the score.
250
251\item[\monob{(9) bias:}] The biased composition score correction that
252  was applied to the bit score.
253
254\item[\monob{(10) \#:}] This domain's number (1..ndom).
255
256\item[\monob{(11) of:}] The total number of domains reported in the
257  sequence, ndom.
258
259\item[\monob{(12) c-Evalue:}] The ``conditional E-value'', a
260  permissive measure of how reliable this particular domain may be.
261  The conditional E-value is calculated on a smaller search space than
262  the independent E-value. The conditional E-value uses the number of
263  targets that pass the reporting thresholds. The null hypothesis test
264  posed by the conditional E-value is as follows. Suppose that we
265  believe that there is already sufficient evidence (from other
266  domains) to identify the set of reported sequences as homologs of
267  our query; now, how many \emph{additional} domains would we expect
268  to find with at least this particular domain's bit score, if the
269  rest of those reported sequences were random nonhomologous sequence
  (i.e.\ outside the other domain(s) that were sufficient to
  identify them as homologs in the first place)?
272
273\item[\monob{(13) i-Evalue:}] The ``independent E-value'', the
274  E-value that the sequence/profile comparison would have received if
275  this were the only domain envelope found in it, excluding any
276  others. This is a stringent measure of how reliable this particular
277  domain may be. The independent E-value uses the total number of
278  targets in the target database.
279
280\item[\monob{(14) score:}] The bit score for this domain.
281
282\item[\monob{(15) bias:}] The biased composition (null2) score
283  correction that was applied to the domain bit score.
284
285\item[\monob{(16) from (hmm coord):}]
286  The start of the MEA alignment of this domain with respect to the
287  profile, numbered 1..N for a profile of N consensus positions.
288
289\item[\monob{(17) to (hmm coord):}]
290  The end of the MEA alignment of this domain with respect to the
291  profile, numbered 1..N for a profile of N consensus positions.
292
293\item[\monob{(18) from (ali coord):}]
294  The start of the MEA alignment of this domain with respect to the
295  sequence, numbered 1..L for a sequence of L residues.
296
297\item[\monob{(19) to (ali coord):}]
298  The end of the MEA alignment of this domain with respect to the
299  sequence, numbered 1..L for a sequence of L residues.
300
301\item[\monob{(20) from (env coord):}] The start of the domain
302  envelope on the sequence, numbered 1..L for a sequence of L
  residues. The \emph{envelope} defines a subsequence for which there
304  is substantial probability mass supporting a homologous domain,
305  whether or not a single discrete alignment can be identified.
306  The envelope may extend beyond the endpoints of the MEA alignment,
307  and in fact often does, for weakly scoring domains.
308
309\item[\monob{(21) to (env coord):}] The end of the domain
310  envelope on the sequence, numbered 1..L for a sequence of L
311  residues.
312
313\item[\monob{(22) acc:}] The mean posterior probability of aligned
314  residues in the MEA alignment; a measure of how reliable the overall
315  alignment is (from 0 to 1, with 1.00 indicating a completely
316  reliable alignment according to the model).
317
318\item[\monob{(23) description of target:}] The remainder of the line
319  is the target's description line, as free text.
320\end{description}
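
To make the coordinate fields concrete, here is a minimal Python
sketch, offered as an illustration and not as part of HMMER. It splits
a domain table line into its 22 fields plus description, then reports
how far the envelope extends beyond the MEA alignment and what fraction
of the profile the alignment covers. All function names and dictionary
keys are hypothetical, the sample line is made up, and the caller is
assumed to know which of tlen and qlen is the profile length (qlen for
\mono{hmmsearch}, tlen for \mono{hmmscan}).

```python
# Illustrative sketch only. Key names are our own labels for the 22
# fields described above; they are not identifiers defined by HMMER.
DOMTBLOUT_KEYS = [
    "target_name", "target_accession", "tlen",
    "query_name", "query_accession", "qlen",
    "full_evalue", "full_score", "full_bias",
    "dom_index", "dom_total", "c_evalue", "i_evalue",
    "dom_score", "dom_bias",
    "hmm_from", "hmm_to", "ali_from", "ali_to", "env_from", "env_to",
    "acc",
]
INT_KEYS = ("tlen", "qlen", "dom_index", "dom_total", "hmm_from", "hmm_to",
            "ali_from", "ali_to", "env_from", "env_to")
FLOAT_KEYS = ("full_evalue", "full_score", "full_bias", "c_evalue",
              "i_evalue", "dom_score", "dom_bias", "acc")


def parse_domtblout_line(line):
    """Return a dict for one domain table line, or None for non-data lines."""
    # Assumption: blank lines and lines starting with '#' carry no hit data.
    if not line.strip() or line.startswith("#"):
        return None
    parts = line.rstrip().split(None, 22)  # 22 fields, then free text
    if len(parts) < 22:
        raise ValueError("expected at least 22 fields, got %d" % len(parts))
    dom = dict(zip(DOMTBLOUT_KEYS, parts))
    for key in INT_KEYS:
        dom[key] = int(dom[key])
    for key in FLOAT_KEYS:
        dom[key] = float(dom[key])
    dom["description"] = parts[22] if len(parts) > 22 else ""
    return dom


def envelope_overhang(dom):
    """Residues by which the envelope extends past each end of the alignment."""
    return (dom["ali_from"] - dom["env_from"], dom["env_to"] - dom["ali_to"])


def profile_coverage(dom, profile_length):
    """Fraction of the profile's consensus positions spanned by the alignment."""
    return (dom["hmm_to"] - dom["hmm_from"] + 1) / profile_length


# A made-up hmmsearch-style line (target is a sequence, query is a profile).
example = ("seqA  -  350  profX  ACC0001  120  1.0e-40  140.2  0.3  "
           "1  2  2.0e-20  4.0e-17  70.1  0.1  5  118  20  140  15  145  "
           "0.95  made-up description")
dom = parse_domtblout_line(example)
print(envelope_overhang(dom), profile_coverage(dom, dom["qlen"]))
```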
321
322As with the target hits table (above), this table is columnated neatly
323for human readability, but you should not write parsers that rely on
324this columnation; parse based on space-delimited fields instead.
325