\chapter{Tabular output formats}
\label{chapter:tabular}
\setcounter{footnote}{0}

\section{The target hits table}

The \mono{-{}-tblout} output option produces the \emph{target hits
table}. The target hits table consists of one line for each
different query/target comparison that met the reporting thresholds,
ranked by decreasing statistical significance (increasing E-value).


\paragraph{tblout fields for protein search programs}

In the protein search programs, each line consists of \textbf{18
space-delimited fields} followed by a free text target sequence description, as
follows:\marginnote{The \mono{tblout} format is deliberately space-delimited
(rather than tab-delimited) and justified into aligned columns, so these files
  are suitable both for automated parsing and for human
  examination. I feel that tab-delimited data files are difficult for humans to
  examine and spot check. For this reason, I think tab-delimited
  files are a minor evil in the world. Although I occasionally
  receive shrieks of outrage about this, I still stubbornly feel that
  space-delimited files are just as easily parsed as tab-delimited
  files.}

\begin{description}
\item[\monob{(1) target name:}]
  The name of the target sequence or profile.

\item[\monob{(2) accession:}]
  The accession of the target sequence or profile, or '-' if none.

\item[\monob{(3) query name:}]
  The name of the query sequence or profile.

\item[\monob{(4) accession:}]
  The accession of the query sequence or profile, or '-' if none.

\item[\monob{(5) E-value (full sequence):}] The expectation value
  (statistical significance) of the target. This is a \emph{per
  query} E-value; i.e.\ calculated as the expected number of false
  positives achieving this comparison's score for a \emph{single}
  query against the $Z$ sequences in the target dataset.
  If you search with multiple queries and want to control the
  \emph{overall} false positive rate of that search rather than the
  false positive rate per query, you will want to multiply this
  per-query E-value by the number of queries you're doing.

\item[\monob{(6) score (full sequence):}]
  The score (in bits) for this target/query comparison. It includes
  the biased-composition correction (the ``null2'' model).

\item[\monob{(7) bias (full sequence):}] The biased-composition
  correction: the bit score difference contributed by the null2
  model. High bias scores may be a red flag for a false positive,
  especially when the bias score is as large as or larger than the
  overall bit score. It is difficult to correct for all possible ways
  in which nonrandom but nonhomologous biological sequences can
  appear to be similar, such as short-period tandem repeats, so there
  are cases where the bias correction is not strong enough (creating
  false positives).

\item[\monob{(8) E-value (best 1 domain):}] The E-value if only the
  single best-scoring domain envelope were found in the sequence, and
  none of the others. If this E-value isn't good, but the full
  sequence E-value is good, this is a potential red flag. Weak hits,
  none of which are good enough on their own, are summing up to lift
  the sequence up to a high score. Whether this is Good or Bad is not
  clear; the sequence may contain several weak homologous domains, or
  it might contain a repetitive sequence that is hitting by chance
  (i.e.\ once one repeat hits, all the repeats hit).

\item[\monob{(9) score (best 1 domain):}] The bit score if only the
  single best-scoring domain envelope were found in the sequence, and
  none of the others. (Inclusive of the null2 bias correction.)

\item[\monob{(10) bias (best 1 domain):}] The null2 bias correction
  that was applied to the bit score of the single best-scoring domain.

\item[\monob{(11) exp:}] Expected number of domains, as calculated by
  posterior decoding on the mean number of begin states used in the
  alignment ensemble.

\item[\monob{(12) reg:}] Number of discrete regions defined, as
  calculated by heuristics applied to posterior decoding of begin/end
  state positions in the alignment ensemble. The number of regions
  will generally be close to the expected number of domains. The more
  different the two numbers are, the less discrete the regions appear
  to be, in terms of probability mass. This usually means one of two
  things. On the one hand, weak homologous domains may be difficult
  for the heuristics to identify clearly. On the other hand,
  repetitive sequence may appear to have a high expected domain number
  (from lots of crappy possible alignments in the ensemble, no one of
  which is very convincing on its own, so no one region is discretely
  well-defined).

\item[\monob{(13) clu:}] Number of regions that appeared to be
  multidomain, and therefore were passed to stochastic traceback
  clustering for further resolution down to one or more
  envelopes. This number is often zero.

\item[\monob{(14) ov:}] For envelopes that were defined by stochastic
  traceback clustering, how many of them overlap other envelopes.

\item[\monob{(15) env:}]
  The total number of envelopes defined, both by single envelope
  regions and by stochastic traceback clustering into one or more
  envelopes per region.

\item[\monob{(16) dom:}] Number of domains defined. In general, this
  is the same as the number of envelopes: for each envelope, we find
  an MEA (maximum expected accuracy) alignment, which defines the
  endpoints of the alignable domain.

\item[\monob{(17) rep:}]
  Number of domains satisfying reporting thresholds. If you've also
  saved a \mono{-{}-domtblout} file, there will be one line in it
  for each reported domain.

\item[\monob{(18) inc:}]
  Number of domains satisfying inclusion thresholds.

\item[\monob{(19) description of target:}]
  The remainder of the line is the target's description line, as free text.
\end{description}



\paragraph{tblout fields for DNA search programs}

In the DNA search programs, there is less concentration on domains, and more
focus on presenting the hit ranges. Each line consists of \textbf{15
space-delimited fields} followed by a free text target sequence description, as follows:

\begin{description}
\item[\monob{(1) target name:}]
  The name of the target sequence or profile.

\item[\monob{(2) accession:}]
  The accession of the target sequence or profile, or '-' if none.

\item[\monob{(3) query name:}]
  The name of the query sequence or profile.

\item[\monob{(4) accession:}]
  The accession of the query sequence or profile, or '-' if none.

\item[\monob{(5) hmmfrom:}]
  The position in the hmm at which the hit starts.

\item[\monob{(6) hmm to:}]
  The position in the hmm at which the hit ends.

\item[\monob{(7) alifrom:}]
  The position in the target sequence at which the hit starts.

\item[\monob{(8) ali to:}]
  The position in the target sequence at which the hit ends.

\item[\monob{(9) envfrom:}]
  The position in the target sequence at which the surrounding envelope starts.

\item[\monob{(10) env to:}]
  The position in the target sequence at which the surrounding envelope ends.

\item[\monob{(11) sq len:}]
  The length of the target sequence.

\item[\monob{(12) strand:}]
  The strand on which the hit was found (``-'' when alifrom $>$ ali to).

\item[\monob{(13) E-value:}] The expectation value
  (statistical significance) of the target, as above.

\item[\monob{(14) score (full sequence):}]
  The score (in bits) for this hit. It includes the biased-composition
  correction.

\item[\monob{(15) bias (full sequence):}] The biased-composition
  correction, as above.

\item[\monob{(16) description of target:}]
  The remainder of the line is the target's description line, as free text.
\end{description}


These tables are columnated neatly for human readability, but do not
write parsers that rely on this columnation; rely on space-delimited
fields. The pretty columnation assumes fixed maximum widths for each
field. If a field exceeds its allotted width, it will still be fully
represented and space-delimited, but the columnation will be disrupted
on the rest of the row.

Note the use of target and query columns. A program like
\mono{hmmsearch} searches a query profile against a target sequence
database. In an \mono{hmmsearch} tblout file, the sequence (target)
name is first, and the profile (query) name is second. A program like
\mono{hmmscan}, on the other hand, searches a query sequence against a
target profile database. In an \mono{hmmscan} tblout file, the profile
name is first, and the sequence name is second. You might say, hey,
wouldn't it be more consistent to put the profile name first and the
sequence name second (or vice versa), so \mono{hmmsearch} and
\mono{hmmscan} tblout files were identical? Well, first of all, they
still wouldn't be identical, because the target database size used for
E-value calculations is different (number of target sequences for
\mono{hmmsearch}, number of target profiles for \mono{hmmscan}), and
  it's good not to forget this. Second, what about programs like
  \mono{phmmer} where the query is a sequence and the targets are also
  sequences?
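
As an illustration of parsing on space-delimited fields rather than on
column positions, here is a minimal Python sketch for the protein
target hits table. It is not part of HMMER: the function names and
dictionary keys are hypothetical, chosen only for this example, and the
sketch assumes that header and footer lines begin with \mono{\#} and
can be skipped. Splitting at most 18 times keeps the free text
description intact as a single final piece, even when it contains
spaces. The second helper applies the multiple-query scaling described
for the full sequence E-value.

```python
from typing import Dict, Iterable, Iterator, List

# Labels for the 18 space-delimited protein tblout fields, in order.
# These key names are hypothetical; they are not defined by HMMER.
PROTEIN_TBLOUT_FIELDS: List[str] = [
    "target_name", "target_accession", "query_name", "query_accession",
    "full_evalue", "full_score", "full_bias",
    "best_evalue", "best_score", "best_bias",
    "exp", "reg", "clu", "ov", "env", "dom", "rep", "inc",
]


def parse_protein_tblout(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict of string-valued fields per hit line."""
    n = len(PROTEIN_TBLOUT_FIELDS)
    for line in lines:
        line = line.rstrip()
        # Assumption: blank lines and lines starting with '#' carry no hits.
        if not line or line.startswith("#"):
            continue
        # Split on runs of whitespace, at most n times, so that the
        # description (everything after field n) stays in one piece.
        parts = line.split(None, n)
        if len(parts) < n:
            raise ValueError(f"expected at least {n} fields: {line!r}")
        record = dict(zip(PROTEIN_TBLOUT_FIELDS, parts[:n]))
        record["description"] = parts[n] if len(parts) > n else ""
        yield record


def overall_evalue(per_query_evalue: float, n_queries: int) -> float:
    """Scale a per-query E-value by the number of queries searched."""
    return per_query_evalue * n_queries
```

The DNA table could be handled the same way with its own list of 15
field labels.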

If the ``domain number estimation'' section of the protein table (exp, reg,
clu, ov, env, dom, rep, inc) makes no sense to you, it may help to
read the previous section of the manual, which describes the HMMER
processing pipeline, including the steps that probabilistically define
domain locations in a sequence.

\section{The domain hits table (protein search only)}

In protein search programs, the \mono{-{}-domtblout} option produces the
\emph{domain hits table}. There is one line for each domain. There may be more than
one domain per sequence. The domain table has \textbf{22
  whitespace-delimited fields} followed by a free text target sequence
description, as follows:

\begin{description}
\item[\monob{(1) target name:}] The name of the target sequence or profile.

\item[\monob{(2) target accession:}] Accession of the target sequence
  or profile, or '-' if none is available.

\item[\monob{(3) tlen:}] Length of the target sequence or profile, in residues.
  This (together with the query length) is useful for interpreting
  where the domain coordinates (in subsequent columns) lie in the
  sequence.

\item[\monob{(4) query name:}] Name of the query sequence or profile.

\item[\monob{(5) accession:}] Accession of the query sequence or
  profile, or '-' if none is available.

\item[\monob{(6) qlen:}] Length of the query sequence or profile, in residues.

\item[\monob{(7) E-value:}] E-value of the overall sequence/profile
  comparison (including all domains).

\item[\monob{(8) score:}] Bit score of the overall sequence/profile
  comparison (including all domains), inclusive of a null2 bias
  composition correction to the score.

\item[\monob{(9) bias:}] The biased composition score correction that
  was applied to the bit score.

\item[\monob{(10) \#:}] This domain's number (1..ndom).

\item[\monob{(11) of:}] The total number of domains reported in the
  sequence, ndom.

\item[\monob{(12) c-Evalue:}] The ``conditional E-value'', a
  permissive measure of how reliable this particular domain may be.
  The conditional E-value is calculated on a smaller search space than
  the independent E-value. The conditional E-value uses the number of
  targets that pass the reporting thresholds. The null hypothesis test
  posed by the conditional E-value is as follows. Suppose that we
  believe that there is already sufficient evidence (from other
  domains) to identify the set of reported sequences as homologs of
  our query; now, how many \emph{additional} domains would we expect
  to find with at least this particular domain's bit score, if the
  rest of those reported sequences were random nonhomologous sequence
  (i.e.\ outside the other domain(s) that were sufficient to
  identify them as homologs in the first place)?

\item[\monob{(13) i-Evalue:}] The ``independent E-value'', the
  E-value that the sequence/profile comparison would have received if
  this were the only domain envelope found in it, excluding any
  others. This is a stringent measure of how reliable this particular
  domain may be. The independent E-value uses the total number of
  targets in the target database.

\item[\monob{(14) score:}] The bit score for this domain.

\item[\monob{(15) bias:}] The biased composition (null2) score
  correction that was applied to the domain bit score.

\item[\monob{(16) from (hmm coord):}]
  The start of the MEA alignment of this domain with respect to the
  profile, numbered 1..N for a profile of N consensus positions.

\item[\monob{(17) to (hmm coord):}]
  The end of the MEA alignment of this domain with respect to the
  profile, numbered 1..N for a profile of N consensus positions.

\item[\monob{(18) from (ali coord):}]
  The start of the MEA alignment of this domain with respect to the
  sequence, numbered 1..L for a sequence of L residues.

\item[\monob{(19) to (ali coord):}]
  The end of the MEA alignment of this domain with respect to the
  sequence, numbered 1..L for a sequence of L residues.

\item[\monob{(20) from (env coord):}] The start of the domain
  envelope on the sequence, numbered 1..L for a sequence of L
  residues. The \emph{envelope} defines a subsequence for which there
  is substantial probability mass supporting a homologous domain,
  whether or not a single discrete alignment can be identified.
  The envelope may extend beyond the endpoints of the MEA alignment,
  and in fact often does, for weakly scoring domains.

\item[\monob{(21) to (env coord):}] The end of the domain
  envelope on the sequence, numbered 1..L for a sequence of L
  residues.

\item[\monob{(22) acc:}] The mean posterior probability of aligned
  residues in the MEA alignment; a measure of how reliable the overall
  alignment is (from 0 to 1, with 1.00 indicating a completely
  reliable alignment according to the model).

\item[\monob{(23) description of target:}] The remainder of the line
  is the target's description line, as free text.
\end{description}

As with the target hits table (above), this table is columnated neatly
for human readability, but you should not write parsers that rely on
this columnation; parse based on space-delimited fields instead.
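
The same field-based approach can be sketched for the domain hits
table. The Python below is only an illustration under stated
assumptions, not part of HMMER: the key names are hypothetical, lines
beginning with \mono{\#} are assumed to be comments, and the choice of
which fields to convert to integers or floats simply follows the field
descriptions above (lengths, domain numbers and coordinates as
integers; E-values, scores, biases and acc as floats).

```python
from typing import Dict, Iterable, Iterator, List, Union

# Labels for the 22 whitespace-delimited domtblout fields, in order.
# These key names are hypothetical; they are not defined by HMMER.
DOMTBLOUT_FIELDS: List[str] = [
    "target_name", "target_accession", "tlen",
    "query_name", "query_accession", "qlen",
    "full_evalue", "full_score", "full_bias",
    "dom_index", "dom_total", "c_evalue", "i_evalue",
    "dom_score", "dom_bias",
    "hmm_from", "hmm_to", "ali_from", "ali_to", "env_from", "env_to",
    "acc",
]
INT_FIELDS = {"tlen", "qlen", "dom_index", "dom_total", "hmm_from",
              "hmm_to", "ali_from", "ali_to", "env_from", "env_to"}
FLOAT_FIELDS = {"full_evalue", "full_score", "full_bias", "c_evalue",
                "i_evalue", "dom_score", "dom_bias", "acc"}

Value = Union[str, int, float]


def parse_domtblout(lines: Iterable[str]) -> Iterator[Dict[str, Value]]:
    """Yield one dict per domain line, with numeric fields converted."""
    n = len(DOMTBLOUT_FIELDS)
    for line in lines:
        line = line.rstrip()
        # Assumption: blank lines and lines starting with '#' carry no hits.
        if not line or line.startswith("#"):
            continue
        # At most n splits, so the free text description stays whole.
        parts = line.split(None, n)
        if len(parts) < n:
            raise ValueError(f"expected at least {n} fields: {line!r}")
        record: Dict[str, Value] = {}
        for key, text in zip(DOMTBLOUT_FIELDS, parts[:n]):
            if key in INT_FIELDS:
                record[key] = int(text)
            elif key in FLOAT_FIELDS:
                record[key] = float(text)
            else:
                record[key] = text
        record["description"] = parts[n] if len(parts) > n else ""
        yield record
```

With records in this form, comparing the envelope coordinates with the
alignment coordinates shows how far an envelope extends beyond its MEA
alignment.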