### Multiple sequence alignment file formats

Easel programs can input and output ten different multiple sequence
alignment formats. There are five main formats:

| format        |   i.e.             | suffix           |
|---------------|--------------------|------------------|
| `stockholm`   | Stockholm          | .sto, .sth, .stk |
| `afa`         | aligned FASTA      | .afa, .afasta    |
| `clustal`     | CLUSTAL            |                  |
| `phylip`      | interleaved PHYLIP | .ph, .phy, .phyi |
| `selex`       | SELEX              | .slx, .selex     |

and five variants:

| format        | i.e.              | is like:    |  but:                                                |   suffix  |
|---------------|-------------------|-------------|------------------------------------------------------|-----------|
| `pfam`        | Pfam              | `stockholm` | is restricted to one block                           |  .pfam    |
| `a2m`         | UCSC A2M, dotless | `afa`       | has additional semantics for consensus columns       |  .a2m     |
| `clustallike` | Clustal-like      | `clustal`   | has another program name on first line (e.g. MUSCLE) |           |
| `phylips`     | sequential Phylip | `phylip`    | "sequential", rather than "interleaved"              |  .phys    |
| `psiblast`    | NCBI PSI-BLAST    | `selex`     | is just an alignment, has no selex annotation lines  |  .pb      |

The _format_ code is what you type to select a format in a command
line option, as in `--informat selex` or `--outformat afa`. These
codes are treated case-insensitively, so `--informat SELEX` or
`--outformat AFA` are also fine.
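To illustrate the case-insensitive handling of format codes, here is a minimal Python sketch. It is not Easel's implementation (Easel is written in C); the names `MAIN_FORMATS`, `VARIANT_OF`, `normalize_format_code`, and `base_format` are hypothetical and exist only for this example. The code-to-variant mapping is taken from the two tables above.

```python
# Hypothetical helper names; the codes themselves come from the tables above.
MAIN_FORMATS = {"stockholm", "afa", "clustal", "phylip", "selex"}

# Each variant code mapped to the main format it "is like".
VARIANT_OF = {
    "pfam": "stockholm",
    "a2m": "afa",
    "clustallike": "clustal",
    "phylips": "phylip",
    "psiblast": "selex",
}


def normalize_format_code(code: str) -> str:
    """Return the lowercase form of a format code, or raise if it is unknown."""
    canonical = code.strip().lower()
    if canonical not in MAIN_FORMATS and canonical not in VARIANT_OF:
        raise ValueError(f"unrecognized alignment format code: {code!r}")
    return canonical


def base_format(code: str) -> str:
    """Return the main format that a code belongs to (itself, if already main)."""
    canonical = normalize_format_code(code)
    return VARIANT_OF.get(canonical, canonical)
```

With this sketch, `normalize_format_code("SELEX")` and `normalize_format_code("selex")` give the same result, and `base_format("Pfam")` reports `stockholm`.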

### How alignment file formats are guessed

Normally when you open an alignment file, an Easel-based program tries
to guess its format. This saves typing and synapses when you're
working at the command line.

The guesser will never misidentify the format in a way that would
corrupt the input alignment or change the annotation. Two pairs of
formats are problematic to distinguish based on content alone:
`afa` versus `a2m`, and `phylip` versus `phylips`.

For PHYLIP files, if no hint is available from a file suffix, the
guesser will nonetheless almost always be able to tell the difference and
call `phylip` versus `phylips`.  Pathological edge cases do exist,
though, where the guesser will return an error about not being able to
distinguish interleaved from sequential.

However, `afa` and `a2m` files are so easily confusable that the
guesser will not try to distinguish them based on content alone. The
only way to get the guesser to call `a2m` is on a file with an
explicit `.a2m` suffix.

If you are doing scripted high-throughput analysis on files in one of
these formats, consider specifying your input file format and
disabling the format guesser. The command line option for this is
usually something like `--informat <fmtcode>`. Alternatively, use file
suffixes: `.afa` versus `.a2m`, or `.ph`/`.phy`/`.phyi` versus `.phys`
to tip off the guesser.

The guesser works with the following information:
 * an initial guess based on peeking at the first line of the input
 * the file suffix, if the input is a file with a file name (to distinguish `.a2m` from `.afa`,
   or `.phyi` from `.phys`, for example)
 * in more difficult cases, a deeper look into the input

More specifically:

#### `stockholm`, `pfam` formats

If the first line starts with `# STOCKHOLM`: guess `stockholm`, unless
the file suffix is `.pfam`, in which case guess `pfam`.

Pfam format is just Stockholm, but restricted to a single alignment
block. There is no difference in the alignment or annotation, so it is
harmless to read a Pfam file as Stockholm.

#### `afa`, `a2m` formats

If the first line starts with `>`: if the file suffix is `.a2m`, guess
`a2m`. Otherwise, call `afa`.

The guesser does not autodetect a2m format unless we have a `.a2m`
suffix on the file, even though it is usually possible to distinguish
afa from a2m. In afa, every sequence has the same total number of
characters but the number of upper case + dash characters can vary,
whereas the opposite is true for a2m. However, it is common to have an
afa format alignment that consists of all upper case and dashes:

```bash
>seq1
GGG-CCC-TT
>seq2
GG-GCC-TT-
```

which is also valid as a2m. Although the alignment would be the same
in either format, in a2m we would infer reference consensus
annotation, and in afa we wouldn't. The guesser is not allowed to risk
altering either alignment or annotation. Therefore a2m input requires
something affirmative like the `.a2m` file suffix or a `--informat
a2m` option.

It's also worth noting that other ambiguous cases exist that imply
different alignments in the two formats, as in this singularly
terrifying example:

```bash
this input:    means in AFA:    means in A2M:
>seq1          seq1 AAAcAA      seq1 A.AAcAA
AAAcAA         seq2 AcAAAA      seq2 AcAA.AA
>seq2
AcAAAA
```
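
The two readings of that input can be reproduced with a short Python sketch. This is an illustration of the dotless a2m convention as described in this section (upper case and `-` are consensus columns, other characters are insertions), not Easel's parser. The function names `parse_fasta_like`, `read_as_afa`, and `read_as_a2m` are hypothetical, and padding insertions on the right with `.` is an assumption chosen to match the example above.

```python
from typing import List, Tuple

Records = List[Tuple[str, str]]


def parse_fasta_like(text: str) -> Records:
    """Collect (name, sequence) pairs from '>name' records, joining wrapped lines."""
    names: List[str] = []
    chunks: List[List[str]] = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:].split()
            names.append(header[0] if header else "")
            chunks.append([])
        elif line and chunks:
            chunks[-1].append(line)
    return [(name, "".join(parts)) for name, parts in zip(names, chunks)]


def read_as_afa(records: Records) -> Records:
    """afa: every character is an aligned column, so all lengths must match."""
    if len({len(seq) for _, seq in records}) > 1:
        raise ValueError("not afa: sequences differ in length")
    return list(records)


def read_as_a2m(records: Records) -> Records:
    """Dotless a2m: upper case and '-' are consensus columns, the rest are inserts."""
    split = []
    for name, seq in records:
        consensus: List[str] = []
        inserts: List[str] = [""]      # inserts[k] precedes consensus column k
        for ch in seq:
            if ch.isupper() or ch == "-":
                consensus.append(ch)
                inserts.append("")
            elif ch != ".":            # tolerate dots if the input already has them
                inserts[-1] += ch
        split.append((name, consensus, inserts))
    if len({len(cons) for _, cons, _ in split}) > 1:
        raise ValueError("not a2m: sequences differ in consensus column count")
    ncons = len(split[0][1]) if split else 0
    # Each insert region is as wide as the longest insertion found there.
    widths = [max((len(ins[k]) for _, _, ins in split), default=0)
              for k in range(ncons + 1)]
    aligned = []
    for name, cons, ins in split:
        row = "".join(ins[k].ljust(widths[k], ".") + cons[k] for k in range(ncons))
        aligned.append((name, row + ins[ncons].ljust(widths[ncons], ".")))
    return aligned


records = parse_fasta_like(">seq1\nAAAcAA\n>seq2\nAcAAAA\n")
print(read_as_afa(records))   # [('seq1', 'AAAcAA'), ('seq2', 'AcAAAA')]
print(read_as_a2m(records))   # [('seq1', 'A.AAcAA'), ('seq2', 'AcAA.AA')]
```

For the all-upper-case `GGG-CCC-TT` example earlier in this section, both functions return the same rows, which is why content alone cannot settle the format.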

#### `clustal`, `clustallike` formats

If the first line of the input starts with `CLUSTAL`, guess `clustal`.
If the first line contains the phrase `multiple sequence alignment`,
guess `clustallike`. The file suffix doesn't matter.

Clustal and Clustal-like formats are parsed identically. The only
difference is the name of the program on the first line.

#### `phylip`, `phylips` formats

If the first line of the input starts with two integers, assume that
they are _nseq_ and _alen_, the number of sequences and number of
alignment columns for a Phylip-format alignment that follows.  If we
have a suffix and it is `.ph`, `.phy`, or `.phyi`, guess `phylip`; if
it is `.phys`, guess `phylips`. In both cases, the name width is
assumed to be the Phylip standard 10.

Otherwise, the guesser looks deeper into the input to distinguish
interleaved from sequential variants of the format, and to check
whether the input is using the standard 10-character Phylip name width
or a noncanonical width:

 * If the file is consistent with interleaved format, it is called
   `phylip` format. The standard 10-character name width is tried first,
   and if that doesn't work, a nonstandard name width is determined.

 * else, if the file is consistent with sequential format, it is
   called `phylips` format. The standard 10-character name width is
   tried first; if that fails, a nonstandard name width is determined.

It is possible to construct pathological files that are consistent
with both interleaved and sequential formats.  If you're working with
sequential Phylip files and you need to guarantee accuracy, use a
command line option like `--informat phylips`.
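
As a rough illustration of the header line and the standard name width discussed in this section, the following Python sketch reads _nseq_ and _alen_ from a first line and splits one row at a fixed name width. It is not Easel's code. The function names `parse_phylip_header` and `split_phylip_row` are hypothetical, and treating only unsigned ASCII digit strings as integers is an assumption of this sketch.

```python
from typing import Optional, Tuple


def parse_phylip_header(first_line: str) -> Optional[Tuple[int, int]]:
    """Return (nseq, alen) if the line starts with two integers, else None."""
    fields = first_line.split()
    if len(fields) < 2:
        return None
    # Assumption: an "integer" here is a run of ASCII digits with no sign.
    if not all(f.isascii() and f.isdigit() for f in fields[:2]):
        return None
    return int(fields[0]), int(fields[1])


def split_phylip_row(line: str, name_width: int = 10) -> Tuple[str, str]:
    """Split a row into its name (first name_width columns) and its residues."""
    name = line[:name_width].strip()
    # Spaces inside the sequence part are layout only, so drop them.
    residues = "".join(line[name_width:].split())
    return name, residues


header = parse_phylip_header(" 2 20")
row = "seq1".ljust(10) + "GGGCCCTTAA GGGCCCTTAA"
print(header)                 # (2, 20)
print(split_phylip_row(row))  # ('seq1', 'GGGCCCTTAAGGGCCCTTAA')
```

A file with a noncanonical name width would need a different `name_width` value; how Easel determines that width is not shown here.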

#### `selex`, `psiblast` formats

If the first line of the input doesn't conform to any of the formats
above, and we have a suffix `.slx`, guess `selex`; if we have a suffix `.pb`, guess
`psiblast`.

Otherwise the guesser looks deeper, and tests whether the input is
consistent with SELEX format; if it is, guess `selex`.

Because PSI-BLAST is a strict subset of SELEX, any file consistent with
SELEX format will be guessed to be `selex`; reading a `psiblast` file as
`selex` is harmless.  If you have a legitimate `psiblast` file and you
want to enforce stricter parsing, use a `.pb` file suffix on it, or
use a command line option like `--informat psiblast` to bypass the
guesser.
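
Taken together, the first-line and suffix rules in the sections above can be sketched in Python as follows. This is a simplified illustration, not Easel's guesser: the function name `guess_from_first_line` is hypothetical, suffixes are compared exactly as written (whether Easel matches them case-insensitively is not stated here), and the deeper content checks are not implemented, so the sketch returns `None` wherever the real guesser would look further into the input.

```python
import os
from typing import Optional


def guess_from_first_line(first_line: str, filename: Optional[str] = None) -> Optional[str]:
    """Apply only the first-line and suffix rules; None means 'look deeper'."""
    suffix = os.path.splitext(filename)[1] if filename else ""

    if first_line.startswith("# STOCKHOLM"):
        return "pfam" if suffix == ".pfam" else "stockholm"

    if first_line.startswith(">"):
        # a2m is never inferred from content, only from an explicit suffix.
        return "a2m" if suffix == ".a2m" else "afa"

    if first_line.startswith("CLUSTAL"):
        return "clustal"
    if "multiple sequence alignment" in first_line:
        return "clustallike"

    fields = first_line.split()
    if len(fields) >= 2 and all(f.isascii() and f.isdigit() for f in fields[:2]):
        if suffix in (".ph", ".phy", ".phyi"):
            return "phylip"
        if suffix == ".phys":
            return "phylips"
        return None  # interleaved versus sequential needs a deeper look

    if suffix == ".slx":
        return "selex"
    if suffix == ".pb":
        return "psiblast"
    return None  # the SELEX consistency test is not sketched here
```

For example, `guess_from_first_line(">seq1", "aln.a2m")` gives `a2m`, while the same first line with no file name gives `afa`.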