
### Multiple sequence alignment file formats

Easel programs can input and output ten different multiple sequence
alignment formats. There are five main formats:

| format        | i.e.               | suffix           |
|---------------|--------------------|------------------|
| `stockholm`   | Stockholm          | .sto, .sth, .stk |
| `afa`         | aligned FASTA      | .afa, .afasta    |
| `clustal`     | CLUSTAL            |                  |
| `phylip`      | interleaved PHYLIP | .ph, .phy, .phyi |
| `selex`       | SELEX              | .slx, .selex     |

and five variants:

| format        | i.e.              | is like:    | but:                                                 | suffix    |
|---------------|-------------------|-------------|------------------------------------------------------|-----------|
| `pfam`        | Pfam              | `stockholm` | is restricted to one block                           | .pfam     |
| `a2m`         | UCSC A2M, dotless | `afa`       | has additional semantics for consensus columns       | .a2m      |
| `clustallike` | Clustal-like      | `clustal`   | has another program name on first line (e.g. MUSCLE) |           |
| `phylips`     | sequential Phylip | `phylip`    | "sequential", rather than "interleaved"              | .phys     |
| `psiblast`    | NCBI PSI-BLAST    | `selex`     | is just an alignment, has no selex annotation lines  | .pb       |


The _format_ code is what you type to select a format in a command
line option, as in `--informat selex` or `--outformat afa`. These
codes are treated case-insensitively, so `--informat SELEX` or
`--outformat AFA` are also fine.

### How alignment file formats are guessed

Normally when you open an alignment file, an Easel-based program tries
to guess its format. This saves typing and synapses when you're
working at the command line.

The guesser will never misidentify the format in a way that would
corrupt the input alignment or change the annotation. There are
formats that are problematic to distinguish based on content alone:
`afa` versus `a2m`, and `phylip` versus `phylips`.

For PHYLIP files, if no hint is available from a file suffix, the
guesser will nonetheless almost always be able to tell the difference
and call `phylip` versus `phylips`. Pathological edge cases do exist,
though, where the guesser will return an error about not being able to
distinguish interleaved from sequential.

However, `afa` and `a2m` files are so easily confusable that the
guesser will not try to distinguish them based on content alone. The
only way to get the guesser to call `a2m` is on a file with an
explicit `.a2m` suffix.

If you are doing scripted high throughput analysis on files in one of
these formats, consider specifying your input file format and
disabling the format guesser. The command line option for this is
usually something like `--informat <fmtcode>`. Alternatively, use file
suffixes: `.afa` versus `.a2m`, or `.ph`/`.phy`/`.phyi` versus `.phys`
to tip off the guesser.

The guesser works with the following information:

  * an initial guess based on peeking at the first line of the input
  * if the input is a file with a file name, it uses the suffix as a clue (to distinguish .a2m versus .afa,
    or .phyi from .phys, for example)
  * in more difficult cases, the guesser looks more deeply into the input

More specifically:

#### `stockholm`, `pfam` formats

If the first line starts with `# STOCKHOLM`: guess `stockholm`, unless
the file suffix is `.pfam`, in which case guess `pfam`.

Pfam format is just Stockholm, but restricted to a single alignment
block. There is no difference in the alignment or annotation, so it is
harmless to read a Pfam file as Stockholm.

#### `afa`, `a2m` formats

If the first line starts with `>`: if the file suffix is `.a2m`, guess
`a2m`. Otherwise, call `afa`.

The guesser does not autodetect a2m format unless we have a `.a2m`
suffix on the file, even though it is usually possible to distinguish
afa from a2m.
In afa, the number of aligned characters is always the
same but the number of upper case + dash characters can vary, whereas
the opposite is true for a2m. However, it is common to have an afa
format alignment that consists of all upper case and dashes:

```bash
>seq1
GGG-CCC-TT
>seq2
GG-GCC-TT-
```

which is also valid as a2m. Although the alignment would be the same
in either format, in a2m we would infer reference consensus
annotation, and in afa we wouldn't. The guesser is not allowed to risk
altering either alignment or annotation. Therefore a2m input requires
something affirmative like the `.a2m` file suffix or a `--informat
a2m` option.

It's also worth noting that other ambiguous cases exist that imply
different alignments in the two formats, as in this singularly
terrifying example:

```bash
this input:     means in AFA:       means in A2M:
>seq1           seq1 AAAcAA         seq1 A.AAcAA
AAAcAA          seq2 AcAAAA         seq2 AcAA.AA
>seq2
AcAAAA
```


#### `clustal`, `clustallike` formats

If the first line of the input starts with `CLUSTAL`, guess `clustal`.
If the first line contains the phrase `multiple sequence alignment`,
guess `clustallike`. The file suffix doesn't matter.

Clustal and Clustal-like formats are parsed identically. The only
difference is the name of the program on the first line.

#### `phylip`, `phylips` formats

If the first line of the input starts with two integers, assume that
they are _nseq_ and _alen_, the number of sequences and number of
alignment columns for a Phylip-format alignment that follows. If we
have a suffix and it is `.ph`, `.phy`, or `.phyi`, guess `phylip`; if
it is `.phys`, guess `phylips`. In both cases, the name width is
assumed to be the Phylip standard 10.

Otherwise, the guesser looks deeper into the input to distinguish
interleaved from sequential variants of the format, and to check
whether the input is using the standard 10-character Phylip name width
or a noncanonical width:

  * If the file is consistent with interleaved format, it is called
    `phylip` format. The standard 10-character name width is tried first,
    and if that doesn't work, a nonstandard name width is determined.

  * else, if the file is consistent with sequential format, it is
    called `phylips` format. The standard 10-character name width is
    tried first; if that fails, a nonstandard name width is determined.

It is possible to construct pathological files that are consistent
with both interleaved and sequential formats. If you're working with
sequential Phylip files and you need to guarantee accuracy, use a
command line option like `--informat phylips`.


#### `selex`, `psiblast` formats

If the first line of the input doesn't conform to any of the formats
above, and we have a suffix `.slx`, guess `selex`; if we have a suffix `.pb`, guess
`psiblast`.

Otherwise the guesser looks deeper, and tests whether the input is
consistent with SELEX format; if it is, guess `selex`.

Because PSI-BLAST is a strict subset of SELEX, any file consistent
with SELEX format will be guessed to be `selex`; reading a `psiblast`
file as `selex` is harmless. If you have a legitimate `psiblast` file
and you want to enforce stricter parsing, use a `.pb` file suffix on
it, or use a command line option like `--informat psiblast` to bypass
the guesser.
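
The first-line and suffix rules described in the sections above can be summarized in a short Python sketch. This is an illustration only, not Easel's implementation: the function name `guess_msa_format`, its arguments, and the use of `None` to mean "the guesser would have to look deeper into the input" are all assumptions made for this example. The deeper content checks (interleaved versus sequential Phylip, SELEX consistency) are deliberately left out.

```python
import re
from typing import Optional

# Suffixes that tip the guesser toward interleaved Phylip, per the text above.
PHYLIP_SUFFIXES = {".ph", ".phy", ".phyi"}


def guess_msa_format(first_line: str, suffix: str = "") -> Optional[str]:
    """Hypothetical sketch of the first-line + suffix rules.

    Returns a format code, or None where the real guesser would need to
    inspect more of the input than this sketch does.
    """
    if first_line.startswith("# STOCKHOLM"):
        return "pfam" if suffix == ".pfam" else "stockholm"

    if first_line.startswith(">"):
        # a2m is never called from content alone; it needs the suffix.
        return "a2m" if suffix == ".a2m" else "afa"

    if first_line.startswith("CLUSTAL"):
        return "clustal"
    if "multiple sequence alignment" in first_line:
        return "clustallike"

    # Two leading integers are taken to be nseq and alen (Phylip header).
    if re.match(r"\s*\d+\s+\d+", first_line):
        if suffix in PHYLIP_SUFFIXES:
            return "phylip"
        if suffix == ".phys":
            return "phylips"
        return None  # interleaved vs. sequential needs a deeper look

    # Nothing matched so far: fall back on the suffixes named in the text.
    if suffix == ".slx":
        return "selex"
    if suffix == ".pb":
        return "psiblast"
    return None  # a SELEX consistency test on the content would go here


print(guess_msa_format("# STOCKHOLM 1.0", ".sto"))
print(guess_msa_format(">seq1", ".a2m"))
print(guess_msa_format(" 5 42"))
```

In this sketch the suffix only ever refines a guess that the first line has already made plausible, which mirrors the rule that the guesser must not risk changing the alignment or its annotation.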