1# esl_dsqdata: efficient sequence database format 2 3The `dsqdata` module implements a binary sequence data format. It 4accelerates sequence data input in four ways, compared to Easel 5flatfile parsers in `sqio`: 6 7* __Asynchronous input.__ 8 Disk and CPU resources are used concurrently, using POSIX threads. 9 A "loader" thread does essentially nothing but read chunks of 10 data. An "unpacker" thread does CPU work to prepare loaded 11 sequence data chunks for consumption. If it takes time $R$ to read 12 and $P$ to process the data, instead of overall time $R+P$, with 13 asynchronous input we only need time $\mathrm{max}(R,P)$. 14 15* __Predigitization.__ 16 Sequence data in the `dsqdata` format are already encoded in 17 Easel digital sequence format. User-oriented error checking is done 18 up front when the `dsqdata` file is created. 19 20* __Bit packing.__ 21 Disk read time is typically rate-limiting in HMMER and other 22 Easel-based programs, so minimizing data volume is critical. 23 Sequence data are packed bitwise in 32-bit packets to reduce volume 24 by a factor of 1.5 (protein) to 3.75 (nucleic). A packet contains 25 six 5-bit residues (protein or degenerate nucleic) or fifteen 2-bit 26 residues (canonical nucleic) and two control bits. 27 28* __Separate metadata.__ 29 Sequence data and metadata (name, accession, description, taxonomy 30 identifier) are stored separately in `.dsqs` and `.dsqm` 31 files. This streamlines unpacking, because these data are handled 32 differently. It also allows a deferred metadata read: sequences may 33 be identified simply by index number during an initial processing 34 sweep, and metadata can be loaded later by random access for a small 35 number of targets of interest. 36 37The following table lists the functions in the `dsqdata` API. 38 39| Function | Synopsis | 40|--------------------------------|--------------------------------------------------------------| 41| `esl_dsqdata_Open()` | Open a digital sequence database for reading | 42| `esl_dsqdata_Read()` | Read next chunk of sequence data. | 43| `esl_dsqdata_Recycle()` | Give a chunk back to the reader. | 44| `esl_dsqdata_Close()` | Close a dsqdata reader. | 45| `esl_dsqdata_Write()` | Create a dsqdata database | 46 47 48## dsqdata format's four files 49 50The format of a database `mydb` consists of four files: 51 52| File | Contents | Description | 53|-------------|-----------|--------------------------------------------------------------| 54| `mydb` | Stub | Human-readable information about the data | 55| `mydb.dsqi` | Index | Disk offsets for each seq in metadata and sequence files | 56| `mydb.dsqm` | Metadata | Name, accession, description, and taxonomy ids | 57| `mydb.dsqs` | Sequence | Sequences (digitized, packed) | 58 59The database is specified on command lines by the name of the stub 60file (`mydb`), without any suffix. For example, 61 62 % myprogram mydb 63 64says to open `mydb`. The `esl_dsqdata_Open()` call then opens all four 65files. 66 67 68## definition of dsqdata file formats 69 70### the stub file 71 72An example stub file: 73 74``` 75Easel dsqdata v1 x4019752601 76 77Original file: refprot.fa 78Original format: FASTA 79Type: amino 80Sequences: 11432138 81Residues: 4358716588 82``` 83 84The first line is the only line in the stub file that's parsed by the 85reader. Its text format matches `/Easel dsqdata v(\d+) x(\d+)/`. The first 86field is a version number for the format, $\geq$ 1. It is currently unused, but 87in the future we might need it to parse different versions of the format, if we 88need to update it. The second field is a 32-bit unsigned integer tag in the 89range 0..$2^{32}-1$. Each of the four files carries the same randomly generated 90tag. The tag is used to ascertain that the four files belong together in the 91same database, as opposed to one or more of them being inadvertently clobbered 92somehow by the user. 93 94After the first line, the rest of the stub file is ignored by the Easel reader, 95and can contain anything -- even your own notes, if you want to add any. The 96text here is the useful information that the Easel writer writes by default. 97 98### the .dsqi index file 99 100The purpose of the .dsqi index file is to (quickly) give us data offsets (in 101bytes) we need to randomly access the metadata and sequence data for sequence 102number 0..nseq-1 in the .dsqm and .dsqs files, or for any specified range of 103sequences. 104 105The header of the binary index file consists of: 106 107| name | type | description | 108|--------------|------------|----------------------------------------------| 109| magic | `uint32_t` | magic number (version, byte order) | 110| uniquetag | `uint32_t` | random integer tag (0..$2^{32}-1$) | 111| alphatype | `uint32_t` | alphabet type code (1,2,3 = RNA, DNA, amino) | 112| flags | `uint32_t` | Currently 0. Reserved for future flags | 113| max_namelen | `uint32_t` | Maximum seq name length in metadata | 114| max_acclen | `uint32_t` | Maximum accession length in metadata | 115| max_desclen | `uint32_t` | Maximum description length in metadata | 116| max_seqlen | `uint64_t` | Maximum sequence length | 117| nseq | `uint64_t` | Total number of sequences in database | 118| nres | `uint64_t` | Total number of residues in database | 119 120The **magic** is used to check that the file is indeed a dsqdata 121format file, and to detect byte order swapping. Valid values for the 122magic version/byteorder number are: 123 124| value | derivation | description | 125|------------|---------------------|--------------------------| 126| 0xc4d3d1b1 | "dsq1" + 0x80808080 | dsqdata version 1 format | 127| 0xb1d1d3c4 | above, byteswapped | above, byteswapped | 128 129The **uniquetag** matches the tag seen in the other files. 130 131The dsqdata packet format is only defined for biological sequence alphabets. 132Valid integer values for the **alphatype** code come from a subset of the codes 133used in `esl_alphabet.h`: 134 135| value | `esl_alphabet.h` | description | 136|-------|------------------|-------------| 137| 1 | `eslRNA` | RNA | 138| 2 | `eslDNA` | DNA | 139| 3 | `eslAMINO` | protein | 140 141The unused **flags** field gives us some flexibility for future 142versions of the format. 143 144The maximum lengths of the names, accessions, and descriptions in the 145metadata file might someday be useful (in making allocations, for 146example) but they are currently unused by the Easel reader. 147 148Likewise, the maximum sequence length, total number of sequences, and 149total number of residues in the database may someday be useful (for 150making decisions about how to partition a parallel search, for 151example), but they are currently unused too. 152 153After the header, the remainder of the file consists of `nseq` 154records of type `ESL_DSQDATA_RECORD` (defined in 155`esl_dsqdata.h`): 156 157| element | type | description | 158|----------------|-----------|-------------------------------------------------------------------------| 159| `metadata_end` | `int64_t` | Position of terminal `\0` of metadata for seq i in .dsqm file, in bytes | 160| `psq_end` | `int64_t` | Position of final packet for sequence i in .dsqs file, in packets | 161 162Storing _end_ positions instead of _start_ positions allows 163us to determine lengths, without needing an n+1'th sentinel record, 164albeit at the cost of special casing what happens for the first 165sequence i=0. For example: 166 167``` 168 len[i] = (i == 0 ? r[i].end + 1 : r[i].end - r[i-1].end) 169 start[i] = (i == 0 ? 0 : r[i-1].end + 1) 170``` 171 172This is equivalent to treating `r[-1].end = -1`. Some of the Easel reader's code 173tracks a `last_end` variable for the end of the previous metadata or packed 174sequence field i-1, which is initialized to -1. This -1 boundary condition is 175why we use _signed_ `int64_t` types. 176 177Packet sequence endpoints are stored in units of unsigned 32-bit binary 178_packets_, not in bytes. To convert to a disk offset or a length in bytes you 179multiply by 4 (`sizeof(uint32_t)`). 180 181Keeping the size of the dsqdata files as small as possible is critical 182because the reading speed is limited by the raw size of the 183data. Therefore we don't store separate positions for the different 184metadata fields (name/accession/description/taxonomy id); only one 185position for all the metadata associated with sequence i. The reader 186reads all of it in one chunk, and parses it for the stored `\0` 187sentinels. 188 189For the same reason, we don't store any information about _unpacked_ sequence 190lengths, only the bare minimum of information that the dsqdata loader and 191unpacker need to locate, load, and unpack the packed data for any given sequence 192i. The unpacker determines the unpacked sequence length when it unpacks the 193data. 194 195 196### the .dsqm metadata file 197 198The metadata file starts with two header fields, the same two that the 199index file starts with: 200 201| name | type | description | 202|-----------|------------|------------------------------------| 203| magic | `uint32_t` | magic number (version, byte order) | 204| uniquetag | `uint32_t` | random integer tag (0..$2^32-1$) | 205 206After the header, the remainder of the file consists of the following 207data for each sequence i = 0..nseq-1: 208 209| field | type | description | 210|--------------|-----------------------------|-----------------------------------------------------| 211| name | `char` array ending in `\0` | sequence name (1 word, no whitespace); mandatory | 212| accession | `char` array ending in `\0` | sequence accession (1 word, no whitespace); or "\0" | 213| description | `char` array ending in `\0` | sequence description line; or "\0" | 214| taxonomy id | `int32_t` | NCBI taxonomy identifier; or -1 | 215 216 217The name, accession, and description are variable length strings. The name and 218accession are single "words" with no whitespace (`\S+`). The description is one 219line, may contain spaces, but may not contain any newlines. All sequences must 220have a name, so `strlen(name) > 0`. The accession and description are optional; 221if they are not present, these are 0-length strings ("\0"). 222 223The taxonomy identifier is an integer in NCBI's taxonomy. Valid taxonomy 224identifiers are $\geq 1$. This field is optional; use a value of -1 to indicate 225unset. 226 227(I cannot find any documentation at NCBI on the maximum range of the taxid, nor 228can I find a clear statement of whether 0 is valid or not. 0 is currently unused 229in the NCBI taxonomy. 1 indicates the top level. That makes it look like it's 230safe to treat 0 as "unset" but it seems even safer to go with -1 and a signed 231integer. Unless NCBI ends up having more than two billion species. Currently 232there are about 1.8 million.) 233 234These names, types, and semantics match the corresponding fields in an `ESL_SQ`. 235 236### the .dsqs sequence file 237 238The sequence file also starts with the same two header fields that the 239index and metadata files started with: 240 241| name | type | description | 242|-----------|------------|------------------------------------| 243| magic | `uint32_t` | magic number (version, byte order) | 244| uniquetag | `uint32_t` | random integer tag (0..$2^32-1$) | 245 246After the header, the remainder of the file consists of the packed 247sequences, with one packet array for each sequence i = 0..nseq-1. Each 248packet array ends with a specially marked sentinel packet. The packet 249format is described next. 250 251#### packet format 252 253Each packet is an unsigned 32 bit integer. The two leading (most 254significant) bits are control bits. Bit 31 signals EOD (end of data): 255the last packet in a packed sequence. Bit 30 signals the packet 256format: 1 for 5-bit, 0 for 2-bit. The remaining bits are the packed 257residue codes: 258 259``` 260 [31] [30] [29..25] [24..20] [19..15] [14..10] [ 9..5 ] [ 4..0 ] 261 ^ ^ |------------ 6 5-bit packed residues ------------------| 262 | | [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] 263 | | |----------- or 15 2-bit packed residues ----------------| 264 | | 265 | "packtype" bit 30 = 0 if packet is 2-bit packed; 1 if 5-bit packed 266 "sentinel" bit 31 = 1 if last packet in packed sequence; else 0 267 268 (packet & (1 << 31)) tests for end of sequence 269 (packet & (1 << 30)) tests for 5-bit packing vs. 2-bit 270 ((packet >> shift) && 31) decodes 5-bit, for shift=25..0 in steps of 5 271 ((packet >> shift) && 3) decodes 2-bit, for shift=28..0 in steps of 2 272``` 273 274Packets without the sentinel bit set are full. They unpack to 15 or 6 275residues. 276 2775-bit EOD packets may be partial: they unpack to 0..6 residues. The 278remaining residue codes are set to 0x1f (11111), indicating EOD within 279the packet. The only case in which a partial EOD packet encodes 0 280residues is a zero-length sequence: there has to be at least one EOD 281packet. 282 2832-bit EOD packets must be full, because there is no way to signal EOD 284locally within a 2-bit packet. It can't use 0x03 (11), because that 285encodes U/T. Generally, therefore, the last packet(s) of a nucleic 286acid sequence must be 5-bit encoded, solely to be able to use sentinel 287residues in a partial packet, unless the end happens to come flush at 288the end of a 2-bit packet. (If we ever needed to pack an alphabet of 2 289or 3 residues, we could use 0x03 as a sentinel. This seems unlikely 290to ever happen, so I'm simply not going to include any code to read 291EOD 2-bit partial packets.) 292 293A protein sequence of length L packs into exactly P $= \mathrm{max}(1, 294(L+5)/6)$ 5-bit packets. A DNA sequence packs into P $\leq \mathrm{max}(1, 295(L+14)/15)$ mixed 2- and 5-bit packets. P $\geq 1$ because even a 296zero-length sequence ($L=0$) requires an EOD packet. 297 298A packed sequence consists of an integer number of packets, P, ending 299with an EOD packet. 300 301A packed amino acid sequence unpacks to $\leq$ 6P residues. All its 302packets are 5-bit encoded. 303 304A packed nucleic acid sequence unpacks to $\leq$ 15P residues. The 305packets are a mix of 2-bit and 5-bit. Degenerate residues must be 3065-bit packed, and the EOD packet usually is too. A 5-bit packet does 307not have to contain degenerate residues, because it might have been 308necessary to get "in frame" to pack a downstream degenerate 309residue. For example, the sequence ACGTACGTNNA... must be packed as 310[ACGTAC][CGTNNA]... to get the N's packed correctly. 311 312 313