1The \eslmod{dsqdata} module implements a binary sequence data 2format. It accelerates sequence data input in four ways, compared to 3Easel flatfile parsers (\eslmod{sqio}): 4 5\paragraph{Asynchronous input.} 6 Disk and CPU resources are used concurrently, using POSIX threads. 7 A ``loader'' thread does essentially nothing but read chunks of 8 data. An ``unpacker'' thread does CPU work to prepare loaded 9 sequence data chunks for consumption. If it takes time $R$ to read 10 and $P$ to process the data, instead of overall time $R+P$, with 11 asynchronous input we only need time $\mathrm{MAX}(R,P)$. 12 13\paragraph{Predigitization.} 14 Sequence data in the \eslmod{dsqdata} format are already encoded in 15 Easel digital sequence format. User-oriented error checking is done 16 up front when the \eslmod{dsqdata} file is created. 17 18\paragraph{Bit packing.} 19 Disk read time is rate-limiting in \eslmod{dsqdata}, so minimizing 20 data volume is critical. Sequence data are packed bitwise 21 in 32-bit packets to reduce volume by a factor of 1.5 (protein) to 22 3.75 (nucleic). A packet contains six 5-bit residues (protein or 23 degenerate nucleic) or fifteen 2-bit residues (canonical nucleic) and two 24 control bits. 25 26\paragraph{Separate metadata.} 27 Sequence data and metadata (name, accession, description, taxonomy 28 identifier) are stored separately in \ccode{.dsqs} and \ccode{.dsqm} 29 files. This streamlines unpacking, because these data are handled 30 differently. It also allows a deferred metadata read: sequences may 31 be identified simply by index number during an initial processing 32 sweep, and metadata can be loaded later by random access for a small 33 number of targets of interest. 34 35Table~\ref{tbl:dsqdata_api} lists the functions in the 36\eslmod{dsqdata} API. 37 38% API table is auto generated by the Makefile, 39% using autodoc -t esl_dsqdata.c 40% 41\input{apitables/esl_dsqdata_api} 42 43\subsection{Files in the \eslmodincmd{dsqdata} format} 44 45The format of a database \ccode{mydb} consists of four files: 46 47\vspace{0.5em} 48\begin{tabular}{lll} 49\ccode{mydb} & Stub & Human-readable information about the data \\ 50\ccode{mydb.dsqi} & Index & Disk offsets for each seq in metadata and sequence files\\ 51\ccode{mydb.dsqm} & Metadata & Name, accession, description, and taxonomy ids\\ 52\ccode{mydb.dsqs} & Sequence & Sequences (digitized, packed)\\ 53\end{tabular} 54\vspace{0.5em} 55 56The database is specified on command lines by the name of the stub 57file (\ccode{mydb}), without any suffix. For example, 58 59\begin{userchunk} 60 % myprogram mydb 61\end{userchunk} 62 63says to open \ccode{mydb}. The \ccode{esl\_dsqdata\_Open()} call then 64opens all four files. 65 66 67\subsection{Definition of the \eslmodincmd{dsqdata} file formats} 68 69\subsubsection{Stub file} 70 71An example stub file: 72 73\begin{cchunk} 74Easel dsqdata v1 x4019752601 75 76Original file: refprot.fa 77Original format: FASTA 78Type: amino 79Sequences: 11432138 80Residues: 4358716588 81\end{cchunk} 82 83The first line is parsed by the reader. Its text format matches 84\ccode{/Easel dsqdata v(\textbackslash d+) x(\textbackslash d+)/}. 85The first field is a version number for the format, $\geq 1$. It is 86currently unused, but in the future we might need it to parse 87different versions of the format, if we need to update it. The second 88field is a 32-bit unsigned integer tag in the range 89(0..$2^{32}-1$). Each of the four files carries the same randomly 90generated tag. The tag is used to make sure the four files belong 91together in the same database, as opposed to one or more of them being 92inadvertently clobbered somehow by the user. 93 94After the first line, the rest of the stub file is ignored, and can 95contain anything -- even your own notes, if you want to add any. 96 97\subsubsection{Index file: .dsqi} 98 99The header of the binary index file consists of: 100 101\vspace{0.5em} 102\begin{tabular}{lll} 103\textbf{name} & \textbf{type} & \textbf{description} \\ 104\ccode{magic} & \ccode{uint32\_t} & magic number (version, byte order)\\ 105\ccode{uniquetag} & \ccode{uint32\_t} & random integer tag (0..$2^{32}-1$)\\ 106\ccode{alphatype} & \ccode{uint32\_t} & alphabet type code (1,2,3 = RNA, DNA, amino)\\ 107\ccode{flags} & \ccode{uint32\_t} & Currently 0. Reserved for future flags\\ 108\ccode{max\_namelen} & \ccode{uint32\_t} & Maximum seq name length in metadata\\ 109\ccode{max\_acclen} & \ccode{uint32\_t} & Maximum accession length in metadata\\ 110\ccode{max\_desclen} & \ccode{uint32\_t} & Maximum description length in metadata\\ 111\ccode{max\_seqlen} & \ccode{uint64\_t} & Maximum sequence length\\ 112\ccode{nseq} & \ccode{uint64\_t} & Total number of sequences in database\\ 113\ccode{nres} & \ccode{uint64\_t} & Total number of residues in database\\ 114\end{tabular} 115\vspace{0.5em} 116 117The \textbf{magic} is used to check that the file is indeed a dsqdata 118format file, and to detect byte order swapping. Valid values for the 119magic version/byteorder number are: 120 121\vspace{0.5em} 122\begin{tabular} {lll} 123\textbf{value} & \textbf{derivation} & \textbf{description} \\ 124\ccode{0xc4d3d1b1} & ``dsq1'' + 0x80808080 & dsqdata version 1 format \\ 125\ccode{0xb1d1d3c4} & above, byteswapped & above, byteswapped \\ 126\end{tabular} 127\vspace{0.5em} 128 129The random integer \textbf{uniquetag} must match the tag seen in 130the other files. 131 132The dsqdata packet format is only defined for biological sequence alphabets. 133Valid values for the \textbf{alphatype} code come from a subset of the codes 134used in \ccode{esl\_alphabet.h}: 135\begin{tabular}{lll} 136 137\vspace{0.5em} 138\textbf{integer} & \emcode{esl\_alphabet.h} & \textbf{description} \\ 1391 & \ccode{eslRNA} & RNA \\ 1402 & \ccode{eslDNA} & DNA \\ 1413 & \ccode{eslAMINO} & protein \\ 142\end{tabular} 143\vspace{0.5em} 144 145The unused \textbf{flags} field gives us some flexibility for future 146versions of the format. 147 148The maximum lengths of the names, accessions, and descriptions in the 149metadata file might someday be useful (in making allocations, for 150example) but they are currently unused. 151 152Likewise, the maximum sequence length, total number of sequences, and 153total number of residues in the database may someday be useful (for 154making decisions about how to partition a parallel search, for 155example), but they are currently unused too. 156 157After the header, the remainder of the file consists of \ccode{nseq} 158records of type \ccode{ESL\_DSQDATA\_RECORD} (defined in 159\ccode{esl\_dsqdata.h}): 160 161\vspace{0.5em} 162\begin{tabular}{lll} 163\textbf{name} & \textbf{type} & \textbf{description} \\ 164\ccode{metadata\_end} & \ccode{int64\_t} & Position of terminal \ccode{\textbackslash 0} of metadata for seq i, in bytes\\ 165\ccode{psq\_end} & \ccode{int64\_t} & Position of final packet for sequence i, in packets\\ 166\end{tabular} 167\vspace{0.5em} 168 169Storing \emph{end} positions instead of \emph{start} positions allows 170us to determine lengths, without needing an $N+1$'th sentinel record, 171albeit at the cost of special casing what happens for the first 172sequence $i=0$. For example: 173 174\vspace{0.5em} 175\begin{tabular}{ll} 176Length: & \ccode{i == 0 ? r[i].end + 1 : r[i].end - r[i-1].end} \\ 177Start: & \ccode{i == 0 ? 0 : r[i-1].end + 1}\\ 178\end{tabular} 179\vspace{0.5em} 180 181This is equivalent to treating \ccode{r[-1].end = -1}. Some of the 182reader's code tracks a \ccode{last\_end} variable for the end of the 183previous metadata or packed sequence field $i-1$, which is initialized 184to -1. This -1 boundary condition is why we use signed 185\ccode{int64\_t} types. 186 187Packet sequence endpoints are stored in units of 32-bit 188\emph{packets}, not in bytes. To convert to a disk offset or a length 189in bytes you multiply by 4 (\ccode{sizeof(uint32\_t)}). 190 191Keeping the size of the dsqdata files as small as possible is critical 192because the reading speed is limited by the raw size of the 193data. Therefore we don't store separate positions for the different 194metadata fields (name/accession/description/taxonomy id), only one 195position for all the metadata associated with sequence $i$. The reader 196reads all of it in one chunk, and parses it for the stored \ccode{\textbackslash 0} 197sentinels. 198 199For the same reason, we don't store any information about 200\emph{unpacked} sequence lengths, only the bare minimum of information 201that the dsqdata loader and unpacker need to locate, load, and unpack 202the packed data for any given sequence $i$. The unpacker determines 203the unpacked sequence length when it unpacks the data. 204 205 206\subsubsection{Metadata file, .dsqm} 207 208The metadata file starts with two header fields, the same two that the 209index file starts with: 210 211\vspace{0.5em} 212\begin{tabular}{lll} 213\textbf{name} & \textbf{type} & \textbf{description} \\ 214\ccode{magic} & \ccode{uint32\_t} & magic number (version, byte order)\\ 215\ccode{uniquetag} & \ccode{uint32\_t} & random integer tag (0..$2^32-1$)\\ 216\end{tabular} 217\vspace{0.5em} 218 219After the header, the remainder of the file consists of the following 220data for each sequence $i =$ \ccode{0..nseq-1}: 221 222\vspace{0.5em} 223\begin{tabular}{lll} 224\textbf{field} & \textbf{type} & \textbf{description} \\ 225name & \ccode{char *}; ends in \ccode{\textbackslash 0} & Sequence name (1 word, no whitespace) \\ 226accession & \ccode{char *}; ends in \ccode{\textbackslash 0} & Sequence accession (1 word, no whitespace)\\ 227description & \ccode{char *}; ends in \ccode{\textbackslash 0} & Sequence description line \\ 228taxonomy id & \ccode{int32\_t} & NCBI taxonomy identifier; -1 if none\\ 229\end{tabular} 230\vspace{0.5em} 231 232The name, accession, and description are variable length strings. The 233name and accession are single ``words'' with no whitespace 234(\ccode{\textbackslash S+}). The description is one line, may contain spaces, but 235may not contain any newlines. All sequences must have a name, so 236\ccode{strlen(name) > 0}. The accession and description are optional; 237if they are not present, these are 0-length strings (\ccode{"\textbackslash 0"}) 238 239The taxonomy identifier is an integer in NCBI's taxonomy. Valid 240taxonomy identifiers are $\geq 1$\footnote{I cannot find any 241 documentation at NCBI on the maximum range of the taxid, nor can I 242 find a clear statement of whether 0 is valid or not. 0 is currently 243 unused in the NCBI taxonomy. 1 indicates the top level. That makes 244 it look like it's safe to treat 0 as ``unset'' but it seems even 245 safer to go with -1 and a signed integer. Unless NCBI ends up having 246 more than two billion species. Currently there are about 1.8 247 million.} 248 This field is optional; use a value of -1 to indicate unset. 249 250These names, types, and semantics match the corresponding fields in an 251\ccode{ESL\_SQ}. 252 253\subsubsection{Sequence file, .dsqs} 254 255The sequence file also starts with the same two header fields that the 256index and metadata files started with: 257 258\vspace{0.5em} 259\begin{tabular}{lll} 260\textbf{name} & \textbf{type} & \textbf{description} \\ 261\ccode{magic} & \ccode{uint32\_t} & magic number (version, byte order)\\ 262\ccode{uniquetag} & \ccode{uint32\_t} & random integer tag (0..$2^32-1$)\\ 263\end{tabular} 264\vspace{0.5em} 265 266After the header, the remainder of the file consists of the packed 267sequences, with one packet array for each sequence $i =$ 268\ccode{0..nseq-1}. Each packet array ends with a specially marked 269sentinel packet. The packet format is described next. 270 271\subsubsection{Packet format} 272 273Each packet is an unsigned 32 bit integer. The two leading (most 274significant) bits are control bits. Bit 31 signals EOD (end of data): 275the last packet in a packed sequence. Bit 30 signals the packet 276format: 1 for 5-bit, 0 for 2-bit. The remaining bits are the packed 277residue codes: 278 279\begin{asciiart} 280 [31] [30] [29..25] [24..20] [19..15] [14..10] [ 9..5 ] [ 4..0 ] 281 ^ ^ |------------ 6 5-bit packed residues ------------------| 282 | | [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] 283 | | |----------- or 15 2-bit packed residues ----------------| 284 | | 285 | "packtype" bit 30 = 0 if packet is 2-bit packed; 1 if 5-bit packed 286 "sentinel" bit 31 = 1 if last packet in packed sequence; else 0 287 288 (packet & (1 << 31)) tests for end of sequence 289 (packet & (1 << 30)) tests for 5-bit packing vs. 2-bit 290 ((packet >> shift) && 31) decodes 5-bit, for shift=25..0 in steps of 5 291 ((packet >> shift) && 3) decodes 2-bit, for shift=28..0 in steps of 2 292\end{asciiart} 293 294Packets without the sentinel bit set are full. They unpack to 15 or 6 295residues. 296 2975-bit EOD packets may be partial: they unpack to 0..6 residues. The 298remaining residue codes are set to 0x1f (11111), indicating EOD within 299the packet. The only case in which a partial EOD packet encodes 0 300residues is a zero-length sequence: there has to be at least one EOD 301packet. 302 3032-bit EOD packets must be full, because there is no way to signal EOD 304locally within a 2-bit packet. It can't use 0x03 (11), because that 305encodes U/T. Generally, therefore, the last packet(s) of a nucleic 306acid sequence must be 5-bit encoded, solely to be able to use sentinel 307residues in a partial packet, unless the end happens to come flush at 308the end of a 2-bit packet.\footnote{If we ever needed to pack an 309 alphabet of 2 or 3 residues, we could use 0x03 as a sentinel. This 310 seems unlikely to ever happen, so I'm simply not going to include 311 any code to read EOD 2-bit partial packets.} 312 313A protein sequence of length $L$ packs into exactly P $= MAX(1, 314(L+5)/6)$ 5-bit packets. A DNA sequence packs into P $\leq MAX(1, 315(L+14)/15)$ mixed 2- and 5-bit packets. P $\geq 1$ because even a 316zero-length sequence ($L=0$) requires an EOD packet. 317 318A packed sequence consists of an integer number of packets, P, ending 319with an EOD packet. 320 321A packed amino acid sequence unpacks to $\leq$ 6P residues. All its 322packets are 5-bit encoded. 323 324A packed nucleic acid sequence unpacks to $\leq$ 15P residues. The 325packets are a mix of 2-bit and 5-bit. Degenerate residues must be 3265-bit packed, and the EOD packet usually is too. A 5-bit packet does 327not have to contain degenerate residues, because it might have been 328necessary to get ``in frame'' to pack a downstream degenerate 329residue. For example, the sequence ACGTACGTNNA... must be packed as 330[ACGTAC][CGTNNA]... to get the N's packed correctly. 331 332 333