1The \eslmod{dsqdata} module implements a binary sequence data
2format. It accelerates sequence data input in four ways, compared to
3Easel flatfile parsers (\eslmod{sqio}):
4
5\paragraph{Asynchronous input.}
6   Disk and CPU resources are used concurrently, using POSIX threads.
7   A ``loader'' thread does essentially nothing but read chunks of
8   data.  An ``unpacker'' thread does CPU work to prepare loaded
9   sequence data chunks for consumption. If it takes time $R$ to read
10   and $P$ to process the data, instead of overall time $R+P$, with
11   asynchronous input we only need time $\mathrm{MAX}(R,P)$.
12
13\paragraph{Predigitization.}
14  Sequence data in the \eslmod{dsqdata} format are already encoded in
15  Easel digital sequence format.  User-oriented error checking is done
16  up front when the \eslmod{dsqdata} file is created.
17
18\paragraph{Bit packing.}
19  Disk read time is rate-limiting in \eslmod{dsqdata}, so minimizing
20  data volume is critical.  Sequence data are packed bitwise
21  in 32-bit packets to reduce volume by a factor of 1.5 (protein) to
22  3.75 (nucleic). A packet contains six 5-bit residues (protein or
23  degenerate nucleic) or fifteen 2-bit residues (canonical nucleic) and two
24  control bits.
25
26\paragraph{Separate metadata.}
27  Sequence data and metadata (name, accession, description, taxonomy
28  identifier) are stored separately in \ccode{.dsqs} and \ccode{.dsqm}
29  files. This streamlines unpacking, because these data are handled
30  differently.  It also allows a deferred metadata read: sequences may
31  be identified simply by index number during an initial processing
32  sweep, and metadata can be loaded later by random access for a small
33  number of targets of interest.
34
35Table~\ref{tbl:dsqdata_api} lists the functions in the
36\eslmod{dsqdata} API.
37
38% API table is auto generated by the Makefile,
39% using autodoc -t esl_dsqdata.c
40%
41\input{apitables/esl_dsqdata_api}
42
43\subsection{Files in the \eslmodincmd{dsqdata} format}
44
45The format of a database \ccode{mydb} consists of four files:
46
47\vspace{0.5em}
48\begin{tabular}{lll}
49\ccode{mydb}      & Stub      & Human-readable information about the data \\
50\ccode{mydb.dsqi} & Index     & Disk offsets for each seq in metadata and sequence files\\
51\ccode{mydb.dsqm} & Metadata  & Name, accession, description, and taxonomy ids\\
52\ccode{mydb.dsqs} & Sequence  & Sequences (digitized, packed)\\
53\end{tabular}
54\vspace{0.5em}
55
56The database is specified on command lines by the name of the stub
57file (\ccode{mydb}), without any suffix. For example,
58
59\begin{userchunk}
60   % myprogram mydb
61\end{userchunk}
62
63says to open \ccode{mydb}. The \ccode{esl\_dsqdata\_Open()} call then
64opens all four files.
65
66
67\subsection{Definition of the \eslmodincmd{dsqdata} file formats}
68
69\subsubsection{Stub file}
70
71An example stub file:
72
73\begin{cchunk}
74Easel dsqdata v1 x4019752601
75
76Original file:   refprot.fa
77Original format: FASTA
78Type:            amino
79Sequences:       11432138
80Residues:        4358716588
81\end{cchunk}
82
83The first line is parsed by the reader. Its text format matches
84\ccode{/Easel dsqdata v(\textbackslash d+) x(\textbackslash d+)/}.
85The first field is a version number for the format, $\geq 1$. It is
86currently unused, but in the future we might need it to parse
87different versions of the format, if we need to update it. The second
88field is a 32-bit unsigned integer tag in the range
89(0..$2^{32}-1$). Each of the four files carries the same randomly
90generated tag. The tag is used to make sure the four files belong
91together in the same database, as opposed to one or more of them being
92inadvertently clobbered somehow by the user.
93
94After the first line, the rest of the stub file is ignored, and can
95contain anything -- even your own notes, if you want to add any.
96
97\subsubsection{Index file: .dsqi}
98
99The header of the binary index file consists of:
100
101\vspace{0.5em}
102\begin{tabular}{lll}
103\textbf{name} & \textbf{type}    & \textbf{description} \\
104\ccode{magic}         & \ccode{uint32\_t} & magic number (version, byte order)\\
105\ccode{uniquetag}     & \ccode{uint32\_t} & random integer tag (0..$2^{32}-1$)\\
106\ccode{alphatype}     & \ccode{uint32\_t} & alphabet type code (1,2,3 = RNA, DNA, amino)\\
107\ccode{flags}         & \ccode{uint32\_t} & Currently 0. Reserved for future flags\\
108\ccode{max\_namelen}  & \ccode{uint32\_t} & Maximum seq name length in metadata\\
109\ccode{max\_acclen}   & \ccode{uint32\_t} & Maximum accession length in metadata\\
110\ccode{max\_desclen}  & \ccode{uint32\_t} & Maximum description length in metadata\\
111\ccode{max\_seqlen}   & \ccode{uint64\_t} & Maximum sequence length\\
112\ccode{nseq}          & \ccode{uint64\_t} & Total number of sequences in database\\
113\ccode{nres}          & \ccode{uint64\_t} & Total number of residues in database\\
114\end{tabular}
115\vspace{0.5em}
116
117The \textbf{magic} is used to check that the file is indeed a dsqdata
118format file, and to detect byte order swapping. Valid values for the
119magic version/byteorder number are:
120
121\vspace{0.5em}
122\begin{tabular} {lll}
123\textbf{value} & \textbf{derivation}   & \textbf{description}     \\
124\ccode{0xc4d3d1b1}    & ``dsq1'' + 0x80808080 & dsqdata version 1 format \\
125\ccode{0xb1d1d3c4}    & above, byteswapped    & above, byteswapped       \\
126\end{tabular}
127\vspace{0.5em}
128
129The random integer \textbf{uniquetag} must match the tag seen in
130the other files.
131
132The dsqdata packet format is only defined for biological sequence alphabets.
133Valid values for the \textbf{alphatype} code come from a subset of the codes
134used in \ccode{esl\_alphabet.h}:
135\begin{tabular}{lll}
136
137\vspace{0.5em}
138\textbf{integer} & \emcode{esl\_alphabet.h} & \textbf{description} \\
1391 & \ccode{eslRNA}   & RNA        \\
1402 & \ccode{eslDNA}   & DNA        \\
1413 & \ccode{eslAMINO} & protein    \\
142\end{tabular}
143\vspace{0.5em}
144
145The unused \textbf{flags} field gives us some flexibility for future
146versions of the format.
147
148The maximum lengths of the names, accessions, and descriptions in the
149metadata file might someday be useful (in making allocations, for
150example) but they are currently unused.
151
152Likewise, the maximum sequence length, total number of sequences, and
153total number of residues in the database may someday be useful (for
154making decisions about how to partition a parallel search, for
155example), but they are currently unused too.
156
157After the header, the remainder of the file consists of \ccode{nseq}
158records of type \ccode{ESL\_DSQDATA\_RECORD} (defined in
159\ccode{esl\_dsqdata.h}):
160
161\vspace{0.5em}
162\begin{tabular}{lll}
163\textbf{name}         & \textbf{type}    & \textbf{description} \\
164\ccode{metadata\_end} & \ccode{int64\_t}  & Position of terminal \ccode{\textbackslash 0} of metadata for seq i, in bytes\\
165\ccode{psq\_end}      & \ccode{int64\_t}  & Position of final packet for sequence i, in packets\\
166\end{tabular}
167\vspace{0.5em}
168
169Storing \emph{end} positions instead of \emph{start} positions allows
170us to determine lengths, without needing an $N+1$'th sentinel record,
171albeit at the cost of special casing what happens for the first
172sequence $i=0$. For example:
173
174\vspace{0.5em}
175\begin{tabular}{ll}
176Length: &  \ccode{i == 0 ? r[i].end + 1 : r[i].end - r[i-1].end} \\
177Start:  &  \ccode{i == 0 ? 0 : r[i-1].end + 1}\\
178\end{tabular}
179\vspace{0.5em}
180
181This is equivalent to treating \ccode{r[-1].end = -1}. Some of the
182reader's code tracks a \ccode{last\_end} variable for the end of the
183previous metadata or packed sequence field $i-1$, which is initialized
184to -1. This -1 boundary condition is why we use signed
185\ccode{int64\_t} types.
186
187Packet sequence endpoints are stored in units of 32-bit
188\emph{packets}, not in bytes. To convert to a disk offset or a length
189in bytes you multiply by 4 (\ccode{sizeof(uint32\_t)}).
190
191Keeping the size of the dsqdata files as small as possible is critical
192because the reading speed is limited by the raw size of the
193data. Therefore we don't store separate positions for the different
194metadata fields (name/accession/description/taxonomy id), only one
195position for all the metadata associated with sequence $i$. The reader
196reads all of it in one chunk, and parses it for the stored \ccode{\textbackslash 0}
197sentinels.
198
199For the same reason, we don't store any information about
200\emph{unpacked} sequence lengths, only the bare minimum of information
201that the dsqdata loader and unpacker need to locate, load, and unpack
202the packed data for any given sequence $i$. The unpacker determines
203the unpacked sequence length when it unpacks the data.
204
205
206\subsubsection{Metadata file, .dsqm}
207
208The metadata file starts with two header fields, the same two that the
209index file starts with:
210
211\vspace{0.5em}
212\begin{tabular}{lll}
213\textbf{name} & \textbf{type}    & \textbf{description} \\
214\ccode{magic}         & \ccode{uint32\_t} & magic number (version, byte order)\\
215\ccode{uniquetag}     & \ccode{uint32\_t} & random integer tag (0..$2^32-1$)\\
216\end{tabular}
217\vspace{0.5em}
218
219After the header, the remainder of the file consists of the following
220data for each sequence $i =$ \ccode{0..nseq-1}:
221
222\vspace{0.5em}
223\begin{tabular}{lll}
224\textbf{field} & \textbf{type} & \textbf{description} \\
225name         & \ccode{char *}; ends in \ccode{\textbackslash 0} & Sequence name (1 word, no whitespace) \\
226accession    & \ccode{char *}; ends in \ccode{\textbackslash 0} & Sequence accession (1 word, no whitespace)\\
227description  & \ccode{char *}; ends in \ccode{\textbackslash 0} & Sequence description line \\
228taxonomy id  & \ccode{int32\_t}                   & NCBI taxonomy identifier; -1 if none\\
229\end{tabular}
230\vspace{0.5em}
231
232The name, accession, and description are variable length strings. The
233name and accession are single ``words'' with no whitespace
234(\ccode{\textbackslash S+}). The description is one line, may contain spaces, but
235may not contain any newlines. All sequences must have a name, so
236\ccode{strlen(name) > 0}. The accession and description are optional;
237if they are not present, these are 0-length strings (\ccode{"\textbackslash 0"})
238
239The taxonomy identifier is an integer in NCBI's taxonomy. Valid
240taxonomy identifiers are $\geq 1$\footnote{I cannot find any
241  documentation at NCBI on the maximum range of the taxid, nor can I
242  find a clear statement of whether 0 is valid or not. 0 is currently
243  unused in the NCBI taxonomy.  1 indicates the top level. That makes
244  it look like it's safe to treat 0 as ``unset'' but it seems even
245  safer to go with -1 and a signed integer. Unless NCBI ends up having
246  more than two billion species. Currently there are about 1.8
247  million.}
248 This field is optional; use a value of -1 to indicate unset.
249
250These names, types, and semantics match the corresponding fields in an
251\ccode{ESL\_SQ}.
252
253\subsubsection{Sequence file, .dsqs}
254
255The sequence file also starts with the same two header fields that the
256index and metadata files started with:
257
258\vspace{0.5em}
259\begin{tabular}{lll}
260\textbf{name} & \textbf{type}    & \textbf{description} \\
261\ccode{magic}         & \ccode{uint32\_t} & magic number (version, byte order)\\
262\ccode{uniquetag}     & \ccode{uint32\_t} & random integer tag (0..$2^32-1$)\\
263\end{tabular}
264\vspace{0.5em}
265
266After the header, the remainder of the file consists of the packed
267sequences, with one packet array for each sequence $i =$
268\ccode{0..nseq-1}. Each packet array ends with a specially marked
269sentinel packet. The packet format is described next.
270
271\subsubsection{Packet format}
272
273Each packet is an unsigned 32 bit integer.  The two leading (most
274significant) bits are control bits. Bit 31 signals EOD (end of data):
275the last packet in a packed sequence. Bit 30 signals the packet
276format: 1 for 5-bit, 0 for 2-bit.  The remaining bits are the packed
277residue codes:
278
279\begin{asciiart}
280      [31] [30] [29..25]  [24..20]  [19..15]  [14..10]  [ 9..5 ]  [ 4..0 ]
281       ^    ^   |------------  6 5-bit packed residues ------------------|
282       |    |   []  []  []  []  []  []  []  []  []  []  []  []  []  []  []
283       |    |   |----------- or 15 2-bit packed residues ----------------|
284       |    |
285       |    "packtype" bit 30 = 0 if packet is 2-bit packed; 1 if 5-bit packed
286       "sentinel" bit 31 = 1 if last packet in packed sequence; else 0
287
288       (packet & (1 << 31)) tests for end of sequence
289       (packet & (1 << 30)) tests for 5-bit packing vs. 2-bit
290       ((packet >> shift) && 31) decodes 5-bit, for shift=25..0 in steps of 5
291       ((packet >> shift) && 3)  decodes 2-bit, for shift=28..0 in steps of 2
292\end{asciiart}
293
294Packets without the sentinel bit set are full. They unpack to 15 or 6
295residues.
296
2975-bit EOD packets may be partial: they unpack to 0..6 residues. The
298remaining residue codes are set to 0x1f (11111), indicating EOD within
299the packet. The only case in which a partial EOD packet encodes 0
300residues is a zero-length sequence: there has to be at least one EOD
301packet.
302
3032-bit EOD packets must be full, because there is no way to signal EOD
304locally within a 2-bit packet. It can't use 0x03 (11), because that
305encodes U/T. Generally, therefore, the last packet(s) of a nucleic
306acid sequence must be 5-bit encoded, solely to be able to use sentinel
307residues in a partial packet, unless the end happens to come flush at
308the end of a 2-bit packet.\footnote{If we ever needed to pack an
309  alphabet of 2 or 3 residues, we could use 0x03 as a sentinel.  This
310  seems unlikely to ever happen, so I'm simply not going to include
311  any code to read EOD 2-bit partial packets.}
312
313A protein sequence of length $L$ packs into exactly P $= MAX(1,
314(L+5)/6)$ 5-bit packets. A DNA sequence packs into P $\leq MAX(1,
315(L+14)/15)$ mixed 2- and 5-bit packets. P $\geq 1$ because even a
316zero-length sequence ($L=0$) requires an EOD packet.
317
318A packed sequence consists of an integer number of packets, P, ending
319with an EOD packet.
320
321A packed amino acid sequence unpacks to $\leq$ 6P residues. All its
322packets are 5-bit encoded.
323
324A packed nucleic acid sequence unpacks to $\leq$ 15P residues.  The
325packets are a mix of 2-bit and 5-bit. Degenerate residues must be
3265-bit packed, and the EOD packet usually is too. A 5-bit packet does
327not have to contain degenerate residues, because it might have been
328necessary to get ``in frame'' to pack a downstream degenerate
329residue. For example, the sequence ACGTACGTNNA... must be packed as
330[ACGTAC][CGTNNA]... to get the N's packed correctly.
331
332
333