1
The NCBI BLAST databases are generated by the program \emph{formatdb}.
The three files needed by Easel are the index file, the sequence file,
and the header file.  For protein databases these files end with the extensions
``.pin'', ``.psq'' and ``.phr'' respectively.  For DNA databases the
extensions are ``.nin'', ``.nsq'' and ``.nhr'' respectively.  The index
file contains information about the database, i.e.\ version number,
database type, file offsets, etc.  The sequence file contains the residues
for each of the sequences.  Finally, the header file contains the header
information for each of the sequences.  This document describes the
structure of the NCBI version 4 database.
12
If these files cannot be found, an alias file, with the extension ``.nal'' or
``.pal'', is processed.  The alias file is used to specify multiple
volumes when databases are larger than 2 GB.
16
17
18\subsection{Index File (*.pin, *.nin)}
19
20The index file contains format information about the database.   The
21layout of the version 4 index file is below:
22
23\bigskip
24\begin{center}
25\begin{tabular}{|l|l|p{3.5in}|} \hline
26Version &
27Int32 &
28Version number. \\ \hline
29Database type &
30Int32 &
310-DNA 1-Protein.  \\ \hline
32Title length &
33Int32 &
34Length of the title string (\emph{T}).  \\ \hline
35Title &
36Char[\emph{T}] &
37Title string.  \\ \hline
38Timestamp length &
39Int32 &
40Length of the timestamp string (\emph{S}).  \\ \hline
41Timestamp &Char[\emph{S}] &
42Time of the database creation.  The length of the timestamp \emph{S} is
43increased to force 8 byte alignment of the remaining integer
44fields.  The timestamp is padded with NULs to achieve this alignment.   \\ \hline
45Number of sequences &
46Int32 &
Number of sequences in the database (\emph{N}).  \\ \hline
48Residue count &
49Int64 &
50Total number of residues in the database.  Note:  Unlike other integer
51fields, this field is stored in little endian.  \\ \hline
52Maximum sequence &
53Int32 &
Length of the longest sequence in the database.  \\ \hline
55Header offset table &
56Int32[\emph{N+1}] &
Offsets into the sequence's header file (*.phr, *.nhr). \\ \hline
58Sequence offset table &
59Int32[\emph{N+1}] &
60Offsets into the sequence's residue file (*.psq, *.nsq). \\ \hline
61Ambiguity offset table &
62Int32[\emph{N+1}] &
Offsets into the sequence's residue file (*.nsq).  Note: This table is only
present in DNA databases.  If a sequence does not have any ambiguity
residues, then its offset points to the beginning of the next
sequence.  \\ \hline
67\end{tabular}
68\end{center}
69\bigskip
70
The integer fields
are stored in big endian format, except for the residue count, which is
stored in little endian format.  The two string fields, title and timestamp,
are each preceded by a 32 bit length field.  The title string is not NUL
terminated.  If the timestamp field does not end on an offset
that is a multiple of 8 bytes, NUL characters are appended to
the string to bring it to a multiple of 8 bytes.  This forces all the
following integer fields to be aligned on a 4-byte boundary for
performance reasons.  The length of the timestamp field includes the
NUL padding, if any.

The header offset table is a list of offsets to
the beginning of each sequence's header.  These are offsets into
the header file (*.phr, *.nhr).  The size of a header can be
calculated by subtracting its offset from the offset of the next
header.
The sequence offset table is a list of offsets to
the beginning of each sequence's residue data.  These are offsets into
the sequence file (*.psq, *.nsq).  The size of a sequence can be
calculated by subtracting its offset from the offset of the next
sequence.
Since one more offset is stored than the number
of sequences in the database, no special code is needed to calculate
the header size or sequence size for the last entry in the database.
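
The layout above can be read with a short routine.  The following is
only a minimal sketch in Python, not Easel's implementation: the names
\verb+parse_index+ and \verb+record_sizes+ are hypothetical, the whole
index file is assumed to be already loaded into a bytes object, and the
Int32 fields are assumed to be signed.

```python
import struct

def parse_index(buf):
    """Parse a version 4 index file image held in `buf` (bytes)."""
    pos = 0

    def be32():
        # Read one big endian Int32 (signedness is an assumption).
        nonlocal pos
        (value,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        return value

    def text():
        # Read a string preceded by its 32 bit length field.
        nonlocal pos
        length = be32()
        raw = buf[pos:pos + length]
        pos += length
        return raw

    info = {"version": be32(), "db_type": be32()}
    info["title"] = text().decode("ascii", "replace")
    # The timestamp length includes the NUL padding, so strip it here.
    info["timestamp"] = text().rstrip(b"\0").decode("ascii", "replace")
    n = info["num_seqs"] = be32()
    # The residue count is the one little endian field.
    (info["residue_count"],) = struct.unpack_from("<q", buf, pos)
    pos += 8
    info["max_seq_len"] = be32()

    def table():
        # Read N+1 big endian offsets.
        nonlocal pos
        values = struct.unpack_from(">%di" % (n + 1), buf, pos)
        pos += 4 * (n + 1)
        return list(values)

    info["hdr_offsets"] = table()
    info["seq_offsets"] = table()
    if info["db_type"] == 0:          # ambiguity table: DNA databases only
        info["amb_offsets"] = table()
    return info

def record_sizes(offsets):
    """Size of each record: next offset minus current offset."""
    return [nxt - cur for cur, nxt in zip(offsets, offsets[1:])]
```

Because each table holds \emph{N+1} offsets, \verb+record_sizes+ returns
exactly \emph{N} sizes without treating the last entry specially.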
93
94
\subsection{Protein Sequence File (*.psq)}
96
97The sequence file contains the sequences, one after another.  The
98sequences are in a binary format separated by a NUL byte.  Each
99residue is encoded in eight bits.
100
101\bigskip
102\begin{center}
103\begin{tabular}{|c|c|c|c|c|c|c|c|} \hline
104Amino acid & Value & Amino acid & Value & Amino acid & Value & Amino acid & Value \\ \hline
105- & 0 & G &  7 & N & 13 & U & 24 \\ \hline
106A & 1 & H &  8 & O & 26 & V & 19 \\ \hline
107B & 2 & I &  9 & P & 14 & W & 20 \\ \hline
108C & 3 & J & 27 & Q & 15 & X & 21 \\ \hline
109D & 4 & K & 10 & R & 16 & Y & 22 \\ \hline
110E & 5 & L & 11 & S & 17 & Z & 23 \\ \hline
111F & 6 & M & 12 & T & 18 & * & 25 \\ \hline
112\end{tabular}
113\end{center}
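
As an illustration of the table above, the byte values can be mapped
back to residue letters with a lookup string.  This is a minimal Python
sketch; the names \verb+NCBI_STDAA+ and \verb+decode_protein+ are
hypothetical, and the caller is assumed to pass only the residue bytes
of one sequence, without the separating NUL byte.

```python
# Index = stored byte value, character = residue, as in the table above.
NCBI_STDAA = "-ABCDEFGHIKLMNPQRSTVWXYZU*OJ"

def decode_protein(raw):
    """Map the residue bytes of one sequence to a string of letters."""
    return "".join(NCBI_STDAA[value] for value in raw)
```

For example, the bytes 12, 1, 10 decode to the residues M, A, K.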
114
115
116\subsection{DNA Sequence File (*.nsq)}
117
The sequence file contains the sequences, one after another.  The
sequences are in a binary format but, unlike the protein sequence
file, the sequences are not separated by a NUL byte.  Each
sequence is first packed using two bits per residue and is then
followed by an ambiguity correction table, if
necessary.  If the sequence does not have an ambiguity table,
the sequence's ambiguity offset points to the beginning of the
next sequence.
126
127\subsubsection{Two-bit encoding}
128
129The sequence is encoded first using two bits per nucleotide.
130
131\bigskip
132\begin{center}
133\begin{tabular}{|c|c|c|} \hline
134Nucleotide & Value & Binary \\ \hline
135A & 0 & 00 \\ \hline
136C & 1 & 01 \\ \hline
137G & 2 & 10 \\ \hline
138T or U & 3 & 11 \\ \hline
139\end{tabular}
140\end{center}
141\bigskip
142
Any
ambiguous residues are replaced by an `A', `C', `G' or `T' in
the two bit encoding.  To calculate the number of residues
in the sequence, the least significant two bits in the
last byte of the sequence need to be examined.
These last two bits indicate how many residues, if any, are
encoded in the most significant bits of the last byte.
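
The unpacking can be sketched as follows in Python.  The name
\verb+decode_2bit+ is hypothetical, and the sketch assumes that every
byte, not only the last one, stores its first residue in the most
significant two bits.  The input is the slice of the sequence file from
a sequence's offset up to its ambiguity offset.

```python
def decode_2bit(raw):
    """Unpack the two bit residues of one sequence into a string."""
    if not raw:
        return ""
    letters = "ACGT"
    out = []
    for byte in raw[:-1]:            # every full byte holds four residues
        for shift in (6, 4, 2, 0):
            out.append(letters[(byte >> shift) & 3])
    last = raw[-1]
    remainder = last & 3             # low two bits: residues in the last byte
    for shift in (6, 4, 2)[:remainder]:
        out.append(letters[(last >> shift) & 3])
    return "".join(out)
```

Under these assumptions the sequence length is four times the number of
full bytes plus the remainder stored in the last byte.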
150
151
152\subsubsection{Ambiguity Table}
153
To correct a sequence containing any degenerate residues, an
ambiguity table follows the two bit encoded string.
The start of the ambiguity table is
given by the ambiguity offset table in the index file,
``*.nin''.  The first four bytes contain the number of 32 bit
words in the correction table.  If the most significant bit
is set in the count, then two 32 bit words are used for
each correction.
The 64 bit entries are used for sequences with
more than 16 million residues.  Each correction contains three
pieces of
information: the actual encoded nucleotide, how many nucleotides
in the sequence are replaced by the correct nucleotide, and finally
the offset into the sequence at which to apply the correction.
168
For 32 bit
entries, the four most significant bits encode the nucleotide.
The bit pattern reflects the bases each code represents, with one bit
each for `A' (1), `C' (2), `G' (4) and `T' (8); e.g.\ the value of `H' is
equal to (`A'~or~`C'~or~`T'), which is 11.
174
175\bigskip
176\begin{center}
177\begin{tabular}{|c|c|c|c|c|c|c|c|} \hline
178Nucleotide & Value & Nucleotide & Value & Nucleotide & Value & Nucleotide & Value \\ \hline
179- & 0 & G & 4 & T & 8  & K & 12 \\ \hline
180A & 1 & R & 5 & W & 9  & D & 13 \\ \hline
181C & 2 & S & 6 & Y & 10 & B & 14 \\ \hline
182M & 3 & V & 7 & H & 11 & N & 15 \\ \hline
183\end{tabular}
184\end{center}
185\bigskip
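
The bit pattern in the table above can be checked with a small Python
sketch.  The names \verb+AMBIG_CODES+ and \verb+bases_for+ are
hypothetical and are used only for illustration.

```python
# Index = four bit value, character = nucleotide code, as in the table above.
AMBIG_CODES = "-ACMGRSVTWYHKDBN"

def bases_for(code):
    """Return the set of plain bases that a four bit value stands for."""
    bits = ((1, "A"), (2, "C"), (4, "G"), (8, "T"))
    return {base for bit, base in bits if code & bit}
```

For instance, `H' has the value 11, which is 8 + 2 + 1, so it stands
for `T', `C' or `A'.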
186
The next field is the repeat count, which is four bits wide.
One is added to the count, giving it the range of 1 -- 16.
The last 24 bits are the offset into the sequence where the
replacement starts.  The first residue starts at offset zero,
the second at offset one, etc.  With a 24 bit size, the offset
can only address sequences up to about 16 million residues long.
193
To address larger sequences, 64 bit
entries are used.  For 64 bit entries, the order of the fields stays the same,
but their sizes change.  The nucleotide remains at four bits.
197The repeat count is increased to 12 bits giving it the range
198of 1 -- 4096.  The offset size is increased to 48 bits.
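
The two entry layouts can be sketched in Python as follows.  The names
\verb+parse_ambiguity_table+ and \verb+apply_corrections+ are
hypothetical.  The sketch assumes that the table is stored big endian
like the index file integers, and it takes the count literally as a
number of 32 bit words, so a table of 64 bit entries holds half that
many corrections.

```python
import struct

def parse_ambiguity_table(raw):
    """Return a list of (code, run_length, offset), one per correction."""
    (count,) = struct.unpack_from(">I", raw, 0)
    wide = bool(count & 0x80000000)      # top bit selects 64 bit entries
    words = count & 0x7FFFFFFF           # number of 32 bit words that follow
    entries = []
    pos = 4
    if wide:
        for _ in range(words // 2):      # two words per correction
            (e,) = struct.unpack_from(">Q", raw, pos)
            pos += 8
            # 4 bit code, 12 bit repeat count (plus one), 48 bit offset
            entries.append((e >> 60, ((e >> 48) & 0xFFF) + 1,
                            e & 0xFFFFFFFFFFFF))
    else:
        for _ in range(words):
            (e,) = struct.unpack_from(">I", raw, pos)
            pos += 4
            # 4 bit code, 4 bit repeat count (plus one), 24 bit offset
            entries.append((e >> 28, ((e >> 24) & 0xF) + 1, e & 0xFFFFFF))
    return entries

def apply_corrections(seq, entries, codes="-ACMGRSVTWYHKDBN"):
    """Overwrite runs in a decoded sequence with their ambiguity codes."""
    residues = list(seq)
    for code, run, offset in entries:
        residues[offset:offset + run] = codes[code] * run
    return "".join(residues)
```

A single 32 bit entry with the value \verb+F2000005+ therefore replaces
three residues, starting at offset five, with `N'.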
199
200
201\subsection{Header File (*.phr, *.nhr)}
202
The header file contains the headers for each sequence, one after another.
The headers are in a binary encoded ASN.1 format.  The length
of a header can be calculated by subtracting the current header's offset
from the offset of the next header.  The ASN.1 definition
for the headers can be found in the NCBI toolkit in the following
files: asn.all and fastadl.asn.
209
210The parsing of the header can be done with a simple recursive
211descent parser.  The five basic types defined in the header are:
212
213\begin{itemize}
214\item Integer -- a variable length integer value.
215\item VisibleString -- a variable length string.
216\item Choice -- a union of one or more alternatives.
217\item Sequence -- an ordered collection of one or more types.
218\item SequenceOf -- an ordered collection of zero or more occurrences
219of a given type.
220\end{itemize}
221
222\subsubsection{Integer}
223
224The first byte of an encoded integer is a hex \verb+02+.  The next byte
225is the number of bytes used to encode the integer value.  The
226remaining bytes are the actual value.  The value is encoded
227most significant byte first.
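
A minimal Python sketch of this decoding follows.  The name
\verb+decode_integer+ is hypothetical, and the value is treated as a
signed two's complement number, as in standard BER; the text above does
not state the signedness, so that part is an assumption.

```python
def decode_integer(buf, pos=0):
    """Decode the integer at buf[pos]; return (value, next_position)."""
    if buf[pos] != 0x02:
        raise ValueError("expected integer tag 0x02")
    size = buf[pos + 1]                  # number of value bytes
    start = pos + 2
    value = int.from_bytes(buf[start:start + size], "big", signed=True)
    return value, start + size
```

The bytes \verb+02 02 01 00+ decode to the value 256 and leave the
position just past the fourth byte.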
228
229\subsubsection{VisibleString}
230
231The first byte of a visible string is a hex \verb+1A+.
232The next byte
233starts encoding the length of the string.  If the most
234significant bit is off, then the lower seven bits encode the
235length of the string, i.e. the string has a length less than 128.
If the most significant bit is on, then
the lower seven bits give the number of bytes that hold the length of
the string; those bytes follow, most significant
byte first.
240Following the length are the actual string characters.
241The strings are not NUL terminated.
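
The two length forms can be handled as in the following Python sketch.
The name \verb+decode_visible_string+ is hypothetical, and the
characters are assumed to be plain ASCII.

```python
def decode_visible_string(buf, pos=0):
    """Decode the string at buf[pos]; return (text, next_position)."""
    if buf[pos] != 0x1A:
        raise ValueError("expected VisibleString tag 0x1A")
    first = buf[pos + 1]
    pos += 2
    if first & 0x80:
        # Long form: the low seven bits count the length bytes that follow.
        nbytes = first & 0x7F
        length = int.from_bytes(buf[pos:pos + nbytes], "big")
        pos += nbytes
    else:
        # Short form: the byte itself is the length (less than 128).
        length = first
    text = buf[pos:pos + length].decode("ascii")
    return text, pos + length
```

A 200 character string, for example, is introduced by the bytes
\verb+1A 81 C8+, since 200 does not fit in the seven bit short form.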
242
243\subsubsection{Choice}
244
245
The first byte indicates which alternative of the choice was selected.  The alternatives
start with a hex value \verb+A0+ for the first item, \verb+A1+ for
the second, etc.
The selection byte is followed by a hex \verb+80+ and then the encoded item.  Two NUL bytes mark the end of
the choice.
251
252\subsubsection{Sequence}
253
The first two bytes are a hex \verb+3080+.  This header is
then followed by
the encoded items of the sequence.  The first two bytes of each item indicate which
field of the sequence is encoded.  This index starts with the hex value
\verb+A080+
for the first item, \verb+A180+ for the second, etc., and is then
followed by the
encoded item and finally two NUL bytes, \verb+0000+, to indicate the end
of that item.  The next item in the sequence is then encoded.  If an
item is optional and is not defined, then none of it is encoded,
including the index and NUL bytes.  This is repeated until the entire
sequence has been encoded.  Two NUL bytes then mark the end of the
sequence.
267
268\subsubsection{SequenceOf}
269
The first two bytes are a hex \verb+3080+.  Then the objects in the list are
encoded one after another.  Two NUL bytes mark the end of the list.
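
The encodings described in this section can be walked with a small
recursive descent routine.  The following Python sketch is generic and
does not interpret the NCBI header definitions: it only builds a tree of
tags.  The name \verb+parse_item+ is hypothetical.  It assumes that any
item whose second byte is a hex \verb+80+ is a container that ends at
the next pair of NUL bytes, and that every other item carries an
explicit length, as the integer and visible string encodings do.

```python
def parse_item(buf, pos=0):
    """Parse one encoded item at buf[pos]; return (node, next_position).

    A node is (tag, children) for a container, or (tag, value_bytes)
    for an item with an explicit length.
    """
    tag = buf[pos]
    length = buf[pos + 1]
    pos += 2
    if length == 0x80:
        # Container: read children until the two NUL bytes that close it.
        children = []
        while buf[pos:pos + 2] != b"\x00\x00":
            child, pos = parse_item(buf, pos)
            children.append(child)
        return (tag, children), pos + 2
    if length & 0x80:
        # Long form length: the low seven bits count the length bytes.
        nbytes = length & 0x7F
        length = int.from_bytes(buf[pos:pos + nbytes], "big")
        pos += nbytes
    return (tag, bytes(buf[pos:pos + length])), pos + length
```

Mapping the resulting tags onto named fields would still require the
ASN.1 definitions from the NCBI toolkit.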
272