The NCBI BLAST databases are generated by the program \emph{formatdb}.
The three files needed by Easel are the index file, the sequence file,
and the header file. For protein databases these files end with the
extensions ``.pin'', ``.psq'' and ``.phr'' respectively. For DNA databases
the extensions are ``.nin'', ``.nsq'' and ``.nhr'' respectively. The index
file contains information about the database, such as the version number,
database type, and file offsets. The sequence file contains the residues
for each of the sequences. Finally, the header file contains the header
information for each of the sequences. This document describes the
structure of the NCBI version 4 database.

If these files cannot be found, an alias file, with extension ``.nal'' or
``.pal'', is processed. The alias file is used to specify multiple
volumes when databases are larger than 2 GB.


\subsection{Index File (*.pin, *.nin)}

The index file contains format information about the database. The
layout of the version 4 index file is below:

\bigskip
\begin{center}
\begin{tabular}{|l|l|p{3.5in}|} \hline
Version &
Int32 &
Version number. \\ \hline
Database type &
Int32 &
0 = DNA, 1 = protein. \\ \hline
Title length &
Int32 &
Length of the title string (\emph{T}). \\ \hline
Title &
Char[\emph{T}] &
Title string. \\ \hline
Timestamp length &
Int32 &
Length of the timestamp string (\emph{S}). \\ \hline
Timestamp &
Char[\emph{S}] &
Time of the database creation. The length of the timestamp \emph{S} is
increased to force 8 byte alignment of the remaining integer
fields. The timestamp is padded with NULs to achieve this alignment. \\ \hline
Number of sequences &
Int32 &
Number of sequences in the database (\emph{N}). \\ \hline
Residue count &
Int64 &
Total number of residues in the database. Note: Unlike the other integer
fields, this field is stored in little endian.
\\ \hline
Maximum sequence &
Int32 &
Length of the longest sequence in the database. \\ \hline
Header offset table &
Int32[\emph{N+1}] &
Offsets into the sequence's header file (*.phr, *.nhr). \\ \hline
Sequence offset table &
Int32[\emph{N+1}] &
Offsets into the sequence's residue file (*.psq, *.nsq). \\ \hline
Ambiguity offset table &
Int32[\emph{N+1}] &
Offsets into the sequence's residue file (*.nsq). Note: This table is only
present in DNA databases. If the sequence does not have any ambiguity
residues, then the offset points to the beginning of the next
sequence. \\ \hline
\end{tabular}
\end{center}
\bigskip

The integer fields
are stored in big endian format, except for the residue count, which is
stored in little endian. The two string fields, title and timestamp,
are each preceded by a 32-bit length field. The title string is not NUL
terminated. If the timestamp field does not end on an offset
that is a multiple of 8 bytes, NUL characters are appended to
the string to bring it to a multiple of 8 bytes. This forces all the
following integer fields to be aligned on a 4-byte boundary for
performance reasons. The length of the timestamp field includes the
NUL padding, if any.

The header offset table is a list of offsets to
the beginning of each sequence's header. These are offsets into
the header file (*.phr, *.nhr). The size of a header can be
calculated by subtracting its offset from the offset of the
next header.
The sequence offset table is a list of offsets to
the beginning of each sequence's residue data. These are offsets into
the sequence file (*.psq, *.nsq). The size of a sequence can be
calculated by subtracting its offset from the offset of the
next sequence.
Since one more offset is stored than the number
of sequences in the database, no special code is needed to calculate
the header size or sequence size for the last entry in the database.
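
As an illustration of the layout above, the following Python sketch
reads the fields in order from a byte buffer holding a complete index
file. The names \verb+parse_index+ and \verb+entry_sizes+ are
hypothetical and are not part of Easel or the NCBI toolkit. The sketch
assumes a well-formed version 4 file and does no error checking.

\begin{verbatim}
import struct

def parse_index(buf):
    """Parse the bytes of a version 4 index file into a dict."""
    pos = 0
    # Version, database type and title length: big-endian Int32.
    version, db_type, title_len = struct.unpack_from(">iii", buf, pos)
    pos += 12
    title = buf[pos:pos + title_len].decode("ascii", "replace")
    pos += title_len
    (stamp_len,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    # The stored length includes the NUL padding, so strip it afterwards.
    stamp = buf[pos:pos + stamp_len].rstrip(b"\0").decode("ascii", "replace")
    pos += stamp_len
    (num_seqs,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    # The residue count is the one little-endian field.
    (residues,) = struct.unpack_from("<q", buf, pos)
    pos += 8
    (max_len,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    # Each offset table holds N+1 big-endian Int32 values.
    fmt = ">%di" % (num_seqs + 1)
    tables = []
    for _ in range(3 if db_type == 0 else 2):  # DNA adds an ambiguity table
        tables.append(struct.unpack_from(fmt, buf, pos))
        pos += 4 * (num_seqs + 1)
    return {
        "version": version, "db_type": db_type, "title": title,
        "timestamp": stamp, "num_seqs": num_seqs,
        "residue_count": residues, "max_seq_len": max_len,
        "hdr_offsets": tables[0], "seq_offsets": tables[1],
        "amb_offsets": tables[2] if db_type == 0 else None,
    }

def entry_sizes(offsets):
    """Size of entry i is offsets[i+1] - offsets[i]; N+1 offsets give N sizes."""
    return [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]
\end{verbatim}

Because each table has \emph{N+1} entries, \verb+entry_sizes+ needs no
special case for the last sequence.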


\subsection{Protein Sequence File (*.psq)}

The sequence file contains the sequences, one after another. The
sequences are in a binary format separated by a NUL byte. Each
residue is encoded in eight bits.

\bigskip
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|} \hline
Amino acid & Value & Amino acid & Value & Amino acid & Value & Amino acid & Value \\ \hline
- & 0 & G & 7 & N & 13 & U & 24 \\ \hline
A & 1 & H & 8 & O & 26 & V & 19 \\ \hline
B & 2 & I & 9 & P & 14 & W & 20 \\ \hline
C & 3 & J & 27 & Q & 15 & X & 21 \\ \hline
D & 4 & K & 10 & R & 16 & Y & 22 \\ \hline
E & 5 & L & 11 & S & 17 & Z & 23 \\ \hline
F & 6 & M & 12 & T & 18 & * & 25 \\ \hline
\end{tabular}
\end{center}


\subsection{DNA Sequence File (*.nsq)}

The sequence file contains the sequences, one after another. The
sequences are in a binary format but, unlike the protein sequence
file, the sequences are not separated by a NUL byte. Each
sequence is first compressed using two bits per residue and is then
followed by an ambiguity correction table if
necessary. If the sequence does not have an ambiguity table,
the sequence's ambiguity offset points to the beginning of the
next sequence.

\subsubsection{Two-bit encoding}

The sequence is encoded first using two bits per nucleotide.

\bigskip
\begin{center}
\begin{tabular}{|c|c|c|} \hline
Nucleotide & Value & Binary \\ \hline
A & 0 & 00 \\ \hline
C & 1 & 01 \\ \hline
G & 2 & 10 \\ \hline
T or U & 3 & 11 \\ \hline
\end{tabular}
\end{center}
\bigskip

Any
ambiguous residues are replaced by an 'A', 'C', 'G' or 'T' in
the two-bit encoding. To calculate the number of residues
in the sequence, the least significant two bits in the
last byte of the sequence need to be examined.
These last two bits indicate how many residues, if any, are
encoded in the most significant bits of the last byte.
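
The decoding rule can be sketched in Python as follows. The function
name \verb+decode_2bit+ is hypothetical. The sketch assumes that the
input is exactly the two-bit data of one sequence (the bytes between its
sequence offset and its ambiguity offset) and that residues are packed
starting from the most significant bits of each byte, as the description
of the last byte suggests.

\begin{verbatim}
def decode_2bit(data):
    """Decode the two-bit encoded bytes of one DNA sequence to a string."""
    if not data:
        return ""
    alphabet = "ACGT"
    out = []
    # Every byte except the last holds four residues, high bits first.
    for byte in data[:-1]:
        for shift in (6, 4, 2, 0):
            out.append(alphabet[(byte >> shift) & 3])
    # The low two bits of the last byte count the residues (0 to 3)
    # stored in its high bits.
    last = data[-1]
    remainder = last & 3
    for k in range(remainder):
        out.append(alphabet[(last >> (6 - 2 * k)) & 3])
    return "".join(out)
\end{verbatim}

Under these assumptions the sequence length is
\verb+4 * (len(data) - 1) + (data[-1] & 3)+, so it can be computed
without decoding the residues.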


\subsubsection{Ambiguity Table}

To correct a sequence containing any degenerate residues, an
ambiguity table follows the two-bit encoded string.
The start of the ambiguity table is
pointed to by the ambiguity offset table in the index file,
``*.nin''. The first four bytes contain the number of 32-bit
words in the correction table. If the most significant bit
is set in the count, then two 32-bit words (one 64-bit entry) are used for
each correction.
The 64-bit entries are used for sequences with
more than 16 million residues. Each correction contains three
pieces of
information: the actual encoded nucleotide, how many nucleotides
in the sequence are replaced by that nucleotide, and finally
the offset into the sequence at which to apply the correction.

For 32-bit
entries, the four most significant bits encode the nucleotide.
The bit pattern reflects the nucleotides represented: each code is the
bitwise OR of its constituent bases, e.g. the value of 'H' (11) is equal
to 'A'~(1) or 'C'~(2) or 'T'~(8).

\bigskip
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|} \hline
Nucleotide & Value & Nucleotide & Value & Nucleotide & Value & Nucleotide & Value \\ \hline
- & 0 & G & 4 & T & 8 & K & 12 \\ \hline
A & 1 & R & 5 & W & 9 & D & 13 \\ \hline
C & 2 & S & 6 & Y & 10 & B & 14 \\ \hline
M & 3 & V & 7 & H & 11 & N & 15 \\ \hline
\end{tabular}
\end{center}
\bigskip

The next field is the repeat count, which is four bits wide.
One is added to the count, giving it the range of 1 -- 16.
The last 24 bits are the offset into the sequence where the
replacement starts. The first residue starts at offset zero,
the second at offset one, etc. With a 24-bit size, the offset
can only address sequences up to about 16 million ($2^{24}$) residues long.

To address larger sequences, 64-bit
entries are used. For 64-bit entries, the order of the fields stays the same,
but their sizes change. The nucleotide remains at four bits.
The repeat count is increased to 12 bits, giving it the range
of 1 -- 4096. The offset size is increased to 48 bits.


\subsection{Header File (*.phr, *.nhr)}

The header file contains the headers for each sequence, one after another.
The headers are in a binary encoded ASN.1 format. The length
of a header can be calculated by subtracting its offset from the
offset of the next header. The ASN.1 definition
for the headers can be found in the NCBI toolkit in the following
files: asn.all and fastadl.asn.

The parsing of the header can be done with a simple recursive
descent parser. The five basic types defined in the header are:

\begin{itemize}
\item Integer -- a variable length integer value.
\item VisibleString -- a variable length string.
\item Choice -- a union of one or more alternatives.
\item Sequence -- an ordered collection of one or more types.
\item SequenceOf -- an ordered collection of zero or more occurrences
of a given type.
\end{itemize}

\subsubsection{Integer}

The first byte of an encoded integer is a hex \verb+02+. The next byte
is the number of bytes used to encode the integer value. The
remaining bytes are the actual value. The value is encoded
most significant byte first.

\subsubsection{VisibleString}

The first byte of a visible string is a hex \verb+1A+.
The next byte
starts encoding the length of the string. If the most
significant bit is off, then the lower seven bits encode the
length of the string, i.e. the string has a length less than 128.
If the most significant bit is on, then
the lower seven bits give the number of bytes that hold the length of
the string; those bytes follow and encode the string length, most significant
byte first.
Following the length are the actual string characters.
The strings are not NUL terminated.
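
The two primitive encodings can be read with a few lines of Python. This
is a minimal sketch; the function names are hypothetical. Interpreting
the integer bytes as a two's-complement value follows standard BER
practice and is an assumption beyond the description above.

\begin{verbatim}
def read_integer(buf, pos):
    """Read an Integer at buf[pos]; return (value, next_pos)."""
    if buf[pos] != 0x02:
        raise ValueError("expected Integer tag 02")
    nbytes = buf[pos + 1]
    start = pos + 2
    # Most significant byte first; signed is an assumption (BER convention).
    value = int.from_bytes(buf[start:start + nbytes], "big", signed=True)
    return value, start + nbytes

def read_length(buf, pos):
    """Read a short or long form length; return (length, next_pos)."""
    first = buf[pos]
    pos += 1
    if first & 0x80 == 0:
        return first, pos            # short form: length < 128
    count = first & 0x7F             # long form: number of length bytes
    return int.from_bytes(buf[pos:pos + count], "big"), pos + count

def read_visible_string(buf, pos):
    """Read a VisibleString at buf[pos]; return (text, next_pos)."""
    if buf[pos] != 0x1A:
        raise ValueError("expected VisibleString tag 1A")
    length, pos = read_length(buf, pos + 1)
    # The characters follow directly; there is no NUL terminator.
    return buf[pos:pos + length].decode("ascii"), pos + length
\end{verbatim}

Each function returns the position of the next unread byte, so calls can
be chained while walking a header.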

\subsubsection{Choice}


The first byte indicates which alternative of the choice is selected. The
alternatives start with a hex value \verb+A0+ for the first item, \verb+A1+ for
the second, etc.
The selection is followed by a hex \verb+80+. Two NUL bytes follow
the choice.

\subsubsection{Sequence}

The first two bytes are a hex \verb+3080+. This header is
then followed by
the encoded sequence types. The first two bytes of each item indicate which type
of the sequence is encoded. This index starts with the hex value
\verb+A080+
for the first item, \verb+A180+ for the second, etc. It is then
followed by the
encoded item and finally two NUL bytes, \verb+0000+, to indicate the end
of that type. The next type in the sequence is then encoded. If an
item is optional and is not defined, then none of it is encoded,
including the index and NUL bytes. This is repeated until the entire
sequence has been encoded. Two NUL bytes then mark the end of the
sequence.

\subsubsection{SequenceOf}

The first two bytes are a hex \verb+3080+. Then the list of objects is
encoded. Two NUL bytes mark the end of the list.
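
Taken together, the five encodings can be walked by one small recursive
function. The Python sketch below is schema-agnostic: it does not
consult the ASN.1 definitions, so a Sequence and a SequenceOf both come
back as a list, and an indexed item or choice comes back as an
(index, value) pair. The name \verb+parse_value+ and the byte string in
the usage note are hypothetical illustrations, not taken from a real
database. Reading integers as signed values is an assumption based on
standard BER practice.

\begin{verbatim}
def parse_value(buf, pos=0):
    """Parse one encoded value at buf[pos]; return (value, next_pos)."""
    tag = buf[pos]
    if tag in (0x02, 0x1A):                  # Integer or VisibleString
        length = buf[pos + 1]
        pos += 2
        if tag == 0x1A and length & 0x80:    # long form string length
            count = length & 0x7F
            length = int.from_bytes(buf[pos:pos + count], "big")
            pos += count
        body = buf[pos:pos + length]
        if tag == 0x02:
            value = int.from_bytes(body, "big", signed=True)
        else:
            value = body.decode("ascii")
        return value, pos + length
    if buf[pos + 1] != 0x80:
        raise ValueError("expected 80 after tag at offset %d" % pos)
    if tag == 0x30:                          # Sequence or SequenceOf
        pos += 2
        items = []
        while buf[pos:pos + 2] != b"\x00\x00":
            item, pos = parse_value(buf, pos)
            items.append(item)
        return items, pos + 2                # skip the closing 0000
    if 0xA0 <= tag <= 0xBE:                  # Choice or Sequence item index
        inner, pos = parse_value(buf, pos + 2)
        if buf[pos:pos + 2] != b"\x00\x00":
            raise ValueError("missing 0000 after item at offset %d" % pos)
        return (tag - 0xA0, inner), pos + 2
    raise ValueError("unknown tag %02X at offset %d" % (tag, pos))
\end{verbatim}

For example, the made-up bytes
\verb+3080 A080 1A02 6769 0000 A180 020105 0000 0000+ parse to a list
holding item 0 with the string \verb+gi+ and item 1 with the integer 5.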