\Easel\ is a C code library for computational analysis of biological
sequences using probabilistic models. \Easel\ is used by \HMMER\
\citep{hmmer,Eddy98}, the profile hidden Markov model software that
underlies the \Pfam\ protein family database
\citep{Finn06,Sonnhammer97} and several other protein family
databases. \Easel\ is also used by \Infernal\
\citep{infernal,NawrockiEddy07}, the covariance model software that
underlies the \Rfam\ RNA family database
\citep{Griffiths-Jones05}.

There are other biosequence analysis libraries out there, in a variety
of languages
\citep{Vahrson96,Pitt01,Mangalam02,Butt05,Dutheil06,Giancarlo07,Doring08};
but this is ours. \Easel\ is not meant to be comprehensive. \Easel\
is for supporting what's needed in our group's work on probabilistic
modeling of biological sequences, in applications like \HMMER\ and
\Infernal. It includes code for generative probabilistic models of
sequences, phylogenetic models of evolution, bioinformatics tools for
sequence manipulation and annotation, numerical computing, and some
basic utilities.

\Easel\ is written in ANSI/ISO C because its primary goals are high
performance and portability. Additionally, \Easel\ aims to provide an
ease of use reasonably close to Perl or Python code.

\Easel\ is designed to be reused, but not only as a black box. I might
use a black box library for routine functions that are tangential to
my research, but for anything research-critical, I want to understand
and control the source code. It's rational to treat reusing other
people's code like using their toothbrush, because god only knows what
they've done to it. For me, code reuse more often means acting like a
magpie, studying and stealing shiny bits of other people's source
code, and weaving them into my own nest. \Easel\ is designed so you
can easily pull useful baubles from it.

\Easel\ is also designed to enable us to publish reproducible and
extensible research results as supplementary material for our research
papers. We put work into documenting \Easel\ as carefully as any other
research data we distribute.

These considerations are reflected in \Easel's design decisions.
\Easel's documentation includes tutorial examples to make it easy to
understand and get started using any given \Easel\ module, independent
of other parts of \Easel. \Easel\ is modular, in a way that should
enable you to extract individual files or functions for use in your
own code, with minimum disentanglement work. \Easel\ uses some
precepts of object-oriented design, but its objects are just C
structures with visible, documented contents. \Easel's source code is
consciously designed to be read as a reference work. It reflects, in a
modest way, principles of ``literate programming'' espoused by Donald
Knuth. \Easel\ code and documentation are interwoven. Most of this
book is automatically generated from \Easel's source code.



\section{Quick start}

Let's start with a quick tour. If you have any experience with the
variable quality of bioinformatics software, the first thing you want
to know is whether you can get \Easel\ compiled -- without having to
install a million dependencies first. The next thing you'll want to
know is whether \Easel\ is going to be useful to you or not. We'll
start with compiling it. You can compile \Easel\ and try it out without
permanently installing it.



\subsection{Downloading and compiling Easel for the first time}

\Easel\ is self-sufficient, with no dependencies other than what's
already on your system, provided you have an ANSI C99 compiler
installed.
You can obtain an \Easel\ source tarball and compile it
cleanly on any UNIX, Linux, or Mac OS X operating system with an
incantation like the following (where \ccode{xxx} will be the current
version number):

\begin{cchunk}
% wget http://eddylab.org/easel/easel.tar.gz
% tar zxf easel.tar.gz
% cd easel-xxx
% ./configure
% make
% make check
\end{cchunk}

The \ccode{make check} command is optional. It runs a battery of
quality control tests. All of these should pass. You should now see
\ccode{libeasel.a} in the directory. If you look in the directory
\ccode{miniapps}, you'll also see a bunch of small utility programs,
the \Easel\ ``miniapps''.

There are more complicated things you can do to customize the
\ccode{./configure} step for your needs. That includes customizing the
installation locations. If you decide you want to install
\Easel\ permanently, see the full installation instructions in
chapter~\ref{chapter:installation}.



\subsection{Cribbing from code examples}

Every source code module (that is, each \ccode{.c} file) ends with one
or more \esldef{driver programs}, including programs for unit tests
and benchmarks. These are \ccode{main()} functions that can be
conditionally included when the module is compiled. At the very end of
each module there is always at least one \esldef{example driver} that
shows you how to use the module. You can find the example code in a
module \eslmod{foo} by searching the \ccode{esl\_foo.c} file for the tag
\ccode{eslFOO\_EXAMPLE}, or just navigating to the end of the file. To
compile the example for module \eslmod{foo} as a working program, do:

\begin{cchunk}
  % cc -o example -L. -I. -DeslFOO_EXAMPLE esl_foo.c -leasel -lm
\end{cchunk}

You may need to replace the standard C compiler \ccode{cc} with a
different compiler name, depending on your system.
Linking to the
standard math library (\ccode{-lm}) may not be necessary, depending on
what module you're compiling, but it won't hurt. Replace \ccode{foo}
with the name of a module you want to play with, and you can compile
any of \Easel's example drivers this way.

To run it, read the source code (or the corresponding section in this
book) to see if it needs any command line arguments, like the name of
a file to open, then:

\begin{cchunk}
  % ./example <any args needed>
\end{cchunk}

You can edit the example driver to play around with it, if you like,
but it's better to make a copy of it in your own file (say,
\ccode{foo\_example.c}) so you're not changing \Easel's code. When you
extract the code into a file, copy what's between the \ccode{\#ifdef
eslFOO\_EXAMPLE} and \ccode{\#endif /*eslFOO\_EXAMPLE*/} flags that
conditionally include the example driver (don't copy the flags
themselves). Then compile your example code and link to \Easel\ like
this:

\begin{cchunk}
  % cc -o foo_example -L. -I. foo_example.c -leasel -lm
\end{cchunk}

\subsection{Cribbing from Easel miniapplications}

The \ccode{miniapps} directory contains \Easel's
\esldef{miniapplications}: several utility programs that \Easel\
installs, in addition to the library \ccode{libeasel.a} and its header
files.

The miniapplications are described in more detail later, but for the
purpose of getting used to how \Easel\ is used, they give you some
more useful examples of small \Easel-based applications that are a
little more complicated than individual module example drivers.

You can probably get a long way into \Easel\ just by browsing the
source code of the modules' examples and the miniapplications. If
you're the type (like me) who prefers to learn by example, you're
done; you can close this book now.
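
As a rough sketch of the example-driver convention described in this
section, the following self-contained file mimics the layout of a
module whose \ccode{main()} is conditionally compiled. It is not taken
from \Easel: the file name \ccode{foo.c} and the function
\ccode{foo\_gc\_fraction()} are invented for illustration, and the file
needs no \Easel\ headers or libraries.

```c
/* foo.c: a hypothetical stand-in for an Easel-style module.
 * Nothing here is real Easel code; foo_gc_fraction() is invented
 * purely to illustrate a conditionally compiled example driver.
 */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the fraction of G/C residues in <seq>, or 0.0 for an empty string. */
double
foo_gc_fraction(const char *seq)
{
  size_t n  = 0;   /* residues seen so far     */
  size_t gc = 0;   /* how many of them are G/C */

  assert(seq != NULL);
  for (; seq[n] != '\0'; n++)
    if (strchr("GCgc", seq[n]) != NULL) gc++;
  return (n > 0) ? (double) gc / (double) n : 0.0;
}

/* Example driver: only compiled when -DeslFOO_EXAMPLE is given,
 * so the file can also be compiled as a plain library module.
 */
#ifdef eslFOO_EXAMPLE
int
main(int argc, char **argv)
{
  if (argc != 2) { fprintf(stderr, "Usage: %s <seq>\n", argv[0]); return 1; }
  printf("GC fraction: %.3f\n", foo_gc_fraction(argv[1]));
  return 0;
}
#endif /*eslFOO_EXAMPLE*/
```

Compiling this file with \ccode{-DeslFOO\_EXAMPLE} on the command line
should include the driver and produce a runnable program; without the
flag, only the function is compiled, which is what lets one file serve
as both library code and worked example.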



\section{Overview of Easel's modules}

Possibly your next question is, does \Easel\ provide any functionality
you're interested in?

Each \ccode{.c} file in \Easel\ corresponds to one \Easel\
\esldef{module}. A module consists of a group of functions for some
task. For example, the \eslmod{sqio} module can automatically parse
many common unaligned sequence formats, and the \eslmod{msa} module
can parse many common multiple alignment formats.

There are modules concerned with manipulating biological sequences and
sequence files (including a full-fledged parser for Stockholm multiple
alignment format and all its complex and powerful annotation markup):

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{sq}       & Single biological sequences \\
\eslmod{msa}      & Multiple sequence alignments and i/o \\
\eslmod{alphabet} & Digitized biosequence alphabets \\
\eslmod{randomseq}& Sampling random sequences \\
\eslmod{sqio}     & Sequence file i/o \\
\eslmod{ssi}      & Indexing large sequence files for rapid random access \\
\end{tabular}
\end{center}

There are modules implementing common operations on multiple sequence
alignments (including many published sequence weighting algorithms,
and a memory-efficient single linkage sequence clustering algorithm):

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{msacluster} & Efficient single linkage clustering of aligned sequences by \% identity\\
\eslmod{msaweight}  & Sequence weighting algorithms \\
\end{tabular}
\end{center}

There are modules for probabilistic modeling of sequence residue
alignment scores (including routines for solving for the implicit
probabilistic basis of arbitrary score matrices):

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{scorematrix} & Pairwise residue alignment scoring systems\\
\eslmod{ratematrix}  & Standard continuous-time Markov models of residue
evolution\\
\eslmod{paml}        & Reading PAML data files (including rate matrices)\\
\end{tabular}
\end{center}

There is a module for sequence annotation:

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{wuss} & ASCII RNA secondary structure annotation strings\\
\end{tabular}
\end{center}

There are modules implementing some standard scientific numerical
computing concepts (including a free, fast implementation of conjugate
gradient optimization):

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{vectorops} & Vector operations\\
\eslmod{dmatrix}   & 2D matrix operations\\
\eslmod{minimizer} & Numerical optimization by conjugate gradient descent\\
\eslmod{rootfinder}& One-dimensional root finding (Newton/Raphson)\\
\end{tabular}
\end{center}

There are modules implementing phylogenetic trees and evolutionary
distance calculations:

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{tree}     & Manipulating phylogenetic trees\\
\eslmod{distance} & Pairwise evolutionary sequence distance calculations\\
\end{tabular}
\end{center}

There are a number of modules that implement routines for many common
probability distributions (including maximum likelihood fitting
routines):

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{stats}       & Basic routines and special statistics functions\\
\eslmod{histogram}   & Collecting and displaying histograms\\
\eslmod{dirichlet}   & Beta, Gamma, and Dirichlet distributions\\
\eslmod{exponential} & Exponential distributions\\
\eslmod{gamma}       & Gamma distributions\\
\eslmod{gev}         & Generalized extreme value distributions\\
\eslmod{gumbel}      & Gumbel (Type I extreme value) distributions\\
\eslmod{hyperexp}    & Hyperexponential distributions\\
\eslmod{mixdchlet}   & Mixture Dirichlet distributions and priors\\
\eslmod{mixgev}      & Mixture generalized extreme value distributions\\
\eslmod{normal}      & Normal (Gaussian) distributions\\
\eslmod{stretchexp}  & Stretched exponential distributions\\
\eslmod{weibull}     & Weibull distributions\\
\end{tabular}
\end{center}

There are several modules implementing some common utilities
(including a good portable random number generator and a powerful
command line parser):

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{cluster}    & Efficient single linkage clustering\\
\eslmod{fileparser} & Parsing simple token-based (tab/space-delimited) files\\
\eslmod{getopts}    & Parsing command line arguments and options\\
\eslmod{keyhash}    & Hash tables for emulating Perl associative arrays\\
\eslmod{random}     & Pseudorandom number generation and sampling\\
\eslmod{regexp}     & Regular expression matching\\
\eslmod{stack}      & Pushdown stacks for integers, chars, pointers\\
\eslmod{stopwatch}  & Timing parts of programs\\
\end{tabular}
\end{center}

There are some specialized modules in support of accelerated and/or parallel computing:

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{sse} & Routines for SSE (Streaming SIMD Extensions) vector computation support on Intel/AMD platforms\\
\eslmod{vmx} & Routines for Altivec/VMX vector computation support on PowerPC platforms\\
\eslmod{mpi} & Routines for MPI (message passing interface) support\\
\end{tabular}
\end{center}

\section{Navigating documentation and source code}

The quickest way to learn about what each module provides is to go to
the corresponding chapter in this document. Each chapter starts with a
brief introduction to what the module does, and highlights anything
that \Easel's implementation does that we think is particularly
useful, unique, or powerful. That's followed by a table describing
each function provided by the module, and at least one example code
listing of how the module can be used.
The chapter might then go into
more detail about the module's functionality, though many chapters do
not, because the functionality is straightforward or self-explanatory.
Finally, each chapter ends with detailed documentation on each
function.

\Easel's source code is designed to be read. Indeed, most of this
documentation is generated automatically from the source code itself
-- in particular, the table listing the available functions, the
example code snippets, and the documentation of the individual
functions.

Each module \ccode{.c} file starts with a table of contents to help
you navigate.\footnote{\Easel\ source files are designed as complete
free-standing documents, so they tend to be larger than most people's
\ccode{.c} files; the more usual practice in C programming is to have
a smaller number of functions per file.} The first section will often
define how to create one or more \esldef{objects} (C structures) that
the module uses. The next section will typically define the rest of
the module's exposed API. Following that are any private (internal)
functions used in the module. Last are the drivers, including
benchmarks, unit tests, and one or more examples.

Each function has a structured comment header that describes how it is
called and used, including what arguments it takes, what it returns,
and what error conditions it may raise. These structured comments are
extracted for inclusion in this document, so what you read here for
each function's documentation is identical to what is in the source
code.
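
To make the file layout described above more concrete, here is a
hypothetical skeleton that follows the same order: table of contents,
object, rest of the API, private functions, then an example driver. It
is only a sketch. The object \ccode{FOO\_TALLY}, its functions, the flag
\ccode{fooTALLY\_EXAMPLE}, and the field names used in the comment
headers (\ccode{Function}, \ccode{Purpose}, \ccode{Args},
\ccode{Returns}, \ccode{Throws}) are assumptions chosen for
illustration; the real source files are the authority on the
conventions \Easel\ actually uses.

```c
/* foo_tally.c: hypothetical module skeleton. Not Easel code; all names invented.
 *
 * Contents:
 *   1. The FOO_TALLY object.
 *   2. The rest of the API.
 *   3. Private functions.
 *   4. Example driver.
 */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* A plain C structure with visible, documented contents. */
typedef struct {
  int *count;   /* count[0..n-1]: number of observations in each bin */
  int  n;       /* number of bins                                    */
} FOO_TALLY;

static int is_valid_bin(const FOO_TALLY *t, int bin);  /* defined in section 3 */

/* 1. The FOO_TALLY object. */

/* Function:  foo_tally_Create()
 * Purpose:   Allocate a tally with <n> bins, all initialized to zero.
 * Args:      n - number of bins; must be > 0.
 * Returns:   pointer to the new object; caller frees it with foo_tally_Destroy().
 * Throws:    NULL if <n> is not positive or if allocation fails.
 */
FOO_TALLY *
foo_tally_Create(int n)
{
  FOO_TALLY *t;

  if (n <= 0) return NULL;
  if ((t = malloc(sizeof(FOO_TALLY))) == NULL) return NULL;
  if ((t->count = calloc((size_t) n, sizeof(int))) == NULL) { free(t); return NULL; }
  t->n = n;
  return t;
}

/* Function:  foo_tally_Destroy()
 * Purpose:   Free a tally. A NULL argument is allowed and does nothing.
 */
void
foo_tally_Destroy(FOO_TALLY *t)
{
  if (t != NULL) { free(t->count); free(t); }
}

/* 2. The rest of the API. */

/* Function:  foo_tally_Add()
 * Purpose:   Count one observation in bin <bin>.
 * Returns:   0 on success; -1 if <bin> is out of range.
 */
int
foo_tally_Add(FOO_TALLY *t, int bin)
{
  assert(t != NULL);
  if (! is_valid_bin(t, bin)) return -1;
  t->count[bin]++;
  return 0;
}

/* 3. Private functions. */

/* Return nonzero if <bin> is a legal index into t->count. */
static int
is_valid_bin(const FOO_TALLY *t, int bin)
{
  return (bin >= 0 && bin < t->n);
}

/* 4. Example driver: compiled only with -DfooTALLY_EXAMPLE. */
#ifdef fooTALLY_EXAMPLE
int
main(void)
{
  FOO_TALLY *t = foo_tally_Create(4);

  if (t == NULL) return 1;
  foo_tally_Add(t, 2);
  printf("bin 2 holds %d observation(s)\n", t->count[2]);
  foo_tally_Destroy(t);
  return 0;
}
#endif /*fooTALLY_EXAMPLE*/
```

Keeping each comment header directly above its function is what makes
it practical for a documentation tool to extract the header verbatim,
so the printed description and the source cannot drift apart.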