\Easel\ is a C code library for computational analysis of biological
sequences using probabilistic models. \Easel\ is used by \HMMER\
\citep{hmmer,Eddy98}, the profile hidden Markov model software that
underlies the \Pfam\ protein family database
\citep{Finn06,Sonnhammer97} and several other protein family
databases. \Easel\ is also used by \Infernal\
\citep{infernal,NawrockiEddy07}, the covariance model software that
underlies the \Rfam\ RNA family database
\citep{Griffiths-Jones05}.

There are other biosequence analysis libraries out there, in a variety
of languages
\citep{Vahrson96,Pitt01,Mangalam02,Butt05,Dutheil06,Giancarlo07,Doring08};
but this is ours.  \Easel\ is not meant to be comprehensive.  \Easel\
is for supporting what's needed in our group's work on probabilistic
modeling of biological sequences, in applications like \HMMER\ and
\Infernal. It includes code for generative probabilistic models of
sequences, phylogenetic models of evolution, bioinformatics tools for
sequence manipulation and annotation, numerical computing, and some
basic utilities.

\Easel\ is written in ANSI/ISO C because its primary goals are high
performance and portability. Additionally, \Easel\ aims to provide an
ease of use reasonably close to Perl or Python code.

\Easel\ is designed to be reused, but not only as a black box. I might
use a black box library for routine functions that are tangential to
my research, but for anything research-critical, I want to understand
and control the source code.  It's rational to treat reusing other
people's code like using their toothbrush, because god only knows what
they've done to it. For me, code reuse more often means acting like a
magpie, studying and stealing shiny bits of other people's source
code, and weaving them into my own nest. \Easel\ is designed so you
can easily pull useful baubles from it.

\Easel\ is also designed to enable us to publish reproducible and
extensible research results as supplementary material for our research
papers. We put work into documenting \Easel\ as carefully as any other
research data we distribute.

These considerations are reflected in \Easel's design decisions.
\Easel's documentation includes tutorial examples to make it easy to
understand and get started using any given \Easel\ module, independent
of other parts of \Easel.  \Easel\ is modular, in a way that should
enable you to extract individual files or functions for use in your
own code, with minimum disentanglement work. \Easel\ uses some
precepts of object-oriented design, but its objects are just C
structures with visible, documented contents. \Easel's source code is
consciously designed to be read as a reference work. It reflects, in a
modest way, principles of ``literate programming'' espoused by Donald
Knuth. \Easel\ code and documentation are interwoven. Most of this
book is automatically generated from \Easel's source code.

\section{Quick start}

Let's start with a quick tour. If you have any experience with the
variable quality of bioinformatics software, the first thing you want
to know is whether you can get \Easel\ compiled -- without having to
install a million dependencies first. The next thing you'll want to
know is whether \Easel\ is going to be useful to you or not. We'll
start with compiling it. You can compile \Easel\ and try it out
without permanently installing it.

\subsection{Downloading and compiling Easel for the first time}

\Easel\ is self-sufficient, with no dependencies other than what's
already on your system, provided you have an ANSI/ISO C99 compiler
installed.  You can obtain an \Easel\ source tarball and compile it
cleanly on any UNIX, Linux, or Mac OS X operating system with an
incantation like the following (where \ccode{xxx} will be the current
version number):

\begin{cchunk}
% wget http://eddylab.org/easel/easel.tar.gz
% tar zxf easel.tar.gz
% cd easel-xxx
% ./configure
% make
% make check
\end{cchunk}

The \ccode{make check} command is optional. It runs a battery of
quality control tests. All of these should pass. You should now see
\ccode{libeasel.a} in the directory. If you look in the directory
\ccode{miniapps}, you'll also see a bunch of small utility programs,
the \Easel\ ``miniapps''.

There are more complicated things you can do to customize the
\ccode{./configure} step for your needs. That includes customizing the
installation locations. If you decide you want to install
\Easel\ permanently, see the full installation instructions in
chapter~\ref{chapter:installation}.

\subsection{Cribbing from code examples}

Every source code module (that is, each \ccode{.c} file) ends with one
or more \esldef{driver programs}, including programs for unit tests
and benchmarks. These are \ccode{main()} functions that can be
conditionally included when the module is compiled. Each module
always ends with at least one \esldef{example driver} that shows
you how to use the module. You can find the example code in a module
\eslmod{foo} by searching the \ccode{esl\_foo.c} file for the tag
\ccode{eslFOO\_EXAMPLE}, or just navigating to the end of the file. To
compile the example for module \eslmod{foo} as a working program, do:

\begin{cchunk}
   % cc -o example -L. -I. -DeslFOO_EXAMPLE esl_foo.c -leasel -lm
\end{cchunk}

You may need to replace the standard C compiler \ccode{cc} with a
different compiler name, depending on your system. Linking to the
standard math library (\ccode{-lm}) may not be necessary, depending on
what module you're compiling, but it won't hurt. Replace \ccode{foo}
with the name of a module you want to play with, and you can compile
any of \Easel's example drivers this way.

To run it, read the source code (or the corresponding section in this
book) to see if it needs any command line arguments, like the name of
a file to open, then:

\begin{cchunk}
   % ./example <any args needed>
\end{cchunk}

You can edit the example driver to play around with it, if you like,
but it's better to make a copy of it in your own file (say,
\ccode{foo\_example.c}) so you're not changing \Easel's code. When you
extract the code into a file, copy what's between the \ccode{\#ifdef
eslFOO\_EXAMPLE} and \ccode{\#endif /*eslFOO\_EXAMPLE*/} flags that
conditionally include the example driver (don't copy the flags
themselves). Then compile your example code and link to \Easel\ like
this:

\begin{cchunk}
   % cc -o foo_example -L. -I. foo_example.c -leasel -lm
\end{cchunk}

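To make the driver convention concrete, here is a minimal sketch of a
module file with a conditionally compiled example driver. The module
name \eslmod{foo} and the function \ccode{foo\_CountResidue()} are
hypothetical placeholders, not part of \Easel; only the
\ccode{\#ifdef eslFOO\_EXAMPLE} convention is taken from the
description above.

\begin{cchunk}
/* esl_foo.c -- hypothetical module, for illustration only (not Easel code) */
#include <assert.h>
#include <stdio.h>

/* Count how many times residue <c> occurs in the NUL-terminated string <seq>. */
int
foo_CountResidue(const char *seq, char c)
{
  int n = 0;

  assert(seq != NULL);
  for (; *seq != '\0'; seq++)
    if (*seq == c) n++;
  return n;
}

/* Example driver: only compiled when eslFOO_EXAMPLE is defined (-DeslFOO_EXAMPLE) */
#ifdef eslFOO_EXAMPLE
int
main(int argc, char **argv)
{
  const char *seq = (argc > 1) ? argv[1] : "GATTACA";

  printf("%s contains %d A residues\n", seq, foo_CountResidue(seq, 'A'));
  return 0;
}
#endif /*eslFOO_EXAMPLE*/
\end{cchunk}

Compiled without the flag, a file like this contributes only
\ccode{foo\_CountResidue()} to a library; compiled with
\ccode{-DeslFOO\_EXAMPLE}, the same file becomes a standalone program.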
\subsection{Cribbing from Easel miniapplications}

The \ccode{miniapps} directory contains \Easel's
\esldef{miniapplications}: several utility programs that \Easel\
installs, in addition to the library \ccode{libeasel.a} and its header
files.

The miniapplications are described in more detail later, but for the
purpose of getting used to how \Easel\ is used, they give you some
more useful examples of small \Easel-based applications that are a
little more complicated than individual module example drivers.

You can probably get a long way into \Easel\ just by browsing the
source code of the modules' examples and the miniapplications. If
you're the type (like me) that prefers to learn by example, you're
done; you can close this book now.

\section{Overview of Easel's modules}

Possibly your next question is, does \Easel\ provide any functionality
you're interested in?

Each \ccode{.c} file in \Easel\ corresponds to one \Easel\
\esldef{module}.  A module consists of a group of functions for some
task. For example, the \eslmod{sqio} module can automatically parse
many common unaligned sequence formats, and the \eslmod{msa} module
can parse many common multiple alignment formats.

There are modules concerned with manipulating biological sequences and
sequence files (including a full-fledged parser for Stockholm multiple
alignment format and all its complex and powerful annotation markup):

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{sq}       & Single biological sequences            \\
\eslmod{msa}      & Multiple sequence alignments and i/o   \\
\eslmod{alphabet} & Digitized biosequence alphabets        \\
\eslmod{randomseq}& Sampling random sequences              \\
\eslmod{sqio}     & Sequence file i/o                      \\
\eslmod{ssi}      & Indexing large sequence files for rapid random access \\
\end{tabular}
\end{center}

There are modules implementing common operations on multiple sequence
alignments (including many published sequence weighting algorithms,
and a memory-efficient single linkage sequence clustering algorithm):

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{msacluster} & Efficient single linkage clustering of aligned sequences by \% identity\\
\eslmod{msaweight}  & Sequence weighting algorithms \\
\end{tabular}
\end{center}

There are modules for probabilistic modeling of sequence residue
alignment scores (including routines for solving for the implicit
probabilistic basis of arbitrary score matrices):

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{scorematrix} & Pairwise residue alignment scoring systems\\
\eslmod{ratematrix}  & Standard continuous-time Markov models of residue evolution\\
\eslmod{paml}        & Reading PAML data files (including rate matrices)\\
\end{tabular}
\end{center}

There is a module for sequence annotation:

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{wuss} & ASCII RNA secondary structure annotation strings\\
\end{tabular}
\end{center}

There are modules implementing some standard scientific numerical
computing concepts (including a free, fast implementation of conjugate
gradient optimization):

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{vectorops} & Vector operations\\
\eslmod{dmatrix}   & 2D matrix operations\\
\eslmod{minimizer} & Numerical optimization by conjugate gradient descent\\
\eslmod{rootfinder}& One-dimensional root finding (Newton/Raphson)\\
\end{tabular}
\end{center}

There are modules implementing phylogenetic trees and evolutionary
distance calculations:

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{tree}     & Manipulating phylogenetic trees\\
\eslmod{distance} & Pairwise evolutionary sequence distance calculations\\
\end{tabular}
\end{center}

There are a number of modules that implement routines for many common
probability distributions (including maximum likelihood fitting
routines):

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{stats}       & Basic routines and special statistics functions\\
\eslmod{histogram}   & Collecting and displaying histograms\\
\eslmod{dirichlet}   & Beta, Gamma, and Dirichlet distributions\\
\eslmod{exponential} & Exponential distributions\\
\eslmod{gamma}       & Gamma distributions\\
\eslmod{gev}         & Generalized extreme value distributions\\
\eslmod{gumbel}      & Gumbel (Type I extreme value) distributions\\
\eslmod{hyperexp}    & Hyperexponential distributions\\
\eslmod{mixdchlet}   & Mixture Dirichlet distributions and priors\\
\eslmod{mixgev}      & Mixture generalized extreme value distributions\\
\eslmod{normal}      & Normal (Gaussian) distributions\\
\eslmod{stretchexp}  & Stretched exponential distributions\\
\eslmod{weibull}     & Weibull distributions\\
\end{tabular}
\end{center}

There are several modules implementing some common utilities
(including a good portable random number generator and a powerful
command line parser):

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{cluster}    & Efficient single linkage clustering\\
\eslmod{fileparser} & Parsing simple token-based (tab/space-delimited) files\\
\eslmod{getopts}    & Parsing command line arguments and options\\
\eslmod{keyhash}    & Hash tables for emulating Perl associative arrays\\
\eslmod{random}     & Pseudorandom number generation and sampling\\
\eslmod{regexp}     & Regular expression matching\\
\eslmod{stack}      & Pushdown stacks for integers, chars, pointers\\
\eslmod{stopwatch}  & Timing parts of programs\\
\end{tabular}
\end{center}

There are some specialized modules in support of accelerated and/or parallel computing:

\begin{center}
\begin{tabular}{p{1in}p{3.7in}}
\eslmod{sse}     & Routines for SSE (Streaming SIMD Extensions) vector computation support on Intel/AMD platforms\\
\eslmod{vmx}     & Routines for Altivec/VMX vector computation support on PowerPC platforms\\
\eslmod{mpi}     & Routines for MPI (Message Passing Interface) support\\
\end{tabular}
\end{center}

\section{Navigating documentation and source code}

The quickest way to learn about what each module provides is to go to
the corresponding chapter in this document. Each chapter starts with a
brief introduction of what the module does, and highlights anything
that \Easel's implementation does that we think is particularly
useful, unique, or powerful. That's followed by a table describing
each function provided by the module, and at least one example code
listing of how the module can be used. The chapter might then go into
more detail about the module's functionality, though many chapters do
not, because the functionality is straightforward or self-explanatory.
Finally, each chapter ends with detailed documentation on each
function.

\Easel's source code is designed to be read. Indeed, most of this
documentation is generated automatically from the source code itself
-- in particular, the table listing the available functions, the
example code snippets, and the documentation of the individual
functions.

Each module \ccode{.c} file starts with a table of contents to help
you navigate.\footnote{\Easel\ source files are designed as complete
free-standing documents, so they tend to be larger than most people's
\ccode{.c} files; the more usual practice in C programming is to have
a smaller number of functions per file.} The first section will often
define how to create one or more \esldef{objects} (C structures) that
the module uses. The next section will typically define the rest of
the module's exposed API. Following that are any private (internal)
functions used in the module. Last are the drivers, including
benchmarks, unit tests, and one or more examples.

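As a rough sketch of that ordering, a module file might be organized
along the following lines. Everything here (the \ccode{FOO\_STACK}
object, its functions, the \ccode{fooSTACK\_EXAMPLE} flag, and the
section titles) is a made-up illustration, not actual \Easel\ code.

\begin{cchunk}
/* foo_stack.c -- hypothetical module skeleton (not Easel code)
 *
 * Contents:
 *   1. The FOO_STACK object.
 *   2. The rest of the API.
 *   3. Private (internal) functions.
 *   4. Example driver.
 */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* 1. The object: a plain C structure with visible contents */
typedef struct {
  int *data;     /* stored values, data[0..n-1]      */
  int  n;        /* number of values currently held  */
  int  nalloc;   /* current allocated size of <data> */
} FOO_STACK;

static int grow(FOO_STACK *s);   /* private; defined in section 3 */

FOO_STACK *
foo_stack_Create(void)
{
  FOO_STACK *s = malloc(sizeof(FOO_STACK));
  if (s == NULL) return NULL;
  s->n      = 0;
  s->nalloc = 4;
  s->data   = malloc(sizeof(int) * s->nalloc);
  if (s->data == NULL) { free(s); return NULL; }
  return s;
}

void
foo_stack_Destroy(FOO_STACK *s)
{
  if (s == NULL) return;
  free(s->data);
  free(s);
}

/* 2. The rest of the API: each call returns 0 on success, nonzero on failure */
int
foo_stack_Push(FOO_STACK *s, int x)
{
  assert(s != NULL);
  if (s->n == s->nalloc && grow(s) != 0) return 1;
  s->data[s->n++] = x;
  return 0;
}

int
foo_stack_Pop(FOO_STACK *s, int *ret_x)
{
  assert(s != NULL && ret_x != NULL);
  if (s->n == 0) return 1;   /* stack is empty */
  *ret_x = s->data[--s->n];
  return 0;
}

/* 3. Private functions */
static int
grow(FOO_STACK *s)
{
  int *tmp = realloc(s->data, sizeof(int) * s->nalloc * 2);
  if (tmp == NULL) return 1;
  s->data    = tmp;
  s->nalloc *= 2;
  return 0;
}

/* 4. Example driver, compiled only with -DfooSTACK_EXAMPLE */
#ifdef fooSTACK_EXAMPLE
int
main(void)
{
  FOO_STACK *s = foo_stack_Create();
  int        x;

  if (s == NULL) return 1;
  foo_stack_Push(s, 42);
  if (foo_stack_Pop(s, &x) == 0) printf("popped %d\n", x);
  foo_stack_Destroy(s);
  return 0;
}
#endif /*fooSTACK_EXAMPLE*/
\end{cchunk}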

Each function has a structured comment header that describes how it is
called and used, including what arguments it takes, what it returns,
and what error conditions it may raise. These structured comments are
extracted for inclusion in this document, so what you read here for
each function's documentation is identical to what is in the source
code.

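For illustration, a structured comment header in this spirit might
look like the following. The function \ccode{foo\_Mean()}, its status
codes, and the exact field names are assumptions made for this sketch;
consult the \Easel\ source itself for the real conventions.

\begin{cchunk}
#include <assert.h>
#include <stddef.h>

enum { fooOK = 0, fooEINVAL = 1 };   /* hypothetical status codes */

/* Function:  foo_Mean()
 * Synopsis:  Calculate the arithmetic mean of a vector.
 *
 * Purpose:   Calculate the mean of the <n> values in <x> and
 *            return it in <*ret_mean>.
 *
 * Args:      x        - array of values, x[0..n-1]; must not be NULL
 *            n        - number of values in <x>
 *            ret_mean - RETURN: the mean; must not be NULL
 *
 * Returns:   <fooOK> on success.
 *
 * Throws:    <fooEINVAL> if <n> is not positive; <*ret_mean> is
 *            then left unchanged.
 */
int
foo_Mean(const double *x, int n, double *ret_mean)
{
  double sum = 0.0;
  int    i;

  assert(x != NULL && ret_mean != NULL);   /* caller contract */
  if (n <= 0) return fooEINVAL;
  for (i = 0; i < n; i++) sum += x[i];
  *ret_mean = sum / (double) n;
  return fooOK;
}
\end{cchunk}

Keeping the header directly above the function body means the
description and the code can be read, and updated, together.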