.. Design:

.. include:: global.inc

.. index::
    pair: Design; Ruffus

###############################
Design & Architecture
###############################

    The *ruffus* module has the following design goals:

        * Simple
        * Intuitive
        * Lightweight
        * Unintrusive
        * Flexible/Powerful


    Computational pipelines, especially in science, are best thought of in terms of data
    flowing through successive, dependent stages (**ruffus** calls these :term:`task`\ s).
    Traditionally, files have been used to
    link pipelined stages together. This means that computational pipelines can be managed
    using traditional software construction (`build`) systems.

=================================================
`GNU Make`
=================================================
    The grand-daddy of these is UNIX `make <http://en.wikipedia.org/wiki/Make_(software)>`_.
    `GNU make <http://www.gnu.org/software/make/>`_ is ubiquitous in the Linux world for
    installing and compiling software.
    It has been widely used to build computational pipelines because it supports:

    * Stopping and restarting computational processes
    * Running multiple, even thousands of, jobs in parallel

.. _design.make_syntax_ugly:

******************************************************
Deficiencies of `make` / `gmake`
******************************************************

    However, make and `GNU make <http://www.gnu.org/software/make/>`_ use a much-criticised,
    specialised (domain-specific) language. The make language has poor support for modern
    programming language features such as variable scope, pattern matching and debugging.
    Make scripts require large amounts of often obscure shell scripting,
    and makefiles can quickly become unmaintainable.

.. _design.scons_and_rake:

=================================================
`Scons`, `Rake` and other `Make` alternatives
=================================================

    Many attempts have been made to produce a more modern version of make, with less of its
    historical baggage. These include the Java-based `Apache ant <http://ant.apache.org/>`_, whose build files are written in XML.

    More interesting is a new breed of build systems whose scripts are written in modern programming
    languages, rather than a specially invented "build" specification syntax.
    These include the Python `scons <http://www.scons.org/>`_, Ruby `rake <http://rake.rubyforge.org/>`_ and
    its Python port `Smithy <http://packages.python.org/Smithy/>`_.

    The great advantage is that computational pipelines do not need to be artificially divided
    between (the often second-class) workflow management code and the logic of the real calculations and work
    in the pipeline. It also means that workflow management can use all the standard language and library
    features, for example, to read directories and match file names using regular expressions.

    **Ruffus** is much like scons in that the modern dynamic programming language Python is used seamlessly
    throughout its pipeline scripts.

.. _design.implicit_dependencies:

**************************************************************************
Implicit dependencies: disadvantages of `make` / `scons` / `rake`
**************************************************************************

    Although Python `scons <http://www.scons.org/>`_ and Ruby `rake <http://rake.rubyforge.org/>`_
    are in many ways more powerful and easier to use for building software, they are still an
    imperfect fit to the world of computational pipelines.

    This is a result of the way dependencies are specified, an essential part of their design inherited
    from `GNU make <http://www.gnu.org/software/make/>`_.

    The order of operations in all of these tools is specified in a *declarative* rather than
    *imperative* manner. This means that the sequence of steps that a build should take is
    not spelled out explicitly and directly. Instead, recipes are provided for turning files
    of one type into another.

    So, for example, knowing that ``a->b``, ``b->c``, ``c->d``, the build
    system can infer how to get from ``a`` to ``d`` by performing the necessary operations in the correct order.

    This is immensely powerful for three reasons:

     #) The plumbing, such as dependency checking and passing output
        from one stage to another, is handled automatically by the build system. (This is the whole point!)
     #) The same *recipe* can be re-used at different points in the build.
     #) Intermediate files do not need to be retained.

        Given the automatic inference that ``a->b->c->d``,
        we don't need to keep ``b`` and ``c`` files around once ``d`` has been produced.

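    As a rough illustration of this inference, the following plain-Python sketch works backwards
    from a target through a table of recipes. It is a toy under stated assumptions, not how make,
    scons or rake are implemented; ``MADE_FROM`` and ``infer_steps`` are hypothetical names
    introduced only for this example.

    .. code-block:: python

        # Hypothetical recipe table, keyed by output type: "d" is made from "c", and so on.
        # The entries are deliberately listed out of order; the order is inferred, not stated.
        MADE_FROM = {"d": "c", "b": "a", "c": "b"}


        def infer_steps(source, target, made_from):
            """Work backwards from target to source, then return the steps in run order."""
            steps = []
            current = target
            while current != source:
                # Give up if no recipe produces this type, or if the recipes loop.
                if current not in made_from or len(steps) >= len(made_from):
                    raise ValueError(f"cannot get from {source!r} to {target!r}")
                previous = made_from[current]
                steps.append((previous, current))
                current = previous
            steps.reverse()
            return steps


        steps = infer_steps("a", "d", MADE_FROM)
        for before, after in steps:
            print(f"{before}->{after}")

    Note that the run order is recovered purely by matching each output to the recipe that
    produces it.
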


    The disadvantage is that, because stages are specified only indirectly in terms of
    file name matches, the flow through a complex build or a pipeline can be difficult to trace and nigh
    impossible to debug when there are problems.


.. _design.explicit_dependencies_in_ruffus:

**************************************************************************
Explicit dependencies in `Ruffus`
**************************************************************************

    **Ruffus** takes a different approach. The order of operations is specified explicitly rather than inferred
    indirectly from the input and output types. So, for example, we would explicitly specify three successive and
    linked operations ``a->b``, ``b->c``, ``c->d``. The build system knows that the operations always proceed in
    this order.

    Looking at a **Ruffus** script, it is always immediately clear which succession of computational steps
    will be taken.

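    The contrast can be sketched in plain Python. Here the three operations are listed in the
    order in which they run, so nothing has to be inferred. This is not the **Ruffus** API (which
    expresses the links between tasks with decorators); the function names and the ``PIPELINE``
    list are hypothetical stand-ins.

    .. code-block:: python

        def a_to_b(text):
            return text.replace("a", "b")


        def b_to_c(text):
            return text.replace("b", "c")


        def c_to_d(text):
            return text.replace("c", "d")


        # The order of operations is written down once, explicitly, and read top to bottom.
        PIPELINE = [a_to_b, b_to_c, c_to_d]


        def run(pipeline, value):
            """Call each stage in the stated order, feeding each result to the next stage."""
            for stage in pipeline:
                value = stage(value)
                print(f"{stage.__name__}: {value}")
            return value


        result = run(PIPELINE, "a")
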
    **Ruffus** values clarity over syntactic cleverness.

.. _design.static_dependencies:

**************************************************************************
Static dependencies: What `make` / `scons` / `rake` can't do (easily)
**************************************************************************

    `GNU make <http://www.gnu.org/software/make/>`_, `scons <http://www.scons.org/>`_ and `rake <http://rake.rubyforge.org/>`_
    work by inferring a static dependency graph (a directed acyclic graph) between all the files which
    are used by a computational pipeline. These tools locate the target that they are supposed
    to build and work backward through the dependency graph from that target,
    rebuilding anything that is out of date. This is perfect for building software,
    where the list of files can be computed **statically** at the beginning of the build.

    This is not an ideal match for scientific computational pipelines because:

        *  | Though the *stages* of a pipeline (e.g. `compile` or `DNA alignment`) are
             invariably well-specified in advance, the number of
             operations (*job*\s) involved at each stage may not be.
           |

        *  | A common approach is to break up large data sets into manageable chunks which
             can be operated on in parallel in computational clusters or farms
             (See `embarrassingly parallel problems <http://en.wikipedia.org/wiki/Embarrassingly_parallel>`_).
           | This means that the number of parallel operations or jobs varies with the data (the number of manageable chunks),
             and dependency trees cannot be calculated statically beforehand.
           |

    Computational pipelines require **dynamic** dependencies which are not calculated up-front, but
    at each stage of the pipeline.

    This is a *known* issue with traditional build systems, each of which has partial strategies to work around
    this problem:

        * gmake always builds the dependencies when first invoked, so dynamic dependencies require (complex!) recursive calls to gmake
        * `Rake dependencies unknown prior to running tasks <http://objectmix.com/ruby/759716-rake-dependencies-unknown-prior-running-tasks-2.html>`_.
        * `Scons: Using a Source Generator to Add Targets Dynamically <http://www.scons.org/wiki/DynamicSourceGenerator>`_


    **Ruffus** explicitly and straightforwardly handles tasks which produce an indeterminate (i.e. runtime dependent)
    number of outputs, using a **split** / **transform** / **merge** idiom.

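    The idiom can be sketched without any pipeline library at all. In the plain-Python example
    below, the number of chunk files, and therefore the number of *transform* jobs, is only known
    once *split* has run. The functions ``split``, ``transform`` and ``merge`` are hypothetical
    stand-ins written for this illustration and are not the **Ruffus** decorators of the same names.

    .. code-block:: python

        import os
        import tempfile


        def split(numbers, chunk_size, work_dir):
            """Write the input in chunks; the number of chunk files depends on the data."""
            chunk_files = []
            for start in range(0, len(numbers), chunk_size):
                path = os.path.join(work_dir, f"chunk_{start // chunk_size}.txt")
                with open(path, "w") as handle:
                    handle.write("\n".join(str(n) for n in numbers[start:start + chunk_size]))
                chunk_files.append(path)
            return chunk_files


        def transform(chunk_file):
            """Process one chunk independently of the others (one job per chunk)."""
            with open(chunk_file) as handle:
                total = sum(int(line) for line in handle if line.strip())
            result_file = os.path.splitext(chunk_file)[0] + ".sum"
            with open(result_file, "w") as handle:
                handle.write(str(total))
            return result_file


        def merge(result_files):
            """Combine the per-chunk results into a single answer."""
            total = 0
            for path in result_files:
                with open(path) as handle:
                    total += int(handle.read())
            return total


        with tempfile.TemporaryDirectory() as work_dir:
            chunks = split(list(range(1, 11)), chunk_size=4, work_dir=work_dir)
            results = [transform(chunk) for chunk in chunks]  # jobs discovered at run time
            grand_total = merge(results)
            job_count = len(chunks)

        print(job_count, grand_total)

    Changing the input data or ``chunk_size`` changes the number of jobs, which is exactly what a
    dependency graph computed before the run cannot anticipate.
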
=============================================================================
Managing pipelines stage-by-stage using **Ruffus**
=============================================================================
    **Ruffus** manages pipeline stages directly.

        #) | The computational operations for each stage of the pipeline are written by you, in
             separate Python functions.
           | (These correspond to `gmake pattern rules <http://www.gnu.org/software/make/manual/make.html#Pattern-Rules>`_)
           |

        #) | The dependencies between pipeline stages (Python functions) are specified up-front.
           | These can be displayed as a flow chart.

           .. image:: images/front_page_flowchart.png

        #) **Ruffus** makes sure pipeline stage functions are called in the right order,
           with the right parameters, running in parallel using multiprocessing if necessary.

        #) Checkpointing automatically determines if all or any parts
           of the pipeline are out-of-date and need to be rerun.

        #) Separate pipeline stages, and operations within each pipeline stage,
           can be run in parallel provided they are not inter-dependent.

    Another way of looking at this is that **ruffus** re-constructs data file dependencies dynamically
    on-the-fly when it gets to each stage of the pipeline, giving much more flexibility.

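    One common way to decide whether a stage is out of date, familiar from make, is to compare
    file modification times. The sketch below assumes that criterion purely for illustration; it
    is not taken from the **Ruffus** source, and ``needs_update`` is a hypothetical helper.

    .. code-block:: python

        import os
        import tempfile


        def needs_update(input_files, output_files):
            """Return True if any output is missing or older than the newest input."""
            if not all(os.path.exists(path) for path in output_files):
                return True
            if not input_files:
                return False  # nothing to compare against, and the outputs exist
            newest_input = max(os.path.getmtime(path) for path in input_files)
            oldest_output = min(os.path.getmtime(path) for path in output_files)
            return oldest_output < newest_input


        with tempfile.TemporaryDirectory() as work_dir:
            source = os.path.join(work_dir, "input.txt")
            result = os.path.join(work_dir, "output.txt")
            with open(source, "w") as handle:
                handle.write("data")

            missing_output = needs_update([source], [result])  # output not made yet

            with open(result, "w") as handle:
                handle.write("processed")
            os.utime(source, (1000, 1000))  # pretend the input is older ...
            os.utime(result, (2000, 2000))  # ... than the output
            up_to_date = not needs_update([source], [result])

            os.utime(source, (3000, 3000))  # pretend the input was modified afterwards
            stale_output = needs_update([source], [result])

        print(missing_output, up_to_date, stale_output)
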
**************************************************************************
Disadvantages of the Ruffus design
**************************************************************************
    Are there any disadvantages to this trade-off for additional clarity?

        #) Each pipeline stage needs to take the right input and output. For example, if we specified the
           steps in the wrong order: ``a->b``, ``c->d``, ``b->c``, then no useful output would be produced.
        #) We cannot re-use the same recipes in different parts of the pipeline.
        #) Intermediate files need to be retained.


    In our experience, it is always obvious when pipeline operations are in the wrong order, precisely because the
    order of computation is the very essence of the design of each pipeline. Ruffus produces extra diagnostics when
    no output is created in a pipeline stage (this usually happens because of incorrectly specified regular expressions).

    Re-use of recipes is as simple as an extra call to common function code.

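    For instance, two stages at different points in a pipeline can share a recipe by calling the
    same helper. The sketch below is plain Python with hypothetical function names, intended only
    to show the shape of such re-use.

    .. code-block:: python

        def sort_unique(lines):
            """Common code shared by several stages: the re-usable 'recipe'."""
            return sorted(set(lines))


        def clean_raw_reads(lines):
            """Early stage: strip whitespace, then apply the shared recipe."""
            return sort_unique(line.strip() for line in lines)


        def tidy_final_report(lines):
            """Late stage: drop blank lines, then apply the same shared recipe."""
            return sort_unique(line for line in lines if line)


        early = clean_raw_reads([" b", "a ", "b"])
        late = tidy_final_report(["z", "", "y", "z"])
        print(early, late)
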
    Finally, some users have proposed future enhancements to **Ruffus** to handle unnecessary temporary / intermediate files.


.. index::
    pair: Design; Comparison of Ruffus with alternatives

=================================================
Alternatives to **Ruffus**
=================================================

    A comparison of more make-like tools is available from `Ian Holmes' group <http://biowiki.org/MakeComparison>`_.

    Build systems include:

            * `GNU make <http://www.gnu.org/software/make/>`_
            * `scons <http://www.scons.org/>`_
            * `ant <http://ant.apache.org/>`_
            * `rake <http://rake.rubyforge.org/>`_

    There are also complete workload management systems such as Condor.
    Various bioinformatics pipelines are also available, including that used by the
    leading genome annotation website Ensembl, Pegasys, GPIPE, Taverna, Wildfire, MOWserv,
    Triana, Cyrille2 etc. These are all either hardwired to specific databases and tasks,
    or have steep learning curves for both the scientist/developer and the IT system
    administrators.

    **Ruffus** is designed to be lightweight and unintrusive enough to use for writing pipelines
    with just 10 lines of code.


.. seealso::


   **Bioinformatics workload management systems**

    Condor:
        http://www.cs.wisc.edu/condor/description.html

    Ensembl Analysis pipeline:
        http://www.ncbi.nlm.nih.gov/pubmed/15123589


    Pegasys:
        http://www.ncbi.nlm.nih.gov/pubmed/15096276

    GPIPE:
        http://www.biomedcentral.com/pubmed/15096276

    Taverna:
        http://www.ncbi.nlm.nih.gov/pubmed/15201187

    Wildfire:
        http://www.biomedcentral.com/pubmed/15788106

    MOWserv:
        http://www.biomedcentral.com/pubmed/16257987

    Triana:
        http://dx.doi.org/10.1007/s10723-005-9007-3

    Cyrille2:
        http://www.biomedcentral.com/1471-2105/9/96


.. index::
    single: Acknowledgements

**************************************************
Acknowledgements
**************************************************
 *  Bruce Eckel's insightful article on
    `A Decorator Based Build System <http://www.artima.com/weblogs/viewpost.jsp?thread=241209>`_
    was the obvious inspiration for the use of decorators in *Ruffus*.

    The rest of *Ruffus* takes a different approach. In particular:

        #. *Ruffus* uses task-based, not file-based, dependencies.
        #. *Ruffus* tries to have minimal impact on the functions it decorates.

           Bruce Eckel's design wraps functions in "rule" objects.

           *Ruffus* tasks are added as attributes of the functions, which can still be
           called normally. This is how *Ruffus* decorators can be layered in any order
           onto the same task.

 *  Languages like C++ and Java would probably use a "mixin" approach.
    Python's easy support for reflection and function references,
    as well as the necessity of marshalling over process boundaries, dictated the
    internal architecture of *Ruffus*.
 *  The `Boost Graph library <http://www.boost.org>`_ for textbook implementations of directed
    graph traversals.
 *  `Graphviz <http://www.graphviz.org/>`_. Just works. Wonderful.
 *  Andreas Heger, Christoffer Nellåker and Grant Belgard for driving Ruffus towards
    ever simpler syntax.

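The attribute-based decorator approach mentioned in the first acknowledgement can be sketched as
follows. This is a minimal illustration, not the actual *Ruffus* implementation: the ``task``
decorator, the ``task_options`` attribute and the option names are all hypothetical.

.. code-block:: python

    def task(**options):
        """Hypothetical decorator: record options on the function and return it unwrapped."""
        def annotate(func):
            # Reuse the metadata left by an earlier decorator, or start a new dict.
            metadata = getattr(func, "task_options", {})
            metadata.update(options)
            func.task_options = metadata
            return func  # the original function, not a wrapper object
        return annotate


    @task(depends_on="load_data")
    @task(max_parallel=2)
    def mean(values):
        return sum(values) / len(values)


    @task(max_parallel=2)
    @task(depends_on="load_data")
    def mean_again(values):
        return sum(values) / len(values)


    print(mean([1, 2, 3]))  # still an ordinary, directly callable function
    print(mean.task_options == mean_again.task_options)  # layering order does not matter

Because each decorator only adds to a dictionary stored on the function, the decorated function
keeps its normal call behaviour and the decorators commute.
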