.. Design:

.. include:: global.inc

.. index::
    pair: Design; Ruffus

###############################
Design & Architecture
###############################

    The *ruffus* module has the following design goals:

        * Simple
        * Intuitive
        * Lightweight
        * Unintrusive
        * Flexible/Powerful


    Computational pipelines, especially in science, are best thought of in terms of data
    flowing through successive, dependent stages (**ruffus** calls these :term:`task`\ s).
    Traditionally, files have been used to
    link pipelined stages together. This means that computational pipelines can be managed
    using traditional software construction (`build`) systems.

=================================================
`GNU Make`
=================================================

    The grand-daddy of these is UNIX `make <http://en.wikipedia.org/wiki/Make_(software)>`_.
    `GNU make <http://www.gnu.org/software/make/>`_ is ubiquitous in the Linux world for
    installing and compiling software.
    It has been widely used to build computational pipelines because it supports:

        * Stopping and restarting computational processes
        * Running multiple, even thousands of jobs in parallel

.. _design.make_syntax_ugly:

******************************************************
Deficiencies of `make` / `gmake`
******************************************************

    However, make and `GNU make <http://www.gnu.org/software/make/>`_ use a much criticised
    specialised (domain-specific) language. The make language has poor support for modern
    programming language features such as variable scope, pattern matching and debugging.
    Make scripts require large amounts of often obscure shell scripting,
    and makefiles can quickly become unmaintainable.

.. _design.scons_and_rake:

=================================================
`Scons`, `Rake` and other `Make` alternatives
=================================================

    Many attempts have been made to produce a more modern version of make, with less of its
    historical baggage. These include the Java-based `Apache ant <http://ant.apache.org/>`_, which is specified in XML.

    More interesting is a new breed of build systems whose scripts are written in modern programming
    languages, rather than a specially-invented "build" specification syntax.
    These include the Python `scons <http://www.scons.org/>`_, Ruby `rake <http://rake.rubyforge.org/>`_ and
    its Python port `Smithy <http://packages.python.org/Smithy/>`_.

    The great advantage is that computational pipelines do not need to be artificially divided
    between (the often second-class) workflow management code and the logic of the real calculations and work
    in the pipeline. It also means that workflow management can use all the standard language and library
    features, for example, to read directories and match file names using regular expressions.

    **Ruffus** is much like scons in that the modern dynamic programming language Python is used seamlessly
    throughout its pipeline scripts.

.. _design.implicit_dependencies:

**************************************************************************
Implicit dependencies: disadvantages of `make` / `scons` / `rake`
**************************************************************************

    Although Python `scons <http://www.scons.org/>`_ and Ruby `rake <http://rake.rubyforge.org/>`_
    are in many ways more powerful and easier to use for building software, they are still an
    imperfect fit to the world of computational pipelines.

    This is a result of the way dependencies are specified, an essential part of their design inherited
    from `GNU make <http://www.gnu.org/software/make/>`_.


    The order of operations in all of these tools is specified in a *declarative* rather than
    *imperative* manner. This means that the sequence of steps that a build should take is
    not spelled out explicitly and directly. Instead, recipes are provided for turning input files
    of one type into another.

    So, for example, knowing that ``a->b``, ``b->c``, ``c->d``, the build
    system can infer how to get from ``a`` to ``d`` by performing the necessary operations in the correct order.

    This is immensely powerful for three reasons:

        #) The plumbing, such as dependency checking and passing output
           from one stage to another, is handled automatically by the build system. (This is the whole point!)
        #) The same *recipe* can be re-used at different points in the build.
        #) Intermediate files do not need to be retained.

           Given the automatic inference that ``a->b->c->d``,
           we don't need to keep ``b`` and ``c`` files around once ``d`` has been produced.



    The disadvantage is that because stages are specified only indirectly, in terms of
    file name matches, the flow through a complex build or a pipeline can be difficult to trace, and nigh
    impossible to debug when there are problems.


.. _design.explicit_dependencies_in_ruffus:

**************************************************************************
Explicit dependencies in `Ruffus`
**************************************************************************

    **Ruffus** takes a different approach. The order of operations is specified explicitly rather than inferred
    indirectly from the input and output types. So, for example, we would explicitly specify three successive and
    linked operations ``a->b``, ``b->c``, ``c->d``. The build system knows that the operations always proceed in
    this order.

    Looking at a **Ruffus** script, it is always immediately clear which succession of computational steps
    will be taken.

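
The difference between inferred and explicit ordering can be sketched in plain Python. This is only an illustration under stated assumptions: it does not use the Ruffus API, and every name in it (``a_to_b``, ``RECIPES``, ``infer_chain``, ``EXPLICIT_PIPELINE``, ``run``) is hypothetical. In the declarative style the tool has to search an unordered set of recipes for a route from ``a`` to ``d``; in the explicit style the route is simply written down.

```python
from collections import deque

# Hypothetical recipes: each one turns data of one "type" into the next.
def a_to_b(data):
    return data + ["b"]

def b_to_c(data):
    return data + ["c"]

def c_to_d(data):
    return data + ["d"]

# Declarative style: an unordered set of recipes keyed by
# (input type, output type). Nothing here states the order of execution.
RECIPES = {("c", "d"): c_to_d, ("a", "b"): a_to_b, ("b", "c"): b_to_c}

def infer_chain(recipes, source, target):
    """Breadth-first search for a sequence of recipes from source to target."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        current, chain = queue.popleft()
        if current == target:
            return chain
        for (src, dst), func in recipes.items():
            if src == current and dst not in seen:
                seen.add(dst)
                queue.append((dst, chain + [func]))
    raise ValueError(f"no route from {source!r} to {target!r}")

# Explicit style: the succession of stages is visible at a glance.
EXPLICIT_PIPELINE = [a_to_b, b_to_c, c_to_d]

def run(stages, data):
    """Call each stage in turn, feeding it the previous stage's output."""
    for stage in stages:
        data = stage(data)
    return data

inferred = infer_chain(RECIPES, "a", "d")
inferred_names = [func.__name__ for func in inferred]
explicit_names = [func.__name__ for func in EXPLICIT_PIPELINE]
```

Both styles end up running the same three functions in the same order. The point of the sketch is that with the explicit list the order can be read directly from the script, whereas with the recipe table it only emerges once the search has run.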

    **Ruffus** values clarity over syntactic cleverness.

.. _design.static_dependencies:

**************************************************************************
Static dependencies: What `make` / `scons` / `rake` can't do (easily)
**************************************************************************

    `GNU make <http://www.gnu.org/software/make/>`_, `scons <http://www.scons.org/>`_ and `rake <http://rake.rubyforge.org/>`_
    work by inferring a static dependency graph (a directed acyclic graph) between all the files which
    are used by a computational pipeline. These tools locate the target that they are supposed
    to build and work backward through the dependency graph from that target,
    rebuilding anything that is out of date. This is perfect for building software,
    where the list of files can be computed **statically** at the beginning of the build.

    This is not an ideal match for scientific computational pipelines because:

        * | Though the *stages* of a pipeline (e.g. `compile` or `DNA alignment`) are
            invariably well-specified in advance, the number of
            operations (*job*\s) involved at each stage may not be.
          |

        * | A common approach is to break up large data sets into manageable chunks which
            can be operated on in parallel in computational clusters or farms
            (See `embarrassingly parallel problems <http://en.wikipedia.org/wiki/Embarrassingly_parallel>`_).
          | This means that the number of parallel operations or jobs varies with the data (the number of manageable chunks),
            and dependency trees cannot be calculated statically beforehand.
          |

    Computational pipelines require **dynamic** dependencies which are not calculated up-front, but
    at each stage of the pipeline.

    This is a *known* issue with traditional build systems, each of which has partial strategies to work around
    this problem:

        * gmake always builds the dependencies when first invoked, so dynamic dependencies require (complex!) recursive calls to gmake
        * `Rake dependencies unknown prior to running tasks <http://objectmix.com/ruby/759716-rake-dependencies-unknown-prior-running-tasks-2.html>`_.
        * `Scons: Using a Source Generator to Add Targets Dynamically <http://www.scons.org/wiki/DynamicSourceGenerator>`_


    **Ruffus** explicitly and straightforwardly handles tasks which produce an indeterminate (i.e. runtime dependent)
    number of outputs, using a **split** / **transform** / **merge** idiom.

=============================================================================
Managing pipelines stage-by-stage using **Ruffus**
=============================================================================

    **Ruffus** manages pipeline stages directly.

        #) | The computational operations for each stage of the pipeline are written by you, in
             separate Python functions.
           | (These correspond to `gmake pattern rules <http://www.gnu.org/software/make/manual/make.html#Pattern-Rules>`_)
           |

        #) | The dependencies between pipeline stages (Python functions) are specified up-front.
           | These can be displayed as a flow chart.

           .. image:: images/front_page_flowchart.png

        #) **Ruffus** makes sure pipeline stage functions are called in the right order,
           with the right parameters, running in parallel using multiprocessing if necessary.

        #) Checkpointing automatically determines if all or any parts
           of the pipeline are out-of-date and need to be rerun.

        #) Separate pipeline stages, and operations within each pipeline stage,
           can be run in parallel provided they are not inter-dependent.

    Another way of looking at this is that **ruffus** re-constructs datafile dependencies dynamically
    on-the-fly when it gets to each stage of the pipeline, giving much more flexibility.

**************************************************************************
Disadvantages of the Ruffus design
**************************************************************************

    Are there any disadvantages to this trade-off for additional clarity?

        #) Each pipeline stage needs to take the right input and output. For example, if we specified the
           steps in the wrong order, ``a->b``, ``c->d``, ``b->c``, then no useful output would be produced.
        #) We cannot re-use the same recipes in different parts of the pipeline.
        #) Intermediate files need to be retained.


    In our experience, it is always obvious when pipeline operations are in the wrong order, precisely because the
    order of computation is the very essence of the design of each pipeline. Ruffus produces extra diagnostics when
    no output is created in a pipeline stage (this usually happens with incorrectly specified regular expressions).

    Re-use of recipes is as simple as an extra call to common function code.

    Finally, some users have proposed future enhancements to **Ruffus** to handle unnecessary temporary / intermediate files.


.. index::
    pair: Design; Comparison of Ruffus with alternatives

=================================================
Alternatives to **Ruffus**
=================================================

    A comparison of more make-like tools is available from `Ian Holmes' group <http://biowiki.org/MakeComparison>`_.


    Build systems include:

        * `GNU make <http://www.gnu.org/software/make/>`_
        * `scons <http://www.scons.org/>`_
        * `ant <http://ant.apache.org/>`_
        * `rake <http://rake.rubyforge.org/>`_

    There are also complete workload management systems such as Condor.
    Various bioinformatics pipelines are also available, including that used by the
    leading genome annotation website Ensembl, Pegasys, GPIPE, Taverna, Wildfire, MOWserv,
    Triana, Cyrille2 etc. These are all either hardwired to specific databases and tasks,
    or have steep learning curves for both the scientist/developer and the IT system
    administrators.

    **Ruffus** is designed to be lightweight and unintrusive enough to use for writing pipelines
    with just 10 lines of code.


.. seealso::


    **Bioinformatics workload management systems**

        Condor:
            http://www.cs.wisc.edu/condor/description.html

        Ensembl Analysis pipeline:
            http://www.ncbi.nlm.nih.gov/pubmed/15123589


        Pegasys:
            http://www.ncbi.nlm.nih.gov/pubmed/15096276

        GPIPE:
            http://www.biomedcentral.com/pubmed/15096276

        Taverna:
            http://www.ncbi.nlm.nih.gov/pubmed/15201187

        Wildfire:
            http://www.biomedcentral.com/pubmed/15788106

        MOWserv:
            http://www.biomedcentral.com/pubmed/16257987

        Triana:
            http://dx.doi.org/10.1007/s10723-005-9007-3

        Cyrille2:
            http://www.biomedcentral.com/1471-2105/9/96


.. index::
    single: Acknowledgements

**************************************************
Acknowledgements
**************************************************

    *  Bruce Eckel's insightful article on
       `A Decorator Based Build System <http://www.artima.com/weblogs/viewpost.jsp?thread=241209>`_
       was the obvious inspiration for the use of decorators in *Ruffus*.

       The rest of *Ruffus* takes a different approach. In particular:

       #. *Ruffus* uses task-based, not file-based, dependencies.
       #. *Ruffus* tries to have minimal impact on the functions it decorates.

          Bruce Eckel's design wraps functions in "rule" objects.

          *Ruffus* tasks are added as attributes of the functions, which can still be
          called normally. This is how *Ruffus* decorators can be layered in any order
          onto the same task.

    *  Languages like C++ and Java would probably use a "mixin" approach.
       Python's easy support for reflection and function references,
       as well as the necessity of marshalling over process boundaries, dictated the
       internal architecture of *Ruffus*.
    *  The `Boost Graph library <http://www.boost.org>`_ for textbook implementations of directed
       graph traversals.
    *  `Graphviz <http://www.graphviz.org/>`_. Just works. Wonderful.
    *  Andreas Heger, Christoffer Nellåker and Grant Belgard for driving Ruffus towards
       ever simpler syntax.
