.. include:: ../../global.inc
.. include:: manual_chapter_numbers.inc

.. index::
    pair: Up to date; Tutorial
    pair: Task completion; Tutorial
    pair: Exceptions; Tutorial
    pair: Interrupted Pipeline; Tutorial

.. _new_manual.checkpointing:

######################################################################################################
|new_manual.checkpointing.chapter_num|: Checkpointing: Interrupted Pipelines and Exceptions
######################################################################################################


.. seealso::

    * :ref:`Manual Table of Contents <new_manual.table_of_contents>`

.. note::

    Remember to look at the example code:

    * :ref:`new_manual.checkpointing.code`



***************************************
Overview
***************************************
    .. image:: ../../images/theoretical_pipeline_schematic.png
       :scale: 50

    Computational pipelines transform your data in stages until the final result is produced.

    By default, *Ruffus* uses file modification times for the **Input** and **Output** files to determine
    whether each stage of a pipeline is up-to-date or not. But what happens when a task
    function is interrupted, whether from the command line or by an error, halfway through writing its output?

    In this case, the half-formed, truncated and corrupt **Output** file will look newer than its **Input** and hence appear up-to-date.


.. index::
    pair: Tutorial; interrupting tasks

.. _new_manual.interrupting_tasks:

***************************************
Interrupting tasks
***************************************
    Let us try with an example:

        .. code-block:: python
            :emphasize-lines: 20

            from ruffus import *
            import sys, time

            #   create initial files
            @originate(['job1.start'])
            def create_initial_files(output_file):
                with open(output_file, "w") as oo: pass


            #---------------------------------------------------------------
            #
            #   long task to interrupt
            #
            @transform(create_initial_files, suffix(".start"), ".output")
            def long_task(input_files, output_file):
                with open(output_file, "w") as ff:
                    ff.write("Unfinished...")
                    # sleep for 2 seconds here so you can interrupt me
                    sys.stderr.write("Job started. Press ^C to interrupt me now...\n")
                    time.sleep(2)
                    ff.write("\nFinished")
                    sys.stderr.write("Job completed.\n")


            #       Run
            pipeline_run([long_task])

    When this script runs, it pauses in the middle with this message::

        Job started. Press ^C to interrupt me now...

    If you interrupt the script by pressing Control-C at this point, you will see that ``job1.output`` contains only ``Unfinished...``.
    However, if you rerun the interrupted pipeline, *Ruffus* ignores the corrupt, incomplete file and repeats the job:

        .. code-block:: pycon

            >>> pipeline_run([long_task])
            Job started. Press ^C to interrupt me now...
            Job completed.

    And if you run ``pipeline_printout``:

        .. code-block:: pycon
            :emphasize-lines: 8

            >>> pipeline_printout(sys.stdout, [long_task], verbose=3)
            ________________________________________
            Tasks which will be run:

            Task = long_task
                   Job  = [job1.start
                         -> job1.output]
                     # Job needs update: Previous incomplete run leftover: [job1.output]


    We can see that *Ruffus* magically knows that the previous run was incomplete, and that ``job1.output`` is detritus that needs to be discarded.

.. _new_manual.logging_completed_jobs:

******************************************
Checkpointing: only log completed jobs
******************************************

    All is revealed if you look in the working directory. *Ruffus* has created a file called ``.ruffus_history.sqlite``.
    In this `SQLite  <https://sqlite.org/>`_ database, *Ruffus* logs only those files which are the result of a completed job;
    all other files are suspect.
    This checkpoint database is a fail-safe, not a substitute for checking file modification times. If the **Input** or **Output** files are
    modified, the pipeline will rerun.

    By default, *Ruffus* saves only file timestamps to the SQLite database, but you can also add a checksum of the pipeline task function body or parameters.
    This behaviour is controlled by the ``checksum_level`` parameter
    of ``pipeline_run()``. For example, if you do not want to save any timestamps or checksums:

        .. code-block:: python

            pipeline_run(checksum_level = 0)

            CHECKSUM_FILE_TIMESTAMPS      = 0     # only rerun when the file timestamps are out of date (classic mode)
            CHECKSUM_HISTORY_TIMESTAMPS   = 1     # Default: also rerun when the history shows a job as being out of date
            CHECKSUM_FUNCTIONS            = 2     # also rerun when the function body has changed
            CHECKSUM_FUNCTIONS_AND_PARAMS = 3     # also rerun when function parameters or the function body change

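    For example, a minimal sketch (reusing ``long_task`` from the example above) which asks *Ruffus* to rerun jobs whenever the task function itself is edited; the literal ``3`` is the value of ``CHECKSUM_FUNCTIONS_AND_PARAMS`` in the table above:

        .. code-block:: python

            # 3 == CHECKSUM_FUNCTIONS_AND_PARAMS: rerun whenever timestamps,
            # the function body or the function parameters change
            pipeline_run([long_task], checksum_level = 3)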

    .. note::

        Checksums are calculated from the `pickled  <http://docs.python.org/2/library/pickle.html>`_ string for the function code and parameters.
        If pickling fails, Ruffus will degrade gracefully to saving just the timestamp in the SQLite database.

.. _new_manual.history_files_cannot_be_shared:

****************************************************************************
Do not share the same checkpoint file across multiple pipelines!
****************************************************************************

    The name of the Ruffus python script is not saved in the checkpoint file alongside the timestamps and checksums.
    That means that you can rename your pipeline source code file without having to rerun the pipeline!
    The tradeoff is that if multiple pipelines run from the same directory, save their histories to the
    same SQLite database file, and have overlapping file names (all of these are bad ideas anyway!), this is
    bound to be a source of confusion.

    Luckily, the name and path of the checkpoint file can also be changed for each pipeline.

.. _new_manual.changing_history_file_name:

****************************************************************************
Setting checkpoint file names
****************************************************************************

    .. warning::

        Some file systems do not appear to support SQLite at all:

        There are reports that SQLite databases have `file locking problems  <http://beets.radbox.org/blog/sqlite-nightmare.html>`_ on Lustre.

        The best solution is to keep the SQLite database on an alternate, compatible file system away from the working directory, if possible.

============================================================================================================================================================
Environment variable ``DEFAULT_RUFFUS_HISTORY_FILE``
============================================================================================================================================================

    The name of the checkpoint file is taken from the environment variable ``DEFAULT_RUFFUS_HISTORY_FILE``:

    .. code-block:: bash

        export DEFAULT_RUFFUS_HISTORY_FILE=/some/where/.ruffus_history.sqlite

    This gives considerable flexibility, and allows a system-wide policy to be set so that all Ruffus checkpoint files are saved to logical, predictable paths.

    .. note::

        It is your responsibility to make sure that the requisite destination directories for the checkpoint files exist beforehand!


    Where this environment variable is not set, the checkpoint file defaults to ``.ruffus_history.sqlite`` in your working directory.


============================================================================================================================================================
Setting the checkpoint file name manually
============================================================================================================================================================

    The checkpoint file name can always be overridden as a parameter to the relevant Ruffus functions:

        .. code-block:: python

            pipeline_run(history_file = "XXX")
            pipeline_printout(history_file = "XXX")
            pipeline_printout_graph(history_file = "XXX")

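    For example, a hypothetical sketch in which two unrelated pipelines run from the same working directory, each keeping its own checkpoint database (the task names are illustrative only):

        .. code-block:: python

            from ruffus import pipeline_run

            # each pipeline logs its jobs to its own SQLite file,
            # so the two histories can never become entangled
            pipeline_run([mapping_task],  history_file = ".mapping.ruffus_history.sqlite")
            pipeline_run([counting_task], history_file = ".counting.ruffus_history.sqlite")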

    There is also built-in support in ``Ruffus.cmdline``. If you use this module, you can simply add to your command line:

        .. code-block:: bash

            # use a custom checkpoint file
            myscript --checksum_file_name .myscript.ruffus_history.sqlite

    This takes precedence over everything else.

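    If you have not seen ``Ruffus.cmdline`` before, the standard boilerplate is sketched below; ``cmdline.run(options)`` forwards the parsed ``--checksum_file_name`` (among other options) to ``pipeline_run()``:

        .. code-block:: python

            from ruffus import *
            from ruffus import cmdline

            parser = cmdline.get_argparse(description = "What does this pipeline do?")
            options = parser.parse_args()

            #   <pipeline task definitions go here>

            # forwards --checksum_file_name, --verbose, --jobs etc. to pipeline_run()
            cmdline.run(options)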


****************************************************************************
Useful checkpoint file name policies with ``DEFAULT_RUFFUS_HISTORY_FILE``
****************************************************************************

    If the pipeline script is called ``test/bin/scripts/run.me.py``, then these are the resulting checkpoint file locations:

============================================================================================================================================================
Example 1: Same directory, different name
============================================================================================================================================================
    If the environment variable is:

    .. code-block:: bash

        export DEFAULT_RUFFUS_HISTORY_FILE=.{basename}.ruffus_history.sqlite

    then the job checkpoint database for ``run.me.py`` will be ``.run.me.ruffus_history.sqlite``:

    .. code-block:: bash

        /test/bin/scripts/run.me.py
        /test/bin/scripts/.run.me.ruffus_history.sqlite

============================================================================================================================================================
Example 2: Different directory, same name
============================================================================================================================================================

    .. code-block:: bash

        export DEFAULT_RUFFUS_HISTORY_FILE=/common/path/for/job_history/.{basename}.ruffus_history.sqlite

    .. code-block:: bash

        /common/path/for/job_history/.run.me.ruffus_history.sqlite


============================================================================================================================================================
Example 3: Different directory, same name, keeping one level of subdirectory to disambiguate
============================================================================================================================================================

    .. code-block:: bash

        export DEFAULT_RUFFUS_HISTORY_FILE=/common/path/for/job_history/{subdir[0]}/.{basename}.ruffus_history.sqlite


    .. code-block:: bash

        /common/path/for/job_history/scripts/.run.me.ruffus_history.sqlite



============================================================================================================================================================
Example 4: Nested in a common directory
============================================================================================================================================================

    .. code-block:: bash

        export DEFAULT_RUFFUS_HISTORY_FILE=/common/path/for/job_history/{path}/.{basename}.ruffus_history.sqlite

    .. code-block:: bash

        /common/path/for/job_history/test/bin/scripts/.run.me.ruffus_history.sqlite

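    These substitution keywords can be illustrated in plain Python. The following sketch is not the actual *Ruffus* implementation, but shows how ``{path}``, ``{basename}`` and ``{subdir[0]}`` plausibly map onto components of the script path:

    .. code-block:: python

        import os

        script_path = "/test/bin/scripts/run.me.py"

        path, filename = os.path.split(script_path)         # "/test/bin/scripts"
        basename = os.path.splitext(filename)[0]            # "run.me"
        # subdir[0] is the innermost directory, subdir[1] its parent, etc.
        subdir = path.lstrip(os.sep).split(os.sep)[::-1]    # ["scripts", "bin", "test"]

        template = "/common/path/for/job_history/{subdir[0]}/.{basename}.ruffus_history.sqlite"
        print(template.format(path = path, basename = basename, subdir = subdir))
        # /common/path/for/job_history/scripts/.run.me.ruffus_history.sqlite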


.. index::
    pair: Tutorial; Regenerating the checkpoint file

.. _new_manual.regenerating_history_file:

******************************************************************************
Regenerating the checkpoint file
******************************************************************************

    Occasionally you may need to regenerate the checkpoint file.

    This could be necessary:

        * because you are upgrading from a previous version of Ruffus without checkpoint file support
        * on the rare occasions when the SQLite file becomes corrupted and has to be deleted
        * if you wish to circumvent the file checking of Ruffus after making some manual changes!

    To do this, it is only necessary to call ``pipeline_run()`` appropriately:

        .. code-block:: python

            CHECKSUM_REGENERATE = 2
            pipeline_run(touch_files_only = CHECKSUM_REGENERATE)


    Similarly, if you are using ``Ruffus.cmdline``, you can call:

        .. code-block:: bash

            myscript --recreate_database


    Note that this regenerates the checkpoint file to reflect the existing *Input* and *Output* files on disk.
    In other words, the onus is on you to make sure there are no half-formed, corrupt files. On the other hand,
    the pipeline does not need to have been run successfully beforehand for this to work. Essentially, Ruffus
    pretends to run the pipeline, logging all the files with consistent file modification times, and stopping
    at the first task which appears out of date or incomplete.


.. index::
    pair: rules; for rerunning jobs

.. _new_manual.skip_up_to_date.rules:

******************************************************************************
Rules for determining if files are up to date
******************************************************************************
    The following simple rules are used by *Ruffus*.

    #. The pipeline stage will be rerun if:

        * any of the **Input** files are new (newer than the **Output** files)
        * any of the **Output** files are missing

    #. In addition, it is possible to run jobs which create files from scratch.

        * If no **Input** file names are supplied, the job will only run if any **Output** file is missing (see the sketch after this list).

    #. Finally, if no **Output** file names are supplied, the job will always run.

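    For instance, a minimal sketch of the second rule, using ``@originate`` to create a file from scratch (the file name is illustrative):

        .. code-block:: python

            from ruffus import *

            #   no Input files: the job runs only when an Output file is missing
            @originate(["from_scratch.txt"])
            def make_from_scratch(output_file):
                with open(output_file, "w") as oo:
                    oo.write("made from scratch\n")

            pipeline_run([make_from_scratch])   # runs: "from_scratch.txt" is missing
            pipeline_run([make_from_scratch])   # skipped: the output exists and is logged as complete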


.. index::
    pair: Exception; Missing input files

******************************************************************************
Missing files generate exceptions
******************************************************************************

    If the **Input** files for a job are missing, the task function will have no way
    to produce its **Output**. In this case, a ``MissingInputFileError`` exception will be raised
    automatically. For example,

        ::

            task.MissingInputFileError: No way to run job: Input file ['a.1'] does not exist
            for Job = ["a.1" -> "a.2", "A file"]

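    A minimal sketch that would raise the exception above if ``a.1`` had not been created (the file names and the extra ``"A file"`` parameter are hypothetical, chosen to match the message):

        .. code-block:: python

            from ruffus import *

            @transform(["a.1"], suffix(".1"), ".2", "A file")
            def process(input_file, output_file, extra_parameter):
                pass

            # raises task.MissingInputFileError if "a.1" does not exist on disk
            pipeline_run([process])
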
.. index::
    pair: Manual; Timestamp resolution

******************************************************************************
Caveats: Coarse timestamp resolution
******************************************************************************

    Note that modification times are only precise to the nearest second under some older file systems
    (ext2/ext3?). This may also be true for networked file systems.

    *Ruffus* supplements the file system time resolution by independently recording the timestamp at
    full OS resolution (usually to at least the millisecond) at job completion, when presumably the **Output**
    files will have been created.

    However, *Ruffus* only does this if the discrepancy between file time and system time is less than a second
    (i.e. attributable to poor file system timestamp resolution). If there are large mismatches between the two, due for example
    to network time slippage, misconfiguration etc., *Ruffus* reverts to using the file system time and adds a one second
    delay between jobs (via ``time.sleep()``) to make sure input and output file stamps are different.

    If you know that your file system has coarse-grained timestamp resolution, you can always revert to this very conservative behaviour,
    at the price of some annoying one second pauses, by setting :ref:`pipeline_run(one_second_per_job = True) <pipeline_functions.pipeline_run>`.

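    For example, a one-line sketch (reusing ``long_task`` from the earlier example):

        .. code-block:: python

            # conservative mode for file systems with coarse timestamps:
            # pause one second between jobs so file stamps always differ
            pipeline_run([long_task], one_second_per_job = True)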


.. index::
    pair: Manual; flag files

******************************************************************************
Flag files: Checkpointing for the paranoid
******************************************************************************

    One other way of checkpointing your pipelines is to create an extra "flag" file as an additional
    **Output** file name. The flag file is only created or updated when everything else in the
    job has completed successfully and been written to disk. A missing or out-of-date flag file would
    then be a sign to Ruffus that the task never completed properly in the first place.

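    A minimal sketch (reusing ``create_initial_files`` from the earlier example; the ``.done`` suffix is an arbitrary choice):

        .. code-block:: python

            #   the ".done" flag file is touched only after the real output
            #   is safely on disk, so a missing flag marks an unfinished job
            @transform(create_initial_files, suffix(".start"), [".output", ".done"])
            def paranoid_task(input_file, output_files):
                real_output, flag_file = output_files
                with open(real_output, "w") as oo:
                    oo.write("Finished\n")
                # touch the flag file last, only when everything has succeeded
                open(flag_file, "w").close()
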
    This used to be by far the best way of performing checkpointing in Ruffus, and it is still
    the most bulletproof way of proceeding. For example, even the loss or corruption
    of the checkpoint file would not affect things greatly.

    Nevertheless, flag files are largely superfluous in modern *Ruffus*.