.. include:: ../../global.inc
.. include:: manual_chapter_numbers.inc

.. index::
    pair: Up to date; Tutorial
    pair: Task completion; Tutorial
    pair: Exceptions; Tutorial
    pair: Interrupted Pipeline; Tutorial

.. _new_manual.checkpointing:

######################################################################################################
|new_manual.checkpointing.chapter_num|: Checkpointing: Interrupted Pipelines and Exceptions
######################################################################################################


.. seealso::

    * :ref:`Manual Table of Contents <new_manual.table_of_contents>`

.. note::

    Remember to look at the example code:

    * :ref:`new_manual.checkpointing.code`



***************************************
Overview
***************************************

    .. image:: ../../images/theoretical_pipeline_schematic.png
       :scale: 50

    Computational pipelines transform your data in stages until the final result is produced.

    By default, *Ruffus* uses the file modification times of the **Input** and **Output** files to determine
    whether each stage of a pipeline is up to date. But what happens when a task
    function is interrupted, whether from the command line or by an error, halfway through writing its output?

    In this case, the half-formed, truncated and corrupt **Output** file will look newer than its **Input**, and hence up to date.


.. index::
    pair: Tutorial; interrupting tasks

.. _new_manual.interrupting_tasks:

***************************************
Interrupting tasks
***************************************

    Let us try this with an example:

    .. code-block:: python
        :emphasize-lines: 20

        from ruffus import *
        import sys, time

        # create initial files
        @originate(['job1.start'])
        def create_initial_files(output_file):
            with open(output_file, "w") as oo: pass


        #---------------------------------------------------------------
        #
        #   long task to interrupt
        #
        @transform(create_initial_files, suffix(".start"), ".output")
        def long_task(input_files, output_file):
            with open(output_file, "w") as ff:
                ff.write("Unfinished...")
                # sleep for 2 seconds here so you can interrupt me
                sys.stderr.write("Job started. Press ^C to interrupt me now...\n")
                time.sleep(2)
                ff.write("\nFinished")
                sys.stderr.write("Job completed.\n")


        # Run
        pipeline_run([long_task])


    When this script runs, it pauses in the middle with this message::

        Job started. Press ^C to interrupt me now...

    If you interrupt the script by pressing Control-C at this point, you will see that ``job1.output`` contains only ``Unfinished...``.
    However, if you rerun the interrupted pipeline, *Ruffus* ignores the corrupt, incomplete file:

    .. code-block:: pycon

        >>> pipeline_run([long_task])
        Job started. Press ^C to interrupt me now...
        Job completed.

    And if you had run ``pipeline_printout``:

    .. code-block:: pycon
        :emphasize-lines: 8

        >>> pipeline_printout(sys.stdout, [long_task], verbose=3)
        ________________________________________
        Tasks which will be run:

        Task = long_task
               Job  = [job1.start
                     -> job1.output]
                 # Job needs update: Previous incomplete run leftover: [job1.output]


    We can see that *Ruffus* magically knows that the previous run was incomplete, and that ``job1.output`` is detritus that needs to be discarded.

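The timestamp trap can be reproduced with nothing but the standard library. This sketch (the file names mirror the example above; it is an illustration, not Ruffus code) shows why a truncated output looks up to date by modification time alone:

```python
import os, time

# Simulate an interrupted job: the input exists, and the output
# was started but never finished.
with open("job1.start", "w"):
    pass
time.sleep(0.01)                      # make sure the mtimes differ
with open("job1.output", "w") as ff:
    ff.write("Unfinished...")         # the "crash" happens before "Finished"

# By modification time alone, the truncated output looks newer than
# its input, i.e. up to date -- which is exactly why Ruffus also
# consults its checkpoint database.
looks_up_to_date = os.path.getmtime("job1.output") >= os.path.getmtime("job1.start")
print(looks_up_to_date)               # True
```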
.. _new_manual.logging_completed_jobs:

******************************************
Checkpointing: only log completed jobs
******************************************

    All is revealed if you look in the working directory. *Ruffus* has created a file called ``.ruffus_history.sqlite``.
    In this `SQLite <https://sqlite.org/>`_ database, *Ruffus* logs only those files which are the result of a completed job;
    all other files are suspect.
    This checkpoint database is a fail-safe, not a substitute for checking file modification times: if the **Input** or **Output** files are
    modified, the pipeline will rerun.

    By default, *Ruffus* saves only file timestamps to the SQLite database, but you can also add a checksum of the pipeline task function body or its parameters.
    This behaviour is controlled by the ``checksum_level`` parameter
    of ``pipeline_run()``. For example, if you do not want to save any timestamps or checksums:

    .. code-block:: python

        pipeline_run(checksum_level = 0)

        CHECKSUM_FILE_TIMESTAMPS      = 0     # only rerun when the file timestamps are out of date (classic mode)
        CHECKSUM_HISTORY_TIMESTAMPS   = 1     # Default: also rerun when the history shows a job as being out of date
        CHECKSUM_FUNCTIONS            = 2     # also rerun when the function body has changed
        CHECKSUM_FUNCTIONS_AND_PARAMS = 3     # also rerun when the function parameters or function body change


    .. note::

        Checksums are calculated from the `pickled <http://docs.python.org/2/library/pickle.html>`_ string of the function code and parameters.
        If pickling fails, *Ruffus* degrades gracefully to saving just the timestamp in the SQLite database.

.. _new_manual.history_files_cannot_be_shared:

****************************************************************************
Do not share the same checkpoint file across multiple pipelines!
****************************************************************************

    The name of the Ruffus python script is not saved in the checkpoint file alongside the timestamps and checksums.
    This means that you can rename your pipeline source code file without having to rerun the pipeline!
    The tradeoff is that if multiple pipelines run from the same directory, save their histories to the
    same SQLite database file, and have overlapping file names (all of these are bad ideas anyway!), this is
    bound to be a source of confusion.

    Luckily, the name and path of the checkpoint file can also be changed for each pipeline.

.. _new_manual.changing_history_file_name:

****************************************************************************
Setting checkpoint file names
****************************************************************************

    .. warning::

        Some file systems do not appear to support SQLite at all:

        there are reports that SQLite databases have `file locking problems <http://beets.radbox.org/blog/sqlite-nightmare.html>`_ on Lustre.

        The best solution is to keep the SQLite database on a compatible file system away from the working directory, if possible.

============================================================================================================================================================
Environment variable ``DEFAULT_RUFFUS_HISTORY_FILE``
============================================================================================================================================================

    The name of the checkpoint file is the value of the environment variable ``DEFAULT_RUFFUS_HISTORY_FILE``.
    .. code-block:: bash

        export DEFAULT_RUFFUS_HISTORY_FILE=/some/where/.ruffus_history.sqlite

    This gives considerable flexibility, and allows a system-wide policy to be set so that all Ruffus checkpoint files go to logical, predictable paths.

    .. note::

        It is your responsibility to make sure that the requisite destination directories for the checkpoint files exist beforehand!


    Where this environment variable is missing, the checkpoint file defaults to ``.ruffus_history.sqlite`` in your working directory.


============================================================================================================================================================
Setting the checkpoint file name manually
============================================================================================================================================================

    The checkpoint file name can always be overridden as a parameter to the Ruffus functions:

    .. code-block:: python

        pipeline_run(history_file = "XXX")
        pipeline_printout(history_file = "XXX")
        pipeline_printout_graph(history_file = "XXX")


    There is also built-in support in ``Ruffus.cmdline``. If you use this module, you can simply add to your command line:

    .. code-block:: bash

        # use a custom checkpoint file
        myscript --checksum_file_name .myscript.ruffus_history.sqlite

    This takes precedence over everything else.
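The policies in the next section build the checkpoint file name from placeholders such as ``{basename}``, ``{path}`` and ``{subdir[0]}``. As an illustration of how such substitution might work (a sketch only: the helper name is hypothetical and Ruffus's own expansion code may differ), here is one plausible stdlib implementation, assuming ``{basename}`` is the script name without its extension, ``{path}`` is the script's directory, and ``{subdir[0]}`` is the immediate parent directory name:

```python
import os.path

def expand_history_template(template, script_path):
    """Expand documented checkpoint-file placeholders for a script path.

    Hypothetical helper for illustration only, not part of Ruffus.
    """
    path = os.path.dirname(script_path)                              # e.g. "test/bin/scripts"
    basename = os.path.splitext(os.path.basename(script_path))[0]    # e.g. "run.me"
    subdir0 = os.path.basename(path)                                 # e.g. "scripts"
    return (template
            .replace("{basename}", basename)
            .replace("{path}", path)
            .replace("{subdir[0]}", subdir0))

script = "test/bin/scripts/run.me.py"
print(expand_history_template(".{basename}.ruffus_history.sqlite", script))
# .run.me.ruffus_history.sqlite
```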
****************************************************************************
Useful checkpoint file name policies for ``DEFAULT_RUFFUS_HISTORY_FILE``
****************************************************************************

    If the pipeline script is called ``test/bin/scripts/run.me.py``, then these are the resulting checkpoint file locations:

============================================================================================================================================================
Example 1: same directory, different name
============================================================================================================================================================

    If the environment variable is:

    .. code-block:: bash

        export DEFAULT_RUFFUS_HISTORY_FILE=.{basename}.ruffus_history.sqlite

    then the job checkpoint database for ``run.me.py`` will be ``.run.me.ruffus_history.sqlite``:

    .. code-block:: bash

        /test/bin/scripts/run.me.py
        /test/bin/scripts/.run.me.ruffus_history.sqlite

============================================================================================================================================================
Example 2: different directory, same name
============================================================================================================================================================

    .. code-block:: bash

        export DEFAULT_RUFFUS_HISTORY_FILE=/common/path/for/job_history/.{basename}.ruffus_history.sqlite

    .. code-block:: bash

        /common/path/for/job_history/.run.me.ruffus_history.sqlite


============================================================================================================================================================
Example 3: different directory, same name, but keep one level of subdirectory to disambiguate
============================================================================================================================================================

    .. code-block:: bash

        export DEFAULT_RUFFUS_HISTORY_FILE=/common/path/for/job_history/{subdir[0]}/.{basename}.ruffus_history.sqlite


    .. code-block:: bash

        /common/path/for/job_history/scripts/.run.me.ruffus_history.sqlite


============================================================================================================================================================
Example 4: nested in a common directory
============================================================================================================================================================

    .. code-block:: bash

        export DEFAULT_RUFFUS_HISTORY_FILE=/common/path/for/job_history/{path}/.{basename}.ruffus_history.sqlite

    .. code-block:: bash

        /common/path/for/job_history/test/bin/scripts/.run.me.ruffus_history.sqlite




.. index::
    pair: Tutorial; Regenerating the checkpoint file

.. _new_manual.regenerating_history_file:

******************************************************************************
Regenerating the checkpoint file
******************************************************************************

    Occasionally you may need to regenerate the checkpoint file.
    This could be necessary:

    * because you are upgrading from a previous version of Ruffus without checkpoint file support,
    * on the rare occasions when the SQLite file becomes corrupted and has to be deleted, or
    * if you wish to circumvent Ruffus's file checking after making some manual changes!

    To do this, it is only necessary to call ``pipeline_run`` appropriately:

    .. code-block:: python

        CHECKSUM_REGENERATE = 2
        pipeline_run(touch_files_only = CHECKSUM_REGENERATE)


    Similarly, if you are using ``Ruffus.cmdline``, you can call:

    .. code-block:: bash

        myscript --recreate_database


    Note that this regenerates the checkpoint file to reflect the *Input* and *Output* files that already exist on disk.
    In other words, the onus is on you to make sure there are no half-formed, corrupt files. On the other hand,
    the pipeline does not need to have run successfully beforehand for this to work. Essentially, Ruffus
    pretends to run the pipeline, logging all the files with consistent file modification times, and stopping
    at the first task which appears out of date or incomplete.


.. index::
    pair: rules; for rerunning jobs

.. _new_manual.skip_up_to_date.rules:

******************************************************************************
Rules for determining if files are up to date
******************************************************************************

    *Ruffus* uses the following simple rules:

    #. A pipeline stage will be rerun if:

       * any of the **Input** files are newer than any of the **Output** files, or
       * any of the **Output** files are missing.

    #. In addition, it is possible to run jobs which create files from scratch:

       * if no **Input** file names are supplied, the job will only run if any **Output** file is missing.

    #. Finally, if no **Output** file names are supplied, the job will always run.
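The rules above can be paraphrased in plain Python. This is an illustrative sketch of the classic, timestamp-only check, not Ruffus's actual implementation:

```python
import os

def needs_update(inputs, outputs):
    """Sketch of the classic (timestamp-only) up-to-date rules."""
    if not outputs:
        return True                      # no Output files supplied: always run
    if any(not os.path.exists(f) for f in outputs):
        return True                      # a missing Output file forces a rerun
    if not inputs:
        return False                     # creating files from scratch: outputs exist, so skip
    newest_input = max(os.path.getmtime(f) for f in inputs)
    oldest_output = min(os.path.getmtime(f) for f in outputs)
    return newest_input > oldest_output  # rerun if any Input is newer than any Output
```

Note that this is exactly the logic that the checkpoint database supplements: a truncated output file passes every one of these checks.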
.. index::
    pair: Exception; Missing input files

******************************************************************************
Missing files generate exceptions
******************************************************************************

    If the *Input* files for a job are missing, the task function will have no way
    to produce its *Output*. In this case, a ``MissingInputFileError`` exception is raised
    automatically. For example:

    ::

        task.MissingInputFileError: No way to run job: Input file ['a.1'] does not exist
        for Job = ["a.1" -> "a.2", "A file"]

.. index::
    pair: Manual; Timestamp resolution

******************************************************************************
Caveats: coarse timestamp resolution
******************************************************************************

    Note that modification times are only precise to the nearest second under some older file systems
    (e.g. ext2/ext3). This may also be true for some networked file systems.

    *Ruffus* supplements the file system time resolution by independently recording a timestamp at
    full OS resolution (usually to at least the millisecond) at job completion, when presumably the **Output**
    files will have been created.

    However, *Ruffus* only does this if the discrepancy between file time and system time is less than one second
    (i.e. attributable to poor file system timestamp resolution). If there are larger mismatches between the two, due for example
    to network time slippage or misconfiguration, *Ruffus* reverts to using the file system time and adds a one second
    delay between jobs (via ``time.sleep()``) to make sure input and output file stamps differ.
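If you are unsure whether your file system records sub-second modification times, a quick probe with the standard library can tell you. This is an illustrative sketch, not part of Ruffus; since a file can coincidentally land on a whole-second boundary, several probes are taken:

```python
import os, tempfile

def mtime_has_subsecond_precision(directory, probes=5):
    """Write scratch files and check whether any mtime carries
    sub-second information (st_mtime_ns not a whole number of seconds)."""
    for _ in range(probes):
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        ns = os.stat(path).st_mtime_ns
        os.remove(path)
        if ns % 1_000_000_000:
            return True    # at least one mtime had a fractional second
    return False           # every probe was a whole second: likely 1 s resolution
```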
    If you know that your file system has coarse-grained timestamp resolution, you can always revert to this very conservative behaviour,
    at the price of some annoying one-second pauses, by setting :ref:`pipeline_run(one_second_per_job = True) <pipeline_functions.pipeline_run>`.



.. index::
    pair: Manual; flag files

******************************************************************************
Flag files: checkpointing for the paranoid
******************************************************************************

    One other way of checkpointing your pipelines is to create an extra "flag" file as an additional
    **Output** file name. The flag file is only created, or updated, when everything else in the
    job has completed successfully and been written to disk. A missing or out-of-date flag file is then
    a sign to Ruffus that the task never completed properly in the first place.

    This used to be the best way of performing checkpointing in Ruffus, and it is still
    the most bulletproof way of proceeding. For example, even the loss or corruption
    of the checkpoint file would not affect things greatly.

    Nevertheless, flag files are largely superfluous in modern *Ruffus*.
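For completeness, the flag-file pattern looks like this in plain Python. This is a hedged sketch, not Ruffus API: the function names, the file names and the task body (upper-casing a text file) are all hypothetical; in a real Ruffus pipeline the flag would simply be listed as an extra **Output** file of the task:

```python
import os

def run_job(input_file, output_file, flag_file):
    """Write the real output, then touch the flag only on success.

    Illustrative sketch of the flag-file pattern; not Ruffus code.
    """
    with open(input_file) as inp, open(output_file, "w") as out:
        out.write(inp.read().upper())        # the real work
    # Touching the flag last means a crash above leaves no flag,
    # so the job is treated as incomplete on the next run.
    with open(flag_file, "w"):
        pass

def job_completed(flag_file, output_file):
    """The job counts as done only if the flag exists and is not
    older than the output it vouches for."""
    return (os.path.exists(flag_file)
            and os.path.exists(output_file)
            and os.path.getmtime(flag_file) >= os.path.getmtime(output_file))
```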