..
    For doctests:

    >>> from joblib.testing import warnings_to_stdout
    >>> warnings_to_stdout()

.. _memory:

===========================================
On demand recomputing: the `Memory` class
===========================================

.. currentmodule:: joblib.memory

Use case
--------

The `Memory` class defines a context for lazy evaluation of functions:
it stores the results in a store, by default on disk, and does not
re-run the function twice for the same arguments.

..
    Commented out for brevity

    You can use it as a context, with its `eval` method:

    .. automethod:: Memory.eval

    or decorate functions with the `cache` method:

    .. automethod:: Memory.cache

It works by explicitly saving the output to a file, and it is designed to
work with non-hashable and potentially large input and output data types
such as numpy arrays.

A simple example:
~~~~~~~~~~~~~~~~~

  First, define the cache directory::

    >>> cachedir = 'your_cache_location_directory'

  Then, instantiate a memory context that uses this cache directory::

    >>> from joblib import Memory
    >>> memory = Memory(cachedir, verbose=0)

  After these initial steps, just decorate a function to cache its output in
  this context::

    >>> @memory.cache
    ... def f(x):
    ...     print('Running f(%s)' % x)
    ...     return x

  Calling this function twice with the same argument does not execute it the
  second time: the output is simply reloaded from a pickle file in the cache
  directory::

    >>> print(f(1))
    Running f(1)
    1
    >>> print(f(1))
    1

  However, calling the function with a different parameter executes it and
  recomputes the output::

    >>> print(f(2))
    Running f(2)
    2

Comparison with `memoize`
~~~~~~~~~~~~~~~~~~~~~~~~~

The `memoize` decorator (http://code.activestate.com/recipes/52201/)
caches in memory all the inputs and outputs of a function call.
It can thus avoid running the same function twice, with very small
overhead. However, it compares input objects with those in the cache on
each call. As a result, for big objects there is a huge overhead.
Moreover, this approach does not work with numpy arrays, or other
objects subject to non-significant fluctuations. Finally, using
`memoize` with large objects will consume all the memory, whereas with
`Memory` objects are persisted to disk, using a persister optimized for
speed and memory usage (:func:`joblib.dump`).

In short, `memoize` is best suited for functions with "small" input and
output objects, whereas `Memory` is best suited for functions with complex
input and output objects, and aggressive persistence to disk.


Using with `numpy`
------------------

The original motivation behind the `Memory` context was to have a
memoize-like pattern on numpy arrays. `Memory` uses fast cryptographic
hashing of the input arguments to check if they have already been
computed.

An example
~~~~~~~~~~

  Define two functions: the first takes a number as an argument and
  outputs an array, which is used by the second one. Both functions are
  decorated with `Memory.cache`::

    >>> import numpy as np

    >>> @memory.cache
    ... def g(x):
    ...     print('A long-running calculation, with parameter %s' % x)
    ...     return np.hamming(x)

    >>> @memory.cache
    ... def h(x):
    ...     print('A second long-running calculation, using g(x)')
    ...     return np.vander(x)

  If the function `h` is called with the array created by the same call to `g`,
  `h` is not re-run::

    >>> a = g(3)
    A long-running calculation, with parameter 3
    >>> a
    array([0.08, 1.  , 0.08])
    >>> g(3)
    array([0.08, 1.  , 0.08])
    >>> b = h(a)
    A second long-running calculation, using g(x)
    >>> b2 = h(a)
    >>> b2
    array([[0.0064, 0.08  , 1.    ],
           [1.    , 1.    , 1.    ],
           [0.0064, 0.08  , 1.    ]])
    >>> np.allclose(b, b2)
    True


Using memmapping
~~~~~~~~~~~~~~~~

Memmapping (memory mapping) speeds up cache lookup when reloading large numpy
arrays::

    >>> cachedir2 = 'your_cachedir2_location'
    >>> memory2 = Memory(cachedir2, mmap_mode='r')
    >>> square = memory2.cache(np.square)
    >>> a = np.vander(np.arange(3)).astype(float)
    >>> square(a)
    ________________________________________________________________________________
    [Memory] Calling square...
    square(array([[0., 0., 1.],
           [1., 1., 1.],
           [4., 2., 1.]]))
    ___________________________________________________________square - 0.0s, 0.0min
    memmap([[ 0.,  0.,  1.],
            [ 1.,  1.,  1.],
            [16.,  4.,  1.]])

.. note::

    Notice the debug mode used in the above example. It is useful for
    tracing what is being re-executed, and where the time is spent.

If the `square` function is called with the same input argument, its
return value is loaded from disk using memmapping::

    >>> res = square(a)
    >>> print(repr(res))
    memmap([[ 0.,  0.,  1.],
            [ 1.,  1.,  1.],
            [16.,  4.,  1.]])

..

    The memmap file must be closed to avoid file locking on Windows; closing
    numpy.memmap objects is done with del, which flushes changes to the disk

    >>> del res

.. note::

    If the memory mapping mode used was 'r', as in the above example, the
    array will be read-only, and it will be impossible to modify it in
    place.

    On the other hand, using 'r+' or 'w+' will enable modification of the
    array, but will propagate these modifications to the disk, which will
    corrupt the cache. If you want to modify the array in memory, we
    suggest you use the 'c' mode: copy on write.
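The copy-on-write behaviour can be sketched as follows. This is a minimal
sketch, not part of the official doctests: it assumes a throwaway cache
directory created with ``tempfile.mkdtemp()``, and relies on the standard
``numpy.memmap`` copy-on-write semantics (writes stay in memory and are
never flushed to disk):

```python
import tempfile

import numpy as np
from joblib import Memory

# Hypothetical temporary cache location, for illustration only.
cachedir = tempfile.mkdtemp()

# 'c' (copy-on-write) mode: a reloaded array can be modified in
# memory without corrupting the on-disk cache.
memory = Memory(cachedir, mmap_mode='c', verbose=0)
square = memory.cache(np.square)

a = np.vander(np.arange(3)).astype(float)
first = square(a)       # computed and stored on disk
reloaded = square(a)    # reloaded through a copy-on-write memmap

reloaded[0, 0] = -1.0   # allowed in 'c' mode; the change stays in memory
again = np.asarray(square(a))   # a fresh reload from the untouched cache
```

With mode ``'r'`` the assignment to ``reloaded[0, 0]`` would raise an
error instead, and with ``'r+'`` it would be written through to the cache
file, silently corrupting it.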
Shelving: using references to cached values
-------------------------------------------

In some cases, it can be useful to get a reference to the cached
result, instead of having the result itself. A typical example of this
is when a lot of large numpy arrays must be dispatched across several
workers: instead of sending the data themselves over the network, send
a reference to the joblib cache, and let the workers read the data
from a network filesystem, potentially taking advantage of some
system-level caching too.

Getting a reference to the cache can be done using the
`call_and_shelve` method on the wrapped function::

    >>> result = g.call_and_shelve(4)
    A long-running calculation, with parameter 4
    >>> result  #doctest: +ELLIPSIS
    MemorizedResult(location="...", func="...g...", args_id="...")

Once computed, the output of `g` is stored on disk, and deleted from
memory. Reading the associated value can then be performed with the
`get` method::

    >>> result.get()
    array([0.08, 0.77, 0.77, 0.08])

The cache for this particular value can be cleared using the `clear`
method. Its invocation causes the stored value to be erased from disk.
Any subsequent call to `get` will cause a `KeyError` exception to be
raised::

    >>> result.clear()
    >>> result.get()  #doctest: +SKIP
    Traceback (most recent call last):
        ...
    KeyError: 'Non-existing cache value (may have been cleared).\nFile ... does not exist'

A `MemorizedResult` instance contains everything needed to read
the cached value. It can be pickled for transmission or storage, and
the printed representation can even be copy-pasted into a different
Python interpreter.

.. topic:: Shelving when cache is disabled

    In the case where caching is disabled (e.g.
    `Memory(None)`), the `call_and_shelve` method returns a
    `NotMemorizedResult` instance, which stores the full function
    output, instead of just a reference (since there is nothing to
    point to). Everything above remains valid, except for the
    copy-pasting feature.


Gotchas
-------

* **Across sessions, a function's cache is identified by the function's
  name**. Thus, if you assign the same name to different functions, their
  caches will override each other (there are 'name collisions'), and
  unwanted re-runs will happen::

    >>> @memory.cache
    ... def func(x):
    ...     print('Running func(%s)' % x)

    >>> func2 = func

    >>> @memory.cache
    ... def func(x):
    ...     print('Running a different func(%s)' % x)

  As long as the same session is used, there are no collisions (in joblib
  0.8 and above), although joblib does warn you that you are doing something
  dangerous::

    >>> func(1)
    Running a different func(1)

    >>> # FIXME: The next line should create a JoblibCollisionWarning but does not
    >>> # memory.rst:0: JobLibCollisionWarning: Possible name collisions between functions 'func' (<doctest memory.rst>:...) and 'func' (<doctest memory.rst>:...)
    >>> func2(1)  #doctest: +ELLIPSIS
    Running func(1)

    >>> func(1)  # No recomputation so far
    >>> func2(1)  # No recomputation so far

  ..
    Empty the in-memory cache to simulate exiting and reloading the
    interpreter

    >>> import joblib.memory
    >>> joblib.memory._FUNCTION_HASHES.clear()

  But if the interpreter is exited and then restarted, the cache will not
  be identified properly, and the functions will be re-run::

    >>> # FIXME: The next line should create a JoblibCollisionWarning but does not. It is also skipped because it does not produce any output
    >>> # memory.rst:0: JobLibCollisionWarning: Possible name collisions between functions 'func' (<doctest memory.rst>:...) and 'func' (<doctest memory.rst>:...)
    >>> func(1)  #doctest: +ELLIPSIS +SKIP
    Running a different func(1)
    >>> func2(1)  #doctest: +ELLIPSIS +SKIP
    Running func(1)

  As long as the same session is used, there is no needless
  recomputation::

    >>> func(1)  # No recomputation now
    >>> func2(1)  # No recomputation now

* **lambda functions**

  Beware that with Python 2.7 lambda functions cannot be distinguished
  from each other::

    >>> def my_print(x):
    ...     print(x)

    >>> f = memory.cache(lambda : my_print(1))
    >>> g = memory.cache(lambda : my_print(2))

    >>> f()
    1
    >>> f()
    >>> g()  # doctest: +SKIP
    memory.rst:0: JobLibCollisionWarning: Cannot detect name collisions for function '<lambda>'
    2
    >>> g()  # doctest: +SKIP
    >>> f()  # doctest: +SKIP
    1

* **memory cannot be used on some complex objects**, e.g. a callable
  object with a `__call__` method.

  However, it works on numpy ufuncs::

    >>> sin = memory.cache(np.sin)
    >>> print(sin(0))
    0.0

* **caching methods: memory is designed for pure functions and it is
  not recommended to use it for methods**. If you want to use caching
  inside a class, the recommended pattern is to cache a pure function
  and use the cached function inside your class, i.e. something like
  this::

    @memory.cache
    def compute_func(arg1, arg2, arg3):
        # long computation
        return result


    class Foo(object):
        def __init__(self, args):
            self.data = None

        def compute(self):
            self.data = compute_func(self.arg1, self.arg2, 40)


  Using ``Memory`` for methods is not recommended and has some caveats
  that make it very fragile from a maintenance point of view, because
  it is very easy to forget about these caveats as the software
  evolves. If this cannot be avoided (we would be interested in your
  use case, by the way), here are a few known caveats:

  1. A method cannot be decorated at class definition time,
     because when the class is instantiated, the first argument (self) is
     *bound*, and no longer accessible to the `Memory` object. The
     following code won't work::

       class Foo(object):

           @memory.cache  # WRONG
           def method(self, args):
               pass

     The right way to do this is to decorate at instantiation time::

       class Foo(object):

           def __init__(self, args):
               self.method = memory.cache(self.method)

           def method(self, ...):
               pass

  2. The cached method will have ``self`` as one of its
     arguments. That means that the result will be recomputed if
     anything in ``self`` changes. For example, if ``self.attr`` has
     changed, calling ``self.method`` will recompute the result even if
     ``self.method`` does not use ``self.attr`` in its body. Another
     example is changing ``self`` inside the body of
     ``self.method``. The consequence is that ``self.method`` will
     create cache entries that will not be reused in subsequent calls.
     To alleviate these problems, if you *know* that the result of
     ``self.method`` does not depend on ``self``, you can use
     ``self.method = memory.cache(self.method, ignore=['self'])``.

* **joblib cache entries may be invalidated after environment updates**.
  Values returned by ``joblib.hash`` are not guaranteed to stay
  constant across ``joblib`` versions. This means that **all** entries of a
  ``joblib.Memory`` cache can get invalidated when upgrading ``joblib``.
  Invalidation can also happen when upgrading a third-party library (such as
  ``numpy``): in such a case, only the cached function calls with parameters
  that are constructs (or contain references to constructs) defined in the
  upgraded library should potentially be invalidated after the upgrade.
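The recommended pure-function pattern from the caveats above can be made
runnable as follows. This is a sketch rather than part of the official
doctests: it assumes a throwaway cache directory created with
``tempfile.mkdtemp()``, and the ``calls`` list is a hypothetical probe
used only to make the cache hit observable:

```python
import tempfile

from joblib import Memory

# Hypothetical temporary cache location, for illustration only.
memory = Memory(tempfile.mkdtemp(), verbose=0)

calls = []  # probe: records actual executions, to show caching at work


@memory.cache
def compute_func(arg1, arg2, arg3):
    # stands in for a long computation
    calls.append((arg1, arg2, arg3))
    return arg1 + arg2 + arg3


class Foo(object):
    def __init__(self, arg1, arg2):
        self.arg1 = arg1
        self.arg2 = arg2
        self.data = None

    def compute(self):
        # the cached work lives in a pure, module-level function,
        # so the cache key does not depend on the state of ``self``
        self.data = compute_func(self.arg1, self.arg2, 40)


foo = Foo(1, 2)
foo.compute()
foo.compute()  # second call is a cache hit: compute_func does not run
```

Because the cache key is built from plain arguments rather than from
``self``, mutating unrelated attributes of ``foo`` between calls does not
invalidate the cached result.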
Ignoring some arguments
-----------------------

It may be useful not to recalculate a function when certain arguments
change, for instance a debug flag. `Memory` provides the `ignore` list::

    >>> @memory.cache(ignore=['debug'])
    ... def my_func(x, debug=True):
    ...     print('Called with x = %s' % x)
    >>> my_func(0)
    Called with x = 0
    >>> my_func(0, debug=False)
    >>> my_func(0, debug=True)
    >>> # my_func was not reevaluated


.. _memory_reference:

Reference documentation of the `Memory` class
---------------------------------------------

.. autoclass:: Memory
    :members: __init__, cache, eval, clear

Useful methods of decorated functions
-------------------------------------

Functions decorated by :meth:`Memory.cache` are :class:`MemorizedFunc`
objects that, in addition to behaving like normal functions, expose
methods useful for cache exploration and management.

.. autoclass:: MemorizedFunc
    :members: __init__, call, clear, check_call_in_cache


..
    Let us not forget to clean our cache dir once we are finished::

    >>> import shutil
    >>> try:
    ...     shutil.rmtree(cachedir)
    ...     shutil.rmtree(cachedir2)
    ... except OSError:
    ...     pass  # this can sometimes fail under Windows
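..
    The cache-management methods listed above can be sketched as follows.
    This is a hedged example, not part of the official doctests: it assumes
    a throwaway cache directory created with ``tempfile.mkdtemp()``, and a
    recent joblib version in which ``MemorizedFunc`` exposes
    ``check_call_in_cache`` and ``clear(warn=...)`` as used here.

```python
import tempfile

from joblib import Memory

# Hypothetical temporary cache location, for illustration only.
memory = Memory(tempfile.mkdtemp(), verbose=0)


@memory.cache
def f(x):
    return x ** 2


f(3)                            # populate the cache with f(3)
hit = f.check_call_in_cache(3)  # truthy: f(3) has been stored
miss = f.check_call_in_cache(4)  # falsy: f(4) was never computed
f.clear(warn=False)             # wipe this function's cache entirely
after_clear = f.check_call_in_cache(3)  # falsy again
```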