..
    For doctests:

    >>> from joblib.testing import warnings_to_stdout
    >>> warnings_to_stdout()

.. _memory:

===========================================
On demand recomputing: the `Memory` class
===========================================

.. currentmodule:: joblib.memory

Use case
--------

The `Memory` class defines a context for lazy evaluation of a function, by
putting the results in a store (by default on disk), and not re-running the
function twice for the same arguments.

..
 Commented out in favor of briefness

    You can use it as a context, with its `eval` method:

    .. automethod:: Memory.eval

    or decorate functions with the `cache` method:

    .. automethod:: Memory.cache

It works by explicitly saving the output to a file, and it is designed to
work with non-hashable and potentially large input and output data types,
such as numpy arrays.

A simple example:
~~~~~~~~~~~~~~~~~

  First, define the cache directory::

    >>> cachedir = 'your_cache_location_directory'

  Then, instantiate a memory context that uses this cache directory::

    >>> from joblib import Memory
    >>> memory = Memory(cachedir, verbose=0)

  After these initial steps, just decorate a function to cache its output in
  this context::

    >>> @memory.cache
    ... def f(x):
    ...     print('Running f(%s)' % x)
    ...     return x

  Calling this function twice with the same argument does not execute it the
  second time: the output is just reloaded from a pickle file in the cache
  directory::

    >>> print(f(1))
    Running f(1)
    1
    >>> print(f(1))
    1

  However, calling the function with a different parameter executes it and
  recomputes the output::

    >>> print(f(2))
    Running f(2)
    2

Comparison with `memoize`
~~~~~~~~~~~~~~~~~~~~~~~~~

The `memoize` decorator (http://code.activestate.com/recipes/52201/)
caches in memory all the inputs and outputs of a function call. It can
thus avoid running the same function twice, with a very small
overhead. However, it compares input objects with those in the cache on each
call. As a result, for big objects there is a huge overhead. Moreover,
this approach does not work with numpy arrays, or other objects subject
to non-significant fluctuations. Finally, using `memoize` with large
objects will consume all the memory, whereas with `Memory`, objects are
persisted to disk, using a persister optimized for speed and memory
usage (:func:`joblib.dump`).

In short, `memoize` is best suited for functions with "small" input and
output objects, whereas `Memory` is best suited for functions with complex
input and output objects, and aggressive persistence to disk.


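For illustration, the in-memory pattern that such a `memoize` decorator
implements can be sketched as follows (a minimal sketch, not the ActiveState
recipe itself; the dict lookup hashes and compares the arguments on every
call, which is exactly why this approach breaks on unhashable inputs such as
numpy arrays):

```python
import functools

def memoize(func):
    """Minimal in-memory memoizer: the cache is a dict keyed on the raw
    arguments, so every call hashes and compares the inputs."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper

@memoize
def slow_add(a, b):
    print('computing %s + %s' % (a, b))
    return a + b

print(slow_add(1, 2))   # prints 'computing 1 + 2', then 3
print(slow_add(1, 2))   # prints 3 only: served from the in-memory dict
```

Because the values live in process memory and the keys must be hashable,
this pattern trades generality and memory usage for speed, which is the
tradeoff described above.
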
Using with `numpy`
------------------

The original motivation behind the `Memory` context was to have a
memoize-like pattern on numpy arrays. `Memory` uses fast cryptographic
hashing of the input arguments to check if they have already been computed.
An example
~~~~~~~~~~

  Define two functions: the first takes a number as an argument and
  outputs an array, which is used by the second one. Both functions are
  decorated with `Memory.cache`::

    >>> import numpy as np

    >>> @memory.cache
    ... def g(x):
    ...     print('A long-running calculation, with parameter %s' % x)
    ...     return np.hamming(x)

    >>> @memory.cache
    ... def h(x):
    ...     print('A second long-running calculation, using g(x)')
    ...     return np.vander(x)

  If the function `h` is called with the array created by the same call to `g`,
  `h` is not re-run::

    >>> a = g(3)
    A long-running calculation, with parameter 3
    >>> a
    array([0.08, 1.  , 0.08])
    >>> g(3)
    array([0.08, 1.  , 0.08])
    >>> b = h(a)
    A second long-running calculation, using g(x)
    >>> b2 = h(a)
    >>> b2
    array([[0.0064, 0.08  , 1.    ],
           [1.    , 1.    , 1.    ],
           [0.0064, 0.08  , 1.    ]])
    >>> np.allclose(b, b2)
    True


Using memmapping
~~~~~~~~~~~~~~~~

Memmapping (memory mapping) speeds up cache lookup when reloading large numpy
arrays::

    >>> cachedir2 = 'your_cachedir2_location'
    >>> memory2 = Memory(cachedir2, mmap_mode='r')
    >>> square = memory2.cache(np.square)
    >>> a = np.vander(np.arange(3)).astype(float)
    >>> square(a)
    ________________________________________________________________________________
    [Memory] Calling square...
    square(array([[0., 0., 1.],
           [1., 1., 1.],
           [4., 2., 1.]]))
    ___________________________________________________________square - 0.0s, 0.0min
    memmap([[ 0.,  0.,  1.],
            [ 1.,  1.,  1.],
            [16.,  4.,  1.]])

.. note::

    Notice the debug mode used in the above example: it is useful for
    tracing what is being re-executed, and where the time is spent.

If the `square` function is called with the same input argument, its
return value is loaded from the disk using memmapping::

    >>> res = square(a)
    >>> print(repr(res))
    memmap([[ 0.,  0.,  1.],
            [ 1.,  1.,  1.],
            [16.,  4.,  1.]])

..

 The memmap file must be closed to avoid file locking on Windows; closing
 numpy.memmap objects is done with del, which flushes changes to the disk

    >>> del res

.. note::

   If the memory mapping mode used was 'r', as in the above example, the
   array will be read-only, and it will be impossible to modify it in place.

   On the other hand, using 'r+' or 'w+' will enable modification of the
   array, but will propagate these modifications to the disk, which will
   corrupt the cache. If you want to modify the array in memory, we
   suggest you use the 'c' mode: copy on write.

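To illustrate the 'c' mode recommended in the note, here is a sketch (using a
temporary cache directory and hypothetical names `memory3`/`square3`) showing
that in-memory modifications of a loaded result do not reach the cache on
disk:

```python
import tempfile

import numpy as np
from joblib import Memory

with tempfile.TemporaryDirectory() as tmp:
    memory3 = Memory(tmp, mmap_mode='c', verbose=0)
    square3 = memory3.cache(np.square)
    a = np.arange(3, dtype=float)

    square3(a)                    # first call: computes and persists the result
    res = square3(a)              # later calls load a copy-on-write memmap
    res[0] = 99.0                 # allowed, but the change stays in memory
    fresh = np.array(square3(a))  # reload from disk: the cached values are intact
    modified = float(res[0])

print(fresh)     # still the original squares
```
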

Shelving: using references to cached values
-------------------------------------------

In some cases, it can be useful to get a reference to the cached
result, instead of having the result itself. A typical example of this
is when a lot of large numpy arrays must be dispatched across several
workers: instead of sending the data themselves over the network, send
a reference to the joblib cache, and let the workers read the data
from a network filesystem, potentially taking advantage of some
system-level caching too.

Getting a reference to the cache can be done using the
`call_and_shelve` method on the wrapped function::

    >>> result = g.call_and_shelve(4)
    A long-running calculation, with parameter 4
    >>> result  #doctest: +ELLIPSIS
    MemorizedResult(location="...", func="...g...", args_id="...")

Once computed, the output of `g` is stored on disk, and deleted from
memory. Reading the associated value can then be performed with the
`get` method::

    >>> result.get()
    array([0.08, 0.77, 0.77, 0.08])

The cache for this particular value can be cleared using the `clear`
method. Its invocation causes the stored value to be erased from disk.
Any subsequent call to `get` will cause a `KeyError` exception to be
raised::

    >>> result.clear()
    >>> result.get()  #doctest: +SKIP
    Traceback (most recent call last):
    ...
    KeyError: 'Non-existing cache value (may have been cleared).\nFile ... does not exist'

A `MemorizedResult` instance contains all that is necessary to read
the cached value. It can be pickled for transmission or storage, and
the printed representation can even be copy-pasted to a different
Python interpreter.

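For instance, a reference can be pickled in one place and resolved in another,
as long as both sides see the same cache directory (a self-contained sketch
with a hypothetical `costly` function and a temporary cache location):

```python
import pickle
import shutil
import tempfile

from joblib import Memory

cache_location = tempfile.mkdtemp()
memory4 = Memory(cache_location, verbose=0)

@memory4.cache
def costly(x):
    return x * x

reference = costly.call_and_shelve(6)   # compute once, keep a reference
payload = pickle.dumps(reference)       # small: the data itself stays on disk

# ... e.g. on a worker sharing the filesystem: rebuild the reference and read
value = pickle.loads(payload).get()
print(value)                            # 36

shutil.rmtree(cache_location, ignore_errors=True)
```
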
.. topic:: Shelving when cache is disabled

    In the case where caching is disabled (e.g.
    `Memory(None)`), the `call_and_shelve` method returns a
    `NotMemorizedResult` instance, which stores the full function
    output instead of just a reference (since there is nothing to
    point to). All the above remains valid though, except for the
    copy-pasting feature.

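A minimal sketch of this disabled-cache behaviour (the `triple` function is a
hypothetical example):

```python
from joblib import Memory

memory_off = Memory(None, verbose=0)   # no cache location: caching is disabled

@memory_off.cache
def triple(x):
    return 3 * x

ref = triple.call_and_shelve(5)
print(type(ref).__name__)   # NotMemorizedResult
print(ref.get())            # 15
```
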
Gotchas
--------

* **Across sessions, function cache is identified by the function's name**.
  Thus, if you assign the same name to different functions, their caches will
  override each other (i.e. there are 'name collisions'), and unwanted re-runs
  will happen::

    >>> @memory.cache
    ... def func(x):
    ...     print('Running func(%s)' % x)

    >>> func2 = func

    >>> @memory.cache
    ... def func(x):
    ...     print('Running a different func(%s)' % x)

  As long as the same session is used, there are no collisions (in joblib
  0.8 and above), although joblib does warn you that you are doing something
  dangerous::

    >>> func(1)
    Running a different func(1)

    >>> # FIXME: The next line should create a JobLibCollisionWarning but does not
    >>> # memory.rst:0: JobLibCollisionWarning: Possible name collisions between functions 'func' (<doctest memory.rst>:...) and 'func' (<doctest memory.rst>:...)
    >>> func2(1)  #doctest: +ELLIPSIS
    Running func(1)

    >>> func(1) # No recomputation so far
    >>> func2(1) # No recomputation so far

  ..
     Empty the in-memory cache to simulate exiting and reloading the
     interpreter

     >>> import joblib.memory
     >>> joblib.memory._FUNCTION_HASHES.clear()

  But if the interpreter is exited and then restarted, the cache will not
  be identified properly, and the functions will be re-run::

    >>> # FIXME: The next line should create a JobLibCollisionWarning but does not. It is also skipped because it does not produce any output
    >>> # memory.rst:0: JobLibCollisionWarning: Possible name collisions between functions 'func' (<doctest memory.rst>:...) and 'func' (<doctest memory.rst>:...)
    >>> func(1) #doctest: +ELLIPSIS +SKIP
    Running a different func(1)
    >>> func2(1)  #doctest: +ELLIPSIS +SKIP
    Running func(1)

  As long as the same session is used, there is no needless
  recomputation::

    >>> func(1) # No recomputation now
    >>> func2(1) # No recomputation now

* **lambda functions**

  Beware that with Python 2.7, lambda functions cannot be told apart::

    >>> def my_print(x):
    ...     print(x)

    >>> f = memory.cache(lambda : my_print(1))
    >>> g = memory.cache(lambda : my_print(2))

    >>> f()
    1
    >>> f()
    >>> g() # doctest: +SKIP
    memory.rst:0: JobLibCollisionWarning: Cannot detect name collisions for function '<lambda>'
    2
    >>> g() # doctest: +SKIP
    >>> f() # doctest: +SKIP
    1

* **`Memory` cannot be used on some complex objects**, e.g. a callable
  object with a `__call__` method.

  However, it works on numpy ufuncs::

    >>> sin = memory.cache(np.sin)
    >>> print(sin(0))
    0.0

* **caching methods: memory is designed for pure functions and it is
  not recommended to use it for methods**. If one wants to use caching
  inside a class, the recommended pattern is to cache a pure function
  and use the cached function inside your class, i.e. something like
  this::

    @memory.cache
    def compute_func(arg1, arg2, arg3):
        # long computation
        return result


    class Foo(object):
        def __init__(self, args):
            self.data = None

        def compute(self):
            self.data = compute_func(self.arg1, self.arg2, 40)


  Using ``Memory`` for methods is not recommended and has some caveats
  that make it very fragile from a maintenance point of view, because
  it is very easy to forget about these caveats as the software
  evolves. If this cannot be avoided (we would be interested in
  your use case, by the way), here are a few known caveats:

  1. a method cannot be decorated at class definition time,
     because when the class is instantiated, the first argument (self) is
     *bound*, and no longer accessible to the `Memory` object. The
     following code won't work::

       class Foo(object):

           @memory.cache  # WRONG
           def method(self, args):
               pass

     The right way to do this is to decorate at instantiation time::

       class Foo(object):

           def __init__(self, args):
               self.method = memory.cache(self.method)

           def method(self, ...):
               pass

  2. The cached method will have ``self`` as one of its
     arguments. That means that the result will be recomputed if
     anything in ``self`` changes. For example, if ``self.attr`` has
     changed, calling ``self.method`` will recompute the result even if
     ``self.method`` does not use ``self.attr`` in its body. Another
     example is changing ``self`` inside the body of
     ``self.method``. The consequence is that ``self.method`` will
     create cache entries that will not be reused in subsequent calls. To
     alleviate these problems, if you *know* that the result of
     ``self.method`` does not depend on ``self``, you can use
     ``self.method = memory.cache(self.method, ignore=['self'])``.

* **joblib cache entries may be invalidated after environment updates**.
  Values returned by ``joblib.hash`` are not guaranteed to stay
  constant across ``joblib`` versions. This means that **all** entries of a
  ``joblib.Memory`` cache can get invalidated when upgrading ``joblib``.
  Invalidation can also happen when upgrading a third party library (such as
  ``numpy``): in such a case, only the cached function calls with parameters
  that are constructs (or contain references to constructs) defined in the
  upgraded library should potentially be invalidated after the upgrade.

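If caching a method really cannot be avoided, the ``ignore=['self']`` pattern
from the method-caching caveats above looks like this in full (a sketch with a
temporary cache directory and a hypothetical ``Simulation`` class; only valid
when the result truly does not depend on ``self``):

```python
import tempfile

from joblib import Memory

memory5 = Memory(tempfile.mkdtemp(), verbose=0)

class Simulation(object):
    def __init__(self):
        self.n_calls = 0
        # decorate at instantiation time, ignoring the bound ``self``
        self.run = memory5.cache(self.run, ignore=['self'])

    def run(self, x):
        self.n_calls += 1
        return x ** 2

sim = Simulation()
first = sim.run(4)    # computed: the method body executes once
second = sim.run(4)   # loaded from the cache: the body is not executed again
print(first, second, sim.n_calls)
```
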

Ignoring some arguments
-----------------------

It may be useful not to recalculate a function when certain arguments
change, for instance a debug flag. `Memory` provides the `ignore` list::

    >>> @memory.cache(ignore=['debug'])
    ... def my_func(x, debug=True):
    ...     print('Called with x = %s' % x)
    >>> my_func(0)
    Called with x = 0
    >>> my_func(0, debug=False)
    >>> my_func(0, debug=True)
    >>> # my_func was not re-evaluated


.. _memory_reference:

Reference documentation of the `Memory` class
---------------------------------------------

.. autoclass:: Memory
    :members: __init__, cache, eval, clear

Useful methods of decorated functions
-------------------------------------

Functions decorated by :meth:`Memory.cache` are :class:`MemorizedFunc`
objects that, in addition to behaving like normal functions, expose
methods useful for cache exploration and management.

.. autoclass:: MemorizedFunc
    :members: __init__, call, clear, check_call_in_cache


..
 Let us not forget to clean our cache dir once we are finished::

    >>> import shutil
    >>> try:
    ...     shutil.rmtree(cachedir)
    ...     shutil.rmtree(cachedir2)
    ... except OSError:
    ...     pass  # this can sometimes fail under Windows
