-   [cachem](#cachem)
    -   [Installation](#installation)
    -   [Usage](#usage)
    -   [Cache types](#cache-types)
        -   [`cache_mem()`](#cache_mem)
        -   [`cache_disk()`](#cache_disk)
    -   [Cache API](#cache-api)
    -   [Pruning](#pruning)
    -   [Layered caches](#layered-caches)

<!-- README.md is generated from README.Rmd. Please edit that file -->

# cachem

<!-- badges: start -->

[![R build
status](https://github.com/r-lib/cachem/workflows/R-CMD-check/badge.svg)](https://github.com/r-lib/cachem/actions)
<!-- badges: end -->

The **cachem** R package provides objects for creating and managing
caches. These cache objects are key-value stores, but unlike other basic
key-value stores, they have built-in support for memory and age limits
so that they won’t have unbounded growth.

The cache objects in **cachem** differ from some other key-value stores
in the following ways:

-   The cache objects provide automatic pruning so that they remain
    within memory limits.
-   Fetching a nonexistent object returns a sentinel value. An
    alternative is to simply return `NULL`. This is what R lists and
    environments do, but it is ambiguous whether the value really is
    `NULL`, or if it is not present. Another alternative is to throw an
    exception when fetching a nonexistent object. However, this results
    in more complicated code, as every `get()` needs to be wrapped in a
    `tryCatch()`.

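The ambiguity described in the second point can be seen in a short sketch. It first shows that a plain list gives the same answer for a stored `NULL` and for a missing element, and then shows that a cache miss is distinguishable from `NULL`. The key name `never_set` and the variable names are only illustrative.

``` r
library(cachem)

# A plain list gives the same answer for a stored NULL and a missing element.
lst <- list(a = NULL)
stored_is_null  <- is.null(lst[["a"]])    # TRUE: the stored value is NULL
missing_is_null <- is.null(lst[["zzz"]])  # TRUE: no such element

# A cache miss returns a sentinel object, which is not NULL.
m <- cache_mem()
miss <- m$get("never_set")
miss_is_sentinel <- is.key_missing(miss)  # TRUE
miss_is_null     <- is.null(miss)         # FALSE
```
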
## Installation

To install the CRAN version:

``` r
install.packages("cachem")
```

You can install the development version from GitHub with:

``` r
if (!require("remotes")) install.packages("remotes")
remotes::install_github("r-lib/cachem")
```

56
57To create a memory-based cache, call `cache_mem()`.
58
59``` r
60library(cachem)
61m <- cache_mem()
62```
63
64Add arbitrary R objects to the cache using `$set(key, value)`:
65
66``` r
67m$set("abc123", c("Hello", "world"))
68m$set("xyz", function() message("Goodbye"))
69```
70
71The `key` must be a string consisting of lowercase letters, numbers, and
72the underscore (`_`) and hyphen (`-`) characters. (Upper-case characters
73are not allowed because some storage backends do not distinguish between
74lowercase and uppercase letters.) The `value` can be any R object.
75
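As a rough illustration of the key rule above, the following sketch checks a candidate key with a regular expression. The helper `is_valid_key()` is hypothetical and is not part of cachem; the package performs its own validation, which may differ in detail.

``` r
# Hypothetical helper (not part of cachem): TRUE if `key` is a single string
# made up only of lowercase letters, digits, underscores, and hyphens.
is_valid_key <- function(key) {
  is.character(key) && length(key) == 1 && !is.na(key) &&
    grepl("^[a-z0-9_-]+$", key, perl = TRUE)
}

is_valid_key("abc123")    # TRUE
is_valid_key("my-key_1")  # TRUE
is_valid_key("ABC")       # FALSE: uppercase letters
is_valid_key("a b")       # FALSE: contains a space
```
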
Get the values with `$get()`:

``` r
m$get("abc123")
#> [1] "Hello" "world"

m$get("xyz")
#> function() message("Goodbye")
```

If you call `get()` on a key that doesn’t exist, it will return a
`key_missing()` sentinel value:

``` r
m$get("dog")
#> <Key Missing>
```

A common usage pattern is to call `get()`, and then check if the result
is a `key_missing` object:

``` r
value <- m$get(key)

if (is.key_missing(value)) {
  # Cache miss - do something
} else {
  # Cache hit - do another thing
}
```

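One way to package this pattern is a small "get or compute" helper. This is only a sketch: `get_or_compute()`, `slow_square()`, and the key `sq4` are hypothetical names chosen for the illustration, not part of cachem.

``` r
library(cachem)

# Hypothetical helper: return the cached value for `key`; on a cache miss,
# call `compute()`, store the result, and return it.
get_or_compute <- function(cache, key, compute) {
  value <- cache$get(key)
  if (is.key_missing(value)) {
    value <- compute()
    cache$set(key, value)
  }
  value
}

m <- cache_mem()

# Count how often the "expensive" function actually runs.
n_calls <- 0
slow_square <- function() {
  n_calls <<- n_calls + 1
  4^2
}

first  <- get_or_compute(m, "sq4", slow_square)  # miss: computes and stores
second <- get_or_compute(m, "sq4", slow_square)  # hit: served from the cache
n_calls
#> [1] 1
```
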
The reason for doing this (instead of calling `$exists(key)` and then
`$get(key)`) is that for some storage backends, there is a potential
race condition: the object could be removed from the cache between the
`exists()` and `get()` calls. For example:

-   If multiple R processes have `cache_disk`s that share the same
    directory, one process could remove an object from the cache in
    between the `exists()` and `get()` calls in another process,
    resulting in an error.
-   If you use a `cache_mem` with a `max_age`, it’s possible for an
    object to be present when you call `exists()`, but for its age to
    exceed `max_age` by the time `get()` is called. In that case, the
    `get()` will return a `key_missing()` object.

``` r
# Avoid this pattern, due to a potential race condition!
if (m$exists(key)) {
  value <- m$get(key)
}
```

## Cache types

**cachem** comes with two kinds of cache objects: a memory cache, and a
disk cache.

### `cache_mem()`

The memory cache stores objects in memory, by simply keeping a
reference to each object. To create a memory cache:

``` r
m <- cache_mem()
```

The default size of the cache is 200MB, but this can be customized with
`max_size`:

``` r
m <- cache_mem(max_size = 10 * 1024^2)
```

It may also be useful to set a maximum age of objects. For example, if
you only want objects to stay for a maximum of one hour:

``` r
m <- cache_mem(max_size = 10 * 1024^2, max_age = 3600)
```

For more about how objects are evicted from the cache, see section
[Pruning](#pruning) below.

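To make the effect of `max_age` concrete, here is a minimal sketch with a deliberately short one-second limit. The cache name `m_short` and the key `greeting` are only illustrative.

``` r
library(cachem)

# A cache whose entries expire one second after being set.
m_short <- cache_mem(max_age = 1)
m_short$set("greeting", "hello")

fresh <- m_short$get("greeting")   # still within max_age: "hello"

Sys.sleep(1.5)                     # wait until the entry is older than max_age

expired <- m_short$get("greeting")
is.key_missing(expired)
#> [1] TRUE
```
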
An advantage that the memory cache has over the disk cache (and any
other type of cache that stores the objects outside of the R process’s
memory), is that it does not need to serialize objects. Instead, it
merely stores references to the objects. This means that it can store
objects that other caches cannot, and with more efficient use of memory:
if two objects in the cache share some of their contents (such that
they refer to the same sub-object in memory), then `cache_mem` will not
create duplicate copies of the contents, as `cache_disk` would, since it
serializes the objects with the `serialize()` function.

Compared to the memory usage, the size *calculation* is not as
intelligent: if there are two objects that share contents, their sizes
are computed separately, even if they have items that share the exact
same representation in memory. This is done with the `object.size()`
function, which does not account for multiple references to the same
object in memory.

In short, a memory cache, if anything, over-counts the amount of memory
actually consumed. In practice, this means that if you set a 200MB limit
to the size of cache, and the cache *thinks* it has 200MB of contents,
the actual amount of memory consumed could be less than 200MB.

<details>
<summary>
Demonstration of memory over-counting from `object.size()`
</summary>

``` r
# Create a and b which both contain the same numeric vector.
x <- list(rnorm(1e5))
a <- list(1, x)
b <- list(2, x)

# Add to cache
m$set("a", a)
m$set("b", b)

# Each object is about 800kB in memory, so the cache_mem() will consider the
# total memory used to be 1600kB.
object.size(m$get("a"))
#> 800224 bytes
object.size(m$get("b"))
#> 800224 bytes
```

For reference, `lobstr::obj_size()` can detect shared objects, and knows
that these objects share most of their memory.

``` r
lobstr::obj_size(m$get("a"))
#> 800,224 B
lobstr::obj_size(list(m$get("a"), m$get("b")))
#> 800,408 B
```

However, lobstr is not on CRAN, and if `obj_size()` were used to find the
incremental memory used when an object was added to the cache, it would
have to walk all objects in the cache every time a single object is
added. For these reasons, `cache_mem` uses `object.size()` to compute the
object sizes.

</details>

### `cache_disk()`

Disk caches are stored in a directory on disk. A disk cache is slower
than a memory cache, but can generally be larger. To create one:

``` r
d <- cache_disk()
```

By default, it creates a subdirectory of the R process’s temp directory,
and it will persist until the R process exits.

``` r
d$info()$dir
#> [1] "/tmp/Rtmp6h5iB3/cache_disk-d1901b2b615a"
```

As with a `cache_mem`, `max_size`, `max_n`, and `max_age` can be
customized. See section [Pruning](#pruning) below for more information.

Each object in the cache is stored as an RDS file on disk, using the
`serialize()` function.

``` r
d$set("abc", 100)
d$set("x01", list(1, 2, 3))

dir(d$info()$dir)
#> [1] "abc.rds" "x01.rds"
```

Since objects in a disk cache are serialized, they are subject to the
limitations of the `serialize()` function. For more information, see
section [Limitations of serialized
objects](#limitations-of-serialized-objects).

The storage directory can be specified with `dir`; it will be created if
necessary.

``` r
cache_disk(dir = "cachedir")
```

#### Sharing a disk cache among processes

Multiple R processes can use `cache_disk` objects that share the same
cache directory. To do this, simply point each `cache_disk` to the same
directory.

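The sketch below imitates this setup inside a single R session: two separate `cache_disk` objects stand in for two processes. The directory name `shared-cache-demo` and the variable names are assumptions made for the example; in real use each process would create its own object pointing at an agreed-upon path.

``` r
library(cachem)

# Both cache objects point at the same directory.
shared_dir <- file.path(tempdir(), "shared-cache-demo")
writer <- cache_disk(dir = shared_dir)
reader <- cache_disk(dir = shared_dir)

# A value written through one object is visible through the other.
writer$set("result", 42)
shared_value <- reader$get("result")
shared_value
#> [1] 42

# Clean up the demo directory.
unlink(shared_dir, recursive = TRUE)
```
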
#### `cache_disk` pruning

For a `cache_disk`, pruning does not happen on every access, because
finding the size of files in the cache directory can take a nontrivial
amount of time. By default, pruning happens once every 20 times that
`$set()` is called, or if at least five seconds have elapsed since the
last pruning. The `prune_rate` parameter controls how many times
`$set()` must be called before a pruning occurs. It defaults to 20;
smaller values result in more frequent pruning and larger values result
in less frequent pruning (but keep in mind pruning always occurs if it
has been at least five seconds since the last pruning).

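As a sketch of how throttled pruning plays out, the example below uses a large `prune_rate` so that automatic pruning is unlikely to run during the loop, and then prunes manually. The values `max_n = 5` and `prune_rate = 50` and the key names are arbitrary choices for the demonstration.

``` r
library(cachem)

# Allow at most 5 items, but only prune automatically every 50 calls to $set().
d <- cache_disk(max_n = 5, prune_rate = 50)

for (i in 1:10) {
  d$set(paste0("item", i), i)
}

# Between prunings the cache may temporarily hold more than max_n items.
n_before <- d$size()

# A manual prune brings the cache back within its limits.
d$prune()
n_after <- d$size()

d$destroy()
```

After the manual `$prune()`, `n_after` should be no greater than `max_n`, whatever `n_before` was.
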
#### Cleaning up the cache directory

The cache directory can be deleted by calling `$destroy()`. After it is
destroyed, the cache object can no longer be used.

``` r
d$destroy()
d$set("a", 1)  # Error
```

To create a `cache_disk` that will automatically delete its storage
directory when garbage collected, use `destroy_on_finalize = TRUE`:

``` r
d <- cache_disk(destroy_on_finalize = TRUE)
d$set("a", 1)

cachedir <- d$info()$dir
dir(cachedir)
#> [1] "a.rds"

# Remove reference to d and trigger a garbage collection
rm(d)
gc()

dir.exists(cachedir)
```

## Cache API

`cache_mem()` and `cache_disk()` support all of the methods listed
below. If you want to create a compatible caching object, it must have
at least the `get()` and `set()` methods:

-   `get(key, missing = missing_)`: Get the object associated with
    `key`. The `missing` parameter allows customized behavior if the key
    is not present: it is an expression that is evaluated only when
    there is a cache miss, and it can return a value or throw an
    error.
-   `set(key, value)`: Set a key to a value.
-   `exists(key)`: Check whether a particular key exists in the cache.
-   `remove(key)`: Remove a key-value pair from the cache.

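The `missing` argument of `get()` can be illustrated with a short sketch. The key `absent` and the fallback values are only examples; any expression can be supplied.

``` r
library(cachem)

m <- cache_mem()

# Return NULL or a default value instead of the key_missing() sentinel.
v_null    <- m$get("absent", missing = NULL)
v_default <- m$get("absent", missing = "default")

# Throw an error on a miss. The expression is evaluated only when the key
# is not present.
err_msg <- tryCatch(
  m$get("absent", missing = stop("key not found")),
  error = function(e) conditionMessage(e)
)
err_msg
#> [1] "key not found"
```
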
Some optional methods:

-   `reset()`: Clear all objects from the cache.
-   `keys()`: Return a character vector of all keys in the cache.
-   `prune()`: Prune the cache. (Some types of caches may not prune on
    every access, and may temporarily grow past their limits, until the
    next pruning is triggered automatically, or manually with this
    function.)
-   `size()`: Return the number of objects in the cache.

For these methods:

-   `key`: can be any string with lowercase letters, numbers, underscore
    (`_`) and hyphen (`-`). Some storage backends may not handle very
    long keys well. For example, with a `cache_disk()`, the key is used
    as a filename, and on some filesystems, very long filenames may hit
    limits on path lengths.
-   `value`: can be any R object, with some exceptions noted below.

#### Limitations of serialized objects

For any cache that serializes the object for storage outside of the R
process (in other words, any cache other than a `cache_mem()`), some
types of objects will not save and restore as well. Notably, reference
objects may consume more memory when restored, since R may not know to
deduplicate shared objects. External pointers cannot be
serialized, since they point to memory in the R process. See
`?serialize` for more information.

#### Read-only caches

It is possible to create a read-only cache by making the `set()`,
`remove()`, `reset()`, and `prune()` methods into no-ops. This can be
useful if sharing a cache with another R process which can write to the
cache. For example, one (or more) processes can write to the cache, and
other processes can read from it.

The following function wraps a cache object in a read-only wrapper.
Note, however, that code that uses such a cache must not require that
`$set()` actually sets a value in the cache. This is good practice
anyway, because with these cache objects, items can be pruned from them
at any time.

``` r
# Read methods are passed through to the wrapped cache; methods that would
# modify the cache are replaced with no-ops.
cache_readonly_wrap <- function(cache) {
  structure(
    list(
      get = cache$get,
      set = function(key, value) NULL,
      exists = cache$exists,
      keys = cache$keys,
      remove = function(key) NULL,
      reset = function() NULL,
      prune = function() NULL,
      size = cache$size
    ),
    class = c("cache_readonly", class(cache))
  )
}

mr <- cache_readonly_wrap(m)
```

## Pruning

The cache objects provided by cachem have automatic pruning. (Note that
pruning is not required by the API, so one could implement an
API-compatible cache without pruning.)

This section describes how pruning works for `cache_mem()` and
`cache_disk()`.

When the cache object is created, the maximum size (in bytes) is
specified by `max_size`. When the size of objects in the cache exceeds
`max_size`, objects will be pruned from the cache.

When objects are pruned from the cache, which ones are removed is
determined by the eviction policy, `evict`:

-   **`lru`**: The least-recently-used objects will be removed from the
    cache, until it fits within the limit. This is the default and is
    appropriate for most cases.
-   **`fifo`**: The oldest objects will be removed first.

It is also possible to set the maximum number of items that can be in
the cache, with `max_n`. By default this is set to `Inf`, or no limit.

The `max_age` parameter is somewhat different from `max_size` and
`max_n`. The latter two set limits on the cache store as a whole,
whereas `max_age` sets limits for each individual item; for each item,
if its age exceeds `max_age`, then it will be removed from the cache.

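The difference between the two eviction policies can be sketched with a two-item memory cache. The helper `fill()` is hypothetical and exists only for this illustration; the short `Sys.sleep()` calls are there to keep the timestamps of the operations distinct.

``` r
library(cachem)

# Hypothetical helper: add "a" and "b", read "a" again, then add "c" to a
# cache that can hold only two items, and report which keys remain.
fill <- function(cache) {
  cache$set("a", 1); Sys.sleep(0.05)
  cache$set("b", 2); Sys.sleep(0.05)
  cache$get("a");    Sys.sleep(0.05)  # "a" is now the most recently used
  cache$set("c", 3)                   # a third item exceeds max_n = 2
  cache$prune()                       # make sure limits are enforced
  sort(cache$keys())
}

# lru: "b" is the least recently used item, so it is expected to be evicted.
lru_keys <- fill(cache_mem(max_n = 2, evict = "lru"))
lru_keys
#> [1] "a" "c"

# fifo: "a" is the oldest item, so it is expected to be evicted.
fifo_keys <- fill(cache_mem(max_n = 2, evict = "fifo"))
fifo_keys
#> [1] "b" "c"
```
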
## Layered caches

Multiple caches can be composed into a single cache, using
`cache_layered()`. This can be used to create a multi-level cache. (Note
that `cache_layered()` is currently experimental.) For example, we can
create a layered cache with a very fast 100MB memory cache and a larger
but slower 2GB disk cache:

``` r
m <- cache_mem(max_size = 100 * 1024^2)
d <- cache_disk(max_size = 2 * 1024^3)

cl <- cache_layered(m, d)
```

The layered cache will have the same API, with `$get()`, `$set()`, and
so on, so it can be used interchangeably with other caching objects.

For this example, we’ll recreate the `cache_layered` with logging
enabled, so that it will show cache hits and misses.

``` r
cl <- cache_layered(m, d, logfile = stderr())

# Each of the objects generated by rnorm() is about 40 MB
cl$set("a", rnorm(5e6))
cl$set("b", rnorm(5e6))
cl$set("c", rnorm(5e6))

# View the objects in each of the component caches
m$keys()
#> [1] "c" "b"
d$keys()
#> [1] "a" "b" "c"

# The layered cache reports having all keys
cl$keys()
#> [1] "c" "b" "a"
```

When `$get()` is called, it searches the first cache, and if it’s
missing there, it searches the next cache, and so on. If not found in
any caches, it returns `key_missing()`.

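The lookup order can also be tried without large objects or a disk cache. The sketch below layers two memory caches, where the first is limited to a single item so that older keys survive only in the second layer. The names `small`, `large`, and `layered` are assumptions for this illustration.

``` r
library(cachem)

small   <- cache_mem(max_n = 1)   # first layer: holds only one item
large   <- cache_mem()            # second layer: default limits
layered <- cache_layered(small, large)

layered$set("a", 1)
layered$set("b", 2)

# "a" has been pushed out of the first layer, but the layered cache still
# finds it by falling through to the second layer.
value_a <- layered$get("a")
value_a
#> [1] 1

# A key that is in neither layer yields the key_missing() sentinel.
miss <- layered$get("zzz")
is.key_missing(miss)
#> [1] TRUE
```
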
``` r
# Get object that exists in the memory cache
x <- cl$get("c")
#> [2020-10-23 13:11:09.985] cache_layered Get: c
#> [2020-10-23 13:11:09.985] cache_layered Get from cache_mem... hit

# Get object that doesn't exist in the memory cache
x <- cl$get("a")
#> [2020-10-23 13:13:10.968] cache_layered Get: a
#> [2020-10-23 13:13:10.969] cache_layered Get from cache_mem... miss
#> [2020-10-23 13:13:11.329] cache_layered Get from cache_disk... hit

# Object is not present in any component caches
cl$get("d")
#> [2020-10-23 13:13:40.197] cache_layered Get: d
#> [2020-10-23 13:13:40.197] cache_layered Get from cache_mem... miss
#> [2020-10-23 13:13:40.198] cache_layered Get from cache_disk... miss
#> <Key Missing>
```

Multiple cache objects can be layered this way. You could even add a
cache which uses a remote store, such as a network file system or even
AWS S3.
