<%@meta language="R-vignette" content="--------------------------------
%\VignetteIndexEntry{A Future for R: Best Practices for Package Developers}
%\VignetteAuthor{Henrik Bengtsson}
%\VignetteKeyword{R}
%\VignetteKeyword{package}
%\VignetteKeyword{vignette}
%\VignetteKeyword{future}
%\VignetteKeyword{promise}
%\VignetteEngine{R.rsp::rsp}
%\VignetteTangle{FALSE}
--------------------------------------------------------------------"%>
<%
library("R.utils")
`%<-%` <- future::`%<-%`
options("withCapture/newline" = FALSE)
%>

# <%@meta name="title"%>

Using futures in package code is not much different from writing other types of package code, or from using futures in R scripts.  However, there are a few things that are useful to know in order to minimize the risk of surprises for the end-user.


## The future smell test

By far the most common future strategy is to parallelize on the local machine, e.g. `plan(multisession)`.  This is good enough in most situations, but note that some end-users have access to multiple machines and might want to run your code on all of them to speed it up beyond what a single machine can do.  Because of this, avoid as far as possible assuming that your code will only run on the local machine.  A good "smell test" is to ask yourself:

 _\- Will my future code work if it ends up running on the other side of the world?_

Regardless of performance, if you answer "Yes", you have already embraced the core philosophy of the future framework.  If you answer "Maybe" or "No", see if you can rewrite it.

For instance, if your future code assumes that it has access to the local file system, as in:

```r
f <- future({
  data <- read_tsv(file)
  analyze(data)
})
```

you can rewrite the code to load the content of the file before you set up the future, as in:

```r
data <- read_tsv(file)
f <- future({
  analyze(data)
})
```

Similarly, we should avoid having the future code write to the file system it happens to run on, because the parent R session might not have access to that file system.
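One way to follow this advice is to let the future return its result as a value and leave any file output to the parent R session.  The following is a minimal sketch, assuming a hypothetical `analyze()` function and a made-up data frame; only `future()` and `value()` come from the future package.

```r
library(future)

## Any strategy should work here; 'sequential' is used to keep the sketch light
plan(sequential)

## Hypothetical stand-in for the real analysis
analyze <- function(data) colMeans(data)

## Made-up input data, already loaded in the parent R session
data <- data.frame(a = 1:3, b = 4:6)

## The future only computes and returns a value; it does not write any files
f <- future({
  analyze(data)
})
result <- value(f)

## The parent R session decides where the result is stored
outfile <- tempfile(fileext = ".rds")
saveRDS(result, file = outfile)
```

With this layout, it does not matter whether the worker can see the parent's file system, because only the parent reads and writes files.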

By keeping the future smell test in mind when writing future code, we increase the chances that the code can be parallelized in more ways than on just the local computer.  Properly written future code will work regardless of what future strategy the end-user picks, e.g.

```r
plan(sequential)
plan(multisession)
plan(cluster, workers = rep(c("n1.remote.org", "n2.remote.org", "n3.remote.org"), each = 32))
```

Remember, as developers we never know what compute resources the end-user has access to right now or what they will have access to in six months.  Who knows, your code might even end up running on 2,000 cores located on The Moon twenty years from now.



## Avoid changing the future strategy

For reasons like the ones mentioned above, refrain from setting `plan()` in a function.  It is better to leave it to the end-user to decide how they want to parallelize.  One reason for this is that we can never know how and in what context our code will run.  For example, the end-user might use futures to parallelize a function call in some other package, and that package calls your package internally.  If you set `plan(multisession)` internally without undoing it, you might mess up the `plan()` that is already set, breaking any further parallelization.

If you still think it is necessary to set `plan()`, make sure to undo it when the function exits, also on errors.  This can be done by using `on.exit()`, e.g.

```r
my_fcn <- function(x) {
  oplan <- plan(multisession)
  ## Restore the previous plan on exit, also if an error occurs
  on.exit(plan(oplan), add = TRUE)

  y <- analyze(x)
  summarize(y)
}
```
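To convince yourself that the previous plan is restored also when the function fails, you can trigger an error on purpose and inspect `plan()` afterward.  This is a minimal sketch under the assumption that the end-user started out with `plan(sequential)`; `my_fcn()` here is a hypothetical function that always fails.

```r
library(future)

## Hypothetical function that sets its own plan and then fails
my_fcn <- function(x) {
  oplan <- plan(multisession, workers = 2)
  on.exit(plan(oplan), add = TRUE)
  stop("analysis failed")
}

## The end-user's plan before calling the function
plan(sequential)

## Call the function and capture the error message
msg <- tryCatch(my_fcn(1), error = function(e) conditionMessage(e))

## The end-user's plan should be back in place
restored <- inherits(plan(), "sequential")
message("Error caught: ", msg)
message("Plan restored to sequential: ", restored)
```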


The need for setting the future strategy within a function often comes from developers wanting to add an argument to their function that allows the end-user to specify whether they want to run the function in parallel or sequentially.  This often results in code like:

```r
my_fcn <- function(x, parallel = FALSE) {
  if (parallel) {
    oplan <- plan(multisession)
    on.exit(plan(oplan), add = TRUE)
    y <- future_lapply(x, FUN = analyze) ## from future.apply package
  } else {
    y <- lapply(x, FUN = analyze)
  }
  summarize(y)
}
```

This way the user can use:

```r
y <- my_fcn(x, parallel = FALSE)
```

or

```r
y <- my_fcn(x, parallel = TRUE)
```

depending on their needs.  However, if another package developer decides to call your function in their function, they now have to expose that `parallel` argument to the users of their function, e.g.

```r
their_fcn <- function(x, parallel = FALSE) {
  x2 <- preprocess(x)
  y <- my_fcn(x2, parallel = parallel)
  z <- another_fcn(y)
  z
}
```

Exposing and passing a "parallel" argument along can become quite cumbersome.  Instead, it is neater to use:

```r
my_fcn <- function(x) {
  y <- future_lapply(x, FUN = analyze) ## from future.apply package
  summarize(y)
}
```

and let the user control whether or not they want to parallelize via `plan()`, e.g. `plan(multisession)` and `plan(sequential)`.
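As an illustration of how this plays out for a downstream package, here is a small sketch where neither function takes a `parallel` argument and the end-user switches strategy from the outside.  The bodies of `analyze()`, `summarize()`, and `their_fcn()` are made-up stand-ins, and `workers = 2` is an arbitrary choice.

```r
library(future)
library(future.apply)

## Hypothetical stand-ins for the real analysis steps
analyze <- function(xi) xi^2
summarize <- function(y) sum(unlist(y))

## Your package: no 'parallel' argument, no call to plan()
my_fcn <- function(x) {
  y <- future_lapply(x, FUN = analyze)
  summarize(y)
}

## Another package calling yours: nothing to pass along
their_fcn <- function(x) {
  x2 <- x + 1  ## stand-in for preprocess()
  my_fcn(x2)
}

## The end-user decides how to run it
plan(sequential)
z_seq <- their_fcn(1:4)

plan(multisession, workers = 2)
z_par <- their_fcn(1:4)

## Shut down the parallel workers again
plan(sequential)

message("Same result with both strategies: ", isTRUE(all.equal(z_seq, z_par)))
```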


## Writing examples

If your example sets the future strategy at the beginning, make sure to reset the future strategy to `plan(sequential)` at the end of the example.  The reason for this is that when switching plans, the previous one is cleaned up.  This is particularly important for multisession and cluster futures, where `plan(sequential)` will shut down the underlying PSOCK clusters.

For instance, here is an example:

```r
## Run the analysis in parallel on the local computer
future::plan("multisession")

y <- analyze("path/to/file.csv")

## Shut down parallel workers
future::plan("sequential")
```

If you forget to shut down the PSOCK cluster, then `R CMD check --as-cran`, or `R CMD check` with environment variable `_R_CHECK_CONNECTIONS_LEFT_OPEN_=true` set, will produce an error such as:

```sh
$ R CMD check --as-cran mypkg_1.0.tar.gz
...
* checking examples ... ERROR
Running examples in 'analyze-Ex.R' failed
...
> cleanEx()
Error: connections left open:
      <-localhost:37400 (sockconn)
      <-localhost:37400 (sockconn)
Execution halted
```
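To see what the check is complaining about, you can count the open connections before and after resetting the plan.  This is a minimal sketch, assuming that the multisession workers show up in base R's `showConnections()`; the absolute counts depend on what else is open in your session, so only the difference is of interest.

```r
library(future)

## Launch two local background workers
plan(multisession, workers = 2)

## Resolve one future to make sure the workers are up and running
v <- value(future(42))

## Count user-level connections while the workers are alive
n_open <- nrow(showConnections())

## Switching back to sequential shuts down the workers ...
plan(sequential)

## ... which should also close their connections
n_after <- nrow(showConnections())

message("Open connections with workers running: ", n_open)
message("Open connections after plan(sequential): ", n_after)
```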


If, for some reason, you do not want to display the reset of the future strategy in the help documentation but still want it to run, wrap the statement in an Rd `\dontshow{}` macro.
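For packages documented with roxygen2, this could look like the sketch below.  The function `analyze()`, its body, and the file path are hypothetical placeholders; the point is only where `\dontshow{}` goes in the examples section.

```r
#' Analyze a data file (hypothetical function)
#'
#' @param file Path to a CSV file.
#' @return A summary of the file content.
#'
#' @examples
#' ## Run the analysis in parallel on the local computer
#' future::plan("multisession")
#'
#' y <- analyze("path/to/file.csv")
#'
#' \dontshow{
#' ## Shut down parallel workers; this is run but hidden from the help page
#' future::plan("sequential")
#' }
#'
#' @export
analyze <- function(file) {
  ## Placeholder body; a real implementation would use futures internally
  data <- utils::read.csv(file)
  summary(data)
}
```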


## Testing a package that relies on futures

If you want to make sure your code works when running sequentially as well as when running in parallel, it is often good enough to have package tests that run the code with:

```r
plan(multisession)
```

If the code works with this setup, you can be sure that all global variables are properly identified and exported to the workers and that the required packages are loaded on the workers.
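A package test along these lines could loop over the strategies of interest and compare each result against a reference value.  The following is a sketch only; `my_fcn()` is a hypothetical function under test, and plain `stopifnot()` is used in place of any particular testing framework.

```r
library(future)
library(future.apply)

## Hypothetical package function under test
my_fcn <- function(x) {
  y <- future_lapply(x, FUN = function(xi) xi^2)
  sum(unlist(y))
}

## Reference value computed without futures
expected <- sum((1:10)^2)

## Run the same test with each future strategy
results <- list()
for (strategy in c("sequential", "multisession")) {
  plan(strategy)
  results[[strategy]] <- my_fcn(1:10)
  stopifnot(all.equal(results[[strategy]], expected))
}

## Shut down any parallel workers when done
plan(sequential)
```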

If not all of your tests are written this way, you can set environment variable `R_FUTURE_PLAN=multisession` before you call `R CMD check`.  This changes the default future strategy from 'sequential' to 'multisession'.  For example,

```sh
$ export R_FUTURE_PLAN=multisession
$ R CMD check --as-cran mypkg_1.0.tar.gz
```
