1---
2title: "Installing the Arrow Package on Linux"
3output: rmarkdown::html_vignette
4vignette: >
5  %\VignetteIndexEntry{Installing the Arrow Package on Linux}
6  %\VignetteEngine{knitr::rmarkdown}
7  %\VignetteEncoding{UTF-8}
8---
9
10On macOS and Windows, when you `install.packages("arrow")`,
11you get a binary package that contains Arrow’s C++ dependencies along with it.
12On Linux, `install.packages()` retrieves a source package that has to be compiled locally,
13and C++ dependencies need to be resolved as well.
14Generally for R packages with C++ dependencies,
15this requires either installing system packages, which you may not have privileges to do,
16or building the C++ dependencies separately,
17which introduces all sorts of additional ways for things to go wrong.
18
19Our goal is to make `install.packages("arrow")` "just work" for as many Linux distributions,
20versions, and configurations as possible.
21This document describes how it works and the options for fine-tuning Linux installation.
22The intended audience for this document is `arrow` R package users on Linux, not developers.
If you're contributing to the Arrow project, see `vignette("developing", package = "arrow")` for guidance on setting up your development environment.
24
25Note also that if you use `conda` to manage your R environment, this document does not apply.
26You can `conda install -c conda-forge --strict-channel-priority r-arrow` and you'll get the latest official
27release of the R package along with any C++ dependencies.
28
29> Having trouble installing `arrow`? See the "Troubleshooting" section below.
30
31# Installation basics
32
33Install the latest release of `arrow` from CRAN with
34
35```r
36install.packages("arrow")
37```
38
39Daily development builds, which are not official releases,
40can be installed from the Ursa Labs repository:
41
42```r
43install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
44```
45
46or for conda users via:
47
48```
49conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
50```
51
52You can also install the R package from a git checkout:
53
54```shell
55git clone https://github.com/apache/arrow
56cd arrow/r
57R CMD INSTALL .
58```
59
60If you don't already have the Arrow C++ libraries on your system,
61when installing the R package from source, it will also download and build
62the Arrow C++ libraries for you. To speed installation up, you can set
63
64```shell
65export LIBARROW_BINARY=true
66```
67
68to look for C++ binaries prebuilt for your Linux distribution/version.
69Alternatively, you can set
70
71```shell
72export LIBARROW_MINIMAL=false
73```
74
75to build the Arrow libraries from source with optional features such as compression libraries
76enabled. This will increase the build time but provides many useful features.
77Prebuilt binaries are built with this flag enabled, so you get the full
78functionality by using them as well.
79
80Both of these variables are also set this way if you have the `NOT_CRAN=true`
81environment variable set.
82
83## Helper function: install_arrow()
84
85If you already have `arrow` installed and want to upgrade to a different version,
86install a development build, or try to reinstall and fix issues with Linux
87C++ binaries, you can call `install_arrow()`.
88`install_arrow()` provides some convenience wrappers around the various
89environment variables described below.
90This function is part of the `arrow` package,
91and it is also available as a standalone script, so you can
92access it for convenience without first installing the package:
93
94```r
95source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R")
96```
97
98`install_arrow()` will install from CRAN,
99while `install_arrow(nightly = TRUE)` will give you a development build.
100`install_arrow()` does not require environment variables to be set in order to
101satisfy C++ dependencies.
102
103> Note that, unlike packages like `tensorflow`, `blogdown`, and others that require external dependencies, you do not need to run `install_arrow()` after a successful `arrow` installation.
104
105## Offline installation
106
107The `install-arrow.R` file also includes the `create_package_with_all_dependencies()`
108function. Normally, when installing on a computer with internet access, the
109build process will download third-party dependencies as needed.
110This function provides a way to download them in advance.
111Doing so may be useful when installing Arrow on a computer without internet access.
112Note that Arrow _can_ be installed on a computer without internet access without doing this, but
113many useful features will be disabled, as they depend on third-party components.
More precisely, `arrow::arrow_info()$capabilities` will be `FALSE` for every
capability.
116One approach to add more capabilities in an offline install is to prepare a
117package with pre-downloaded dependencies. The
118`create_package_with_all_dependencies()` function does this preparation.
119
120If you're using binary packages you shouldn't need to follow these steps. You
121should download the appropriate binary from your package repository, transfer
122that to the offline computer, and install that. Any OS can create the source
123bundle, but it cannot be installed on Windows. (Instead, use a standard
124Windows binary package.)
125
126Note if you're using RStudio Package Manager on Linux: If you still want to
127make a source bundle with this function, make sure to set the first repo in
128`options("repos")` to be a mirror that contains source packages (that is:
129something other than the RSPM binary mirror URLs).
130
131### Using a computer with internet access, pre-download the dependencies:
132* Install the `arrow` package _or_ run
133  `source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R")`
134* Run `create_package_with_all_dependencies("my_arrow_pkg.tar.gz")`
135* Copy the newly created `my_arrow_pkg.tar.gz` to the computer without internet access
136
137### On the computer without internet access, install the prepared package:
138* Install the `arrow` package from the copied file
139  * `install.packages("my_arrow_pkg.tar.gz", dependencies = c("Depends", "Imports", "LinkingTo"))`
140  * This installation will build from source, so `cmake` must be available
141* Run `arrow_info()` to check installed capabilities
142
143#### Alternative, hands-on approach
144* Download the dependency files (`cpp/thirdparty/download_dependencies.sh` may be helpful)
145* Copy the directory of dependencies to the offline computer
146* Create the environment variable `ARROW_THIRDPARTY_DEPENDENCY_DIR` on the offline computer, pointing to the copied directory.
147* Install the `arrow` package as usual.
148
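As a rough illustration of the hands-on approach above, the following shell sketch checks that the copied directory exists and is not empty before exporting `ARROW_THIRDPARTY_DEPENDENCY_DIR`. The helper name `use_thirdparty_dir` and the example path are hypothetical; they are not part of the `arrow` package.

```shell
# Hypothetical helper: point the build at a directory of pre-downloaded
# third-party sources. Fails if the directory is missing or empty.
use_thirdparty_dir() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "No such directory: $dir" >&2
    return 1
  fi
  if [ -z "$(ls -A "$dir")" ]; then
    echo "Directory is empty: $dir" >&2
    return 1
  fi
  # Store an absolute path so the value does not depend on the
  # working directory used during the build.
  ARROW_THIRDPARTY_DEPENDENCY_DIR="$(cd "$dir" && pwd)"
  export ARROW_THIRDPARTY_DEPENDENCY_DIR
}

# Example usage (the path is an assumption):
# use_thirdparty_dir "$HOME/arrow-thirdparty"
# ...then install the arrow package as usual.
```
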
149## S3 support
150
151The `arrow` package allows you to work with data in AWS S3 or in other cloud
storage systems that emulate S3. However, support for working with S3 is not
153enabled in the default build, and it has additional system requirements. To
154enable it, set the environment variable `LIBARROW_MINIMAL=false` or
155`NOT_CRAN=true` to choose the full-featured build, or more selectively set
156`ARROW_S3=ON`. You also need the following system dependencies:
157
158* `gcc` >= 4.9 or `clang` >= 3.3; note that the default compiler on CentOS 7 is gcc 4.8.5, which is not sufficient
159* CURL: install `libcurl-devel` (rpm) or `libcurl4-openssl-dev` (deb)
160* OpenSSL >= 1.0.2: install `openssl-devel` (rpm) or `libssl-dev` (deb)
161
162The prebuilt C++ binaries come with S3 support enabled, so you will need to meet
163these system requirements in order to use them--the package will not install
164without them. If you're building everything from source, the install script
165will check for the presence of these dependencies and turn off S3 support in the
166build if the prerequisites are not met--installation will succeed but without
167S3 functionality. If afterwards you install the missing system requirements,
168you'll need to reinstall the package in order to enable S3 support.
169
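Before installing, you can approximate these requirement checks by hand. The sketch below is not the install script's own logic: it assumes GNU `sort -V` is available, checks only `gcc` (not `clang`), and assumes the development packages register themselves with `pkg-config` under the module names `libcurl` and `openssl`. The helper name `version_ge` is hypothetical.

```shell
# Return success if version $1 is at least version $2 (uses GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Compiler: gcc >= 4.9 (clang is not checked in this sketch)
if command -v gcc >/dev/null 2>&1; then
  gcc_version="$(gcc -dumpversion)"
  if version_ge "$gcc_version" 4.9; then
    echo "gcc $gcc_version: ok"
  else
    echo "gcc $gcc_version: too old for S3 support"
  fi
else
  echo "gcc: not found"
fi

# Development headers, as registered with pkg-config
if command -v pkg-config >/dev/null 2>&1; then
  if pkg-config --exists libcurl; then
    echo "libcurl: ok"
  else
    echo "libcurl: missing"
  fi
  if pkg-config --atleast-version=1.0.2 openssl; then
    echo "openssl: ok"
  else
    echo "openssl: missing or older than 1.0.2"
  fi
else
  echo "pkg-config: not found"
fi
```
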
170# How dependencies are resolved
171
172In order for the `arrow` R package to work, it needs the Arrow C++ library.
173There are a number of ways you can get it: a system package; a library you've
174built yourself outside of the context of installing the R package;
175or, if you don't already have it, the R package will attempt to resolve it
176automatically when it installs.
177
178If you are authorized to install system packages and you're installing a CRAN release,
179you may want to use the official Apache Arrow release packages corresponding to the R package version (though there are some drawbacks: see "Troubleshooting" below).
180See the [Arrow project installation page](https://arrow.apache.org/install/)
181to find pre-compiled binary packages for some common Linux distributions,
182including Debian, Ubuntu, and CentOS.
183You'll need to install `libparquet-dev` on Debian and Ubuntu, or `parquet-devel` on CentOS.
184This will also automatically install the Arrow C++ library as a dependency.
185
186When you install the `arrow` R package on Linux,
187it will first attempt to find the Arrow C++ libraries on your system using
188the `pkg-config` command.
189This will find either installed system packages or libraries you've built yourself.
190In order for `install.packages("arrow")` to work with these system packages,
191you'll need to install them before installing the R package.
192
193If no Arrow C++ libraries are found on the system,
194the R package installation script will next attempt to download
195prebuilt static Arrow C++ libraries
that match both your local operating system and `arrow` R package version.
197C++ binaries will only be retrieved if you have set the environment variable
198`LIBARROW_BINARY` or `NOT_CRAN`.
199If found, they will be downloaded and bundled when your R package compiles.
200For a list of supported distributions and versions,
201see the [arrow-r-nightly](https://github.com/ursa-labs/arrow-r-nightly/blob/master/README.md) project.
202
If no C++ library binary is found, the installation script will attempt to build the library locally.
First, it checks whether you are in
a checkout of the `apache/arrow` git repository and thus have the C++ source there.
Otherwise, it builds from the C++ files included in the package.
207Depending on your system, building Arrow C++ from source may be slow.
208
209For the specific mechanics of how all this works, see the R package `configure` script,
210which calls `tools/nixlibs.R`.
211
212If the C++ library is built from source, `inst/build_arrow_static.sh` is executed.
213This build script is also what is used to generate the prebuilt binaries.
214
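The resolution order described in this section can be summarized in a small shell sketch. It is only an illustration of the decision flow, not the actual `configure` or `tools/nixlibs.R` logic; the function names `wants_binary` and `arrow_cpp_source` and their yes/no arguments are hypothetical.

```shell
# Hypothetical: should a prebuilt binary be tried at all?
wants_binary() {
  # Lower-case the values, since boolean variables are case-insensitive
  binary="$(printf '%s' "${LIBARROW_BINARY:-}" | tr '[:upper:]' '[:lower:]')"
  not_cran="$(printf '%s' "${NOT_CRAN:-}" | tr '[:upper:]' '[:lower:]')"
  if [ -n "$binary" ]; then
    # "true" or a distro-version string both request a binary
    [ "$binary" != "false" ]
  else
    [ "$not_cran" = "true" ]
  fi
}

# $1: "yes" if pkg-config finds Arrow C++ on the system
# $2: "yes" if a prebuilt binary exists for this OS and package version
arrow_cpp_source() {
  if [ "$1" = "yes" ]; then
    echo "system"
  elif wants_binary && [ "$2" = "yes" ]; then
    echo "binary"
  else
    echo "source"
  fi
}

# Example usage:
# arrow_cpp_source "$(pkg-config --exists arrow && echo yes || echo no)" no
```
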
215## How the package is installed - advanced
216
This subsection contains information that is mostly relevant
to Arrow developers; Arrow users do not need it to install Arrow.
219
220There are a number of scripts that are triggered when `R CMD INSTALL .` is run.
221For Arrow users, these should all just work without configuration and pull in
222the most complete pieces (e.g. official binaries that we host).
223
224An overview of these scripts is shown below:
225
226* `configure` and `configure.win` - these scripts are triggered during
227`R CMD INSTALL .` on non-Windows and Windows platforms, respectively. They
228handle finding the Arrow library, setting up the build variables necessary, and
229writing the package Makevars file that is used to compile the C++ code in the R
230package.
231
* `tools/nixlibs.R` - this script is sometimes called by `configure` on Linux
(or on any non-Windows OS with the environment variable
`FORCE_BUNDLED_BUILD=true`). This sets up the build process for our bundled
builds (which is the default on Linux). The operative logic is at the end of
the script, but it will do the following, stopping at the first step that
succeeds (some of the steps are only checked if they are enabled via an
environment variable):
  * Check if there is an already built libarrow in `arrow/r/libarrow-{version}`,
  and use that to link against if it exists.
  * Check if a binary is available from our hosted unofficial builds.
  * Download the Arrow source and build the Arrow Library from source.
  * `*** Proceed without C++ dependencies` (this is an error and the package
  will not work, but if you see this message you know the previous steps have
  not succeeded/were not enabled)
246
247* `inst/build_arrow_static.sh` - called by `tools/nixlibs.R` when the Arrow
248library is being built.  It builds Arrow for a bundled, static build, and
mirrors the steps described in the ["Arrow R Developer Guide" vignette](./developing.html).
250
251# Troubleshooting
252
253The intent is that `install.packages("arrow")` will just work and handle all C++
254dependencies, but depending on your system, you may have better results if you
255tune one of several parameters. Here are some known complications and ways to address them.
256
257## Package failed to build C++ dependencies
258
259If you see a message like
260
261```
262------------------------- NOTE ---------------------------
263There was an issue preparing the Arrow C++ libraries.
264See https://arrow.apache.org/docs/r/articles/install.html
265---------------------------------------------------------
266```
267
268in the output when the package fails to install,
269that means that installation failed to retrieve or build C++ libraries
270compatible with the current version of the R package.
271
272It is expected that C++ dependencies should be built successfully
273on all Linux distributions, so you should not see this message. If you do,
274please check the "Known installation issues" below to see if any apply.
275If none apply, set the environment variable `ARROW_R_DEV=TRUE`
276so that details on what failed are shown, and try installing again. Then,
277please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues)
278and include the full verbose installation output.
279
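When preparing such a report, a setup along the following lines may help collect the details. This is only a sketch: the `abs_path` helper, the `arrow-build-logs` directory, and the log file name are hypothetical, and `LIBARROW_DEBUG_DIR` is the variable described in the summary of build environment variables.

```shell
# Hypothetical helper: turn a possibly relative path into an absolute one,
# because LIBARROW_DEBUG_DIR must be an absolute path.
abs_path() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;
    *)  printf '%s\n' "$PWD/$1" ;;
  esac
}

# Show details on what failed during installation
export ARROW_R_DEV=TRUE

# Keep the C++ build logs instead of losing them with the temp directory
LIBARROW_DEBUG_DIR="$(abs_path arrow-build-logs)"
export LIBARROW_DEBUG_DIR

# Then retry the installation and keep the full output for the report, e.g.:
# Rscript -e 'install.packages("arrow", repos = "https://cloud.r-project.org")' 2>&1 | tee arrow-install.log
```
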
280## Using system libraries
281
282If a system library or other installed Arrow is found but it doesn't match the R package version
283(for example, you have libarrow 1.0.0 on your system and are installing R package 2.0.0),
284it is likely that the R bindings will fail to compile.
285Because the Apache Arrow project is under active development,
it is essential that versions of the C++ and R libraries match.
287When `install.packages("arrow")` has to download the C++ libraries,
288the install script ensures that you fetch the C++ libraries that correspond to your R package version.
289However, if you are using Arrow libraries already on your system, version match isn't guaranteed.
290
291To fix version mismatch, you can either update your system packages to match the R package version,
292or set the environment variable `ARROW_USE_PKG_CONFIG=FALSE`
293to tell the configure script not to look for system Arrow packages.
294(The latter is the default of `install_arrow()`.)
295System packages are available corresponding to all CRAN releases
296but not for nightly or dev versions, so depending on the R package version you're installing,
297system packages may not be an option.
298
299Note also that once you have a working R package installation based on system (shared) libraries,
300if you update your system Arrow, you'll need to reinstall the R package to match its version.
301Similarly, if you're using Arrow system libraries, running `update.packages()`
302after a new release of the `arrow` package will likely fail unless you first
303update the system packages.
304
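To see whether a system library is likely to conflict, you can compare its version with the R package version before installing. The sketch below assumes the system library is registered with `pkg-config` under the module name `arrow`, and that only the first three version components need to agree; both the `same_release` helper and the example version are hypothetical.

```shell
# Hypothetical helper: compare the first three components of two versions.
same_release() {
  a="$(printf '%s' "$1" | cut -d. -f1-3)"
  b="$(printf '%s' "$2" | cut -d. -f1-3)"
  [ "$a" = "$b" ]
}

r_pkg_version="2.0.0"  # assumption: the R package version you plan to install

if command -v pkg-config >/dev/null 2>&1 && pkg-config --exists arrow; then
  system_version="$(pkg-config --modversion arrow)"
  if same_release "$system_version" "$r_pkg_version"; then
    echo "System libarrow $system_version matches R package $r_pkg_version"
  else
    echo "Mismatch: system libarrow $system_version vs R package $r_pkg_version"
    echo "Consider: export ARROW_USE_PKG_CONFIG=false"
  fi
else
  echo "No system Arrow C++ library found via pkg-config"
fi
```
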
305## Using prebuilt binaries
306
307If the R package finds and downloads a prebuilt binary of the C++ library,
308but then the `arrow` package can't be loaded, perhaps with "undefined symbols" errors,
309please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues).
310This is likely a compiler mismatch and may be resolvable by setting some
311environment variables to instruct R to compile the packages to match the C++ library.
312
313A workaround would be to set the environment variable `LIBARROW_BINARY=FALSE`
314and retry installation: this value instructs the package to build the C++ library from source
315instead of downloading the prebuilt binary.
316That should guarantee that the compiler settings match.
317
318If a prebuilt binary wasn't found for your operating system but you think it should have been,
319check the logs for a message that says `*** Unable to identify current OS/version`,
320or a message that says `*** No C++ binaries found for` an invalid OS.
321If you see either, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues).
322You may also set the environment variable `ARROW_R_DEV=TRUE` for additional
323debug messages.
324
325A workaround would be to set the environment variable `LIBARROW_BINARY`
326to a `distribution-version` that exists in the Ursa Labs repository.
327Setting `LIBARROW_BINARY` is also an option when there's not an exact match
328for your OS but a similar version would work,
329such as if you're on `ubuntu-18.10` and there's only a binary for `ubuntu-18.04`.
330
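One way to see which `distribution-version` string describes your machine is to read `/etc/os-release`. The helper below is a hypothetical sketch; the string it prints is not guaranteed to match what the install script computes, so treat it as a starting point for choosing a value.

```shell
# Hypothetical helper: print "<id>-<version_id>" from an os-release file.
distro_version() {
  os_release="${1:-/etc/os-release}"
  # os-release files are shell-compatible assignments, so source the file
  # in a subshell to avoid polluting the current environment.
  ( . "$os_release" && printf '%s-%s\n' "$ID" "$VERSION_ID" )
}

# Example usage: inspect the current system, then pick a nearby value
# that has binaries available.
# distro_version
# export LIBARROW_BINARY=ubuntu-18.04
```
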
331If that workaround works for you, and you believe that it should work for everyone else too,
332you may propose [adding an entry to this lookup table](https://github.com/ursa-labs/arrow-r-nightly/edit/master/linux/distro-map.csv).
333This table is checked during the installation process
334and tells the script to use binaries built on a different operating system/version
335because they're known to work.
336
337## Building C++ from source
338
339If building the C++ library from source fails, check the error message.
340(If you don't see an error message, only the `----- NOTE -----`,
341set the environment variable `ARROW_R_DEV=TRUE` to increase verbosity and retry installation.)
342The install script should work everywhere, so if the C++ library fails to compile,
343please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues)
344so that we can improve the script.
345
346## Known installation issues
347
348* On CentOS, if you are using a more modern `devtoolset`, you may need to set
349the environment variables `CC` and `CXX` either in the shell or in R's `Makeconf`.
350For CentOS 7 and above, both the Arrow system packages and the C++ binaries
351for R are built with the default system compilers. If you want to use either of these
352and you have a `devtoolset` installed, set `CC=/usr/bin/gcc CXX=/usr/bin/g++`
353to use the system compilers instead of the `devtoolset`.
354Alternatively, if you want to build `arrow` with the newer `devtoolset` compilers,
355set both `ARROW_USE_PKG_CONFIG` and `LIBARROW_BINARY` to `false` so that
356you build the Arrow C++ from source using those compilers.
357Compiler mismatch between the arrow system libraries and the R
358package may cause R to segfault when `arrow` package functions are used.
359See discussions [here](https://issues.apache.org/jira/browse/ARROW-8586)
360and [here](https://issues.apache.org/jira/browse/ARROW-10780).
361
362* If you have multiple versions of `zstd` installed on your system,
363installation by building the C++ from source may fail with an undefined symbols
364error. Workarounds include (1) setting `LIBARROW_BINARY` to use a C++ binary; (2)
365setting `ARROW_WITH_ZSTD=OFF` to build without `zstd`; or (3) uninstalling
366the conflicting `zstd`.
367See discussion [here](https://issues.apache.org/jira/browse/ARROW-8556).
368
369## Summary of build environment variables
370
371Some features are optional when you build Arrow from source. With the exception of `ARROW_S3`, these are all `ON` by default in the bundled C++ build, but you can set them to `OFF` to disable them.
372
373* `ARROW_S3`: If set to `ON` S3 support will be built as long as the
374  dependencies are met; if they are not met, the build script will turn this `OFF`
375* `ARROW_JEMALLOC` for the `jemalloc` memory allocator
* `ARROW_MIMALLOC` for the `mimalloc` memory allocator
377* `ARROW_PARQUET`
378* `ARROW_DATASET`
379* `ARROW_JSON` for the JSON parsing library
380* `ARROW_WITH_RE2` for the RE2 regular expression library, used in some string compute functions
381* `ARROW_WITH_UTF8PROC` for the UTF8Proc string library, used in many other string compute functions
383* `ARROW_WITH_BROTLI`, `ARROW_WITH_BZ2`, `ARROW_WITH_LZ4`, `ARROW_WITH_SNAPPY`, `ARROW_WITH_ZLIB`, and `ARROW_WITH_ZSTD` for various compression algorithms
384
385
386There are a number of other variables that affect the `configure` script and the bundled build script.
387By default, these are all unset. All boolean variables are case-insensitive.
388
389* `ARROW_USE_PKG_CONFIG`: If set to `false`, the configure script
390  won't look for Arrow libraries on your system and instead will look to download/build them.
391  Use this if you have a version mismatch between installed system libraries
392  and the version of the R package you're installing.
393* `LIBARROW_BINARY`: If set to `true`, the script will try to download a binary
394  C++ library built for your operating system.
395  You may also set it to some other string,
396  a related "distro-version" that has binaries built that work for your OS.
397  If no binary is found, installation will fall back to building C++
398  dependencies from source.
399* `LIBARROW_BUILD`: If set to `false`, the build script
400  will not attempt to build the C++ from source. This means you will only get
401  a working `arrow` R package if a prebuilt binary is found.
402  Use this if you want to avoid compiling the C++ library, which may be slow
403  and resource-intensive, and ensure that you only use a prebuilt binary.
404* `LIBARROW_MINIMAL`: If set to `false`, the build script
405  will enable some optional features, including compression libraries, S3
406  support, and additional alternative memory allocators. This will increase the
407  source build time but results in a more fully functional library.
408* `NOT_CRAN`: If this variable is set to `true`, as the `devtools` package does,
409  the build script will set `LIBARROW_BINARY=true` and `LIBARROW_MINIMAL=false`
410  unless those environment variables are already set. This provides for a more
411  complete and fast installation experience for users who already have
412  `NOT_CRAN=true` as part of their workflow, without requiring additional
413  environment variables to be set.
414* `ARROW_R_DEV`: If set to `true`, more verbose messaging will be printed
415  in the build script. `arrow::install_arrow(verbose = TRUE)` sets this.
416  This variable also is needed if you're modifying C++
417  code in the package: see the developer guide vignette.
418* `LIBARROW_DEBUG_DIR`: If the C++ library building from source fails (`cmake`),
419  there may be messages telling you to check some log file in the build directory.
420  However, when the library is built during R package installation,
421  that location is in a temp directory that is already deleted.
422  To capture those logs, set this variable to an absolute (not relative) path
423  and the log files will be copied there.
424  The directory will be created if it does not exist.
425* `CMAKE`: When building the C++ library from source, you can specify a
  `/path/to/cmake` to use a different version than whatever is found on the `$PATH`.
427
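The `NOT_CRAN` rule in the list above can be pictured with a short shell sketch. It mimics the described behavior and is not the actual build script; the function name `apply_not_cran_defaults` is hypothetical.

```shell
# Hypothetical sketch of the NOT_CRAN defaulting rule.
apply_not_cran_defaults() {
  # Lower-case the value, since boolean variables are case-insensitive
  not_cran="$(printf '%s' "${NOT_CRAN:-}" | tr '[:upper:]' '[:lower:]')"
  if [ "$not_cran" = "true" ]; then
    # ${VAR=value} assigns only when VAR is unset,
    # so values you set explicitly are left alone.
    : "${LIBARROW_BINARY=true}"
    : "${LIBARROW_MINIMAL=false}"
    export LIBARROW_BINARY LIBARROW_MINIMAL
  fi
}

# Example usage:
# NOT_CRAN=true; export NOT_CRAN
# apply_not_cran_defaults
# echo "$LIBARROW_BINARY $LIBARROW_MINIMAL"
```
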
428# Contributing
429
As mentioned above, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues)
if you find ways to improve the installation process. If you find that your Linux distribution
432or version is not supported, we welcome the contribution of Docker images
433(hosted on Docker Hub) that we can use in our continuous integration. These
434Docker images should be minimal, containing only R and the dependencies it
435requires. (For reference, see the images that
436[R-hub](https://github.com/r-hub/rhub-linux-builders) uses.)
437
438You can test the `arrow` R package installation using the `docker-compose`
439setup included in the `apache/arrow` git repository. For example,
440
441```
442R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose build r
443R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose run r
444```
445
446installs the `arrow` R package, including the C++ source build, on the
447[rhub/ubuntu-gcc-release](https://hub.docker.com/r/rhub/ubuntu-gcc-release)
448image.
449