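| Name | Date | Size | #Lines | LOC |
|------|------|------|--------|-----|
| .. | 03-May-2022 | - | | |
| adapter/ | 10-Nov-2021 | - | 16,986 | 11,257 |
| algorithm/ | 10-Nov-2021 | - | 8,016 | 4,797 |
| c/ | 03-May-2022 | - | 4,455 | 2,878 |
| compression/ | 10-Nov-2021 | - | 476 | 308 |
| dataset/ | 03-May-2022 | - | 2,446 | 1,351 |
| dev/checkstyle/ | 10-Nov-2021 | - | 341 | 258 |
| flight/ | 10-Nov-2021 | - | 17,443 | 11,082 |
| format/ | 10-Nov-2021 | - | 3,643 | 1,626 |
| gandiva/ | 03-May-2022 | - | 8,384 | 5,641 |
| memory/ | 10-Nov-2021 | - | 14,376 | 7,846 |
| performance/ | 10-Nov-2021 | - | 3,016 | 1,943 |
| plasma/ | 10-Nov-2021 | - | 957 | 525 |
| tools/ | 10-Nov-2021 | - | 1,570 | 1,129 |
| vector/ | 10-Nov-2021 | - | 77,111 | 48,566 |
| README.md | 10-Nov-2021 | 6.5 KiB | 165 | 105 |
| api-changes.md | 10-Nov-2021 | 2 KiB | 33 | 10 |
| pom.xml | 10-Nov-2021 | 29 KiB | 840 | 768 |
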
README.md

<!---
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.
-->

# Arrow Java

## Getting Started

The following guides explain the fundamental data structures used in the Java implementation of Apache Arrow.

- [ValueVector](https://arrow.apache.org/docs/java/vector.html) is an abstraction that stores a sequence of values of the same type in an individual column.
- [VectorSchemaRoot](https://arrow.apache.org/docs/java/vector_schema_root.html) is a container that can hold multiple vectors based on a schema.
- The [Reading/Writing IPC formats](https://arrow.apache.org/docs/java/ipc.html) guide explains how to stream record batches and how to serialize record batches to files.

Generated Javadoc documentation is available [here](https://arrow.apache.org/docs/java/).

## Setup Build Environment

Install:
 - Java 8 or later
 - Maven 3.3 or later

## Building and running tests

```
git submodule update --init --recursive # Needed for flight
cd java
mvn install
```

## Building and running tests for Arrow JNI modules like Gandiva and ORC (optional)

[Arrow Cpp][2] must be built before this step. The C++ build directory must
be provided as the value of the argument `arrow.cpp.build.dir`, e.g.

```
cd java
mvn install -P arrow-jni -am -Darrow.cpp.build.dir=../../release
```

The Gandiva library is still in the alpha stage and is subject to API changes without
deprecation warnings.

## Flatbuffers dependency

Arrow uses Google's Flatbuffers to transport metadata.  The Java version of the library
requires that the generated Flatbuffers classes be used only with the same version that
generated them.  Arrow packages a version of the arrow-vector module that shades flatbuffers
and arrow-format into a single JAR.  Using the classifier "shade-format-flatbuffers" in your
pom.xml will make use of this JAR; you can then exclude/resolve the original dependency to
a version of your choosing.

### Updating the flatbuffers generated code

1. Verify that your version of flatc matches the declared dependency:

```bash
$ flatc --version
flatc version 1.12.0

$ grep "dep.fbs.version" java/pom.xml
    <dep.fbs.version>1.12.0</dep.fbs.version>
```

2. Generate the Flatbuffers Java files by performing the following:

```bash
cd "$ARROW_HOME"

# remove the existing files
rm -rf java/format/src

# regenerate from the .fbs files
flatc --java -o java/format/src/main/java format/*.fbs

# prepend license header (read from a file named `header` in the current directory)
find java/format/src -type f | while read -r file; do
  (cat header | while read -r line; do echo "// $line"; done; cat "$file") > "$file.tmp"
  mv "$file.tmp" "$file"
done
```

## Performance Tuning

There are several system properties and environment variables that users can configure.  They trade safety for speed by turning off checks.  Typically they are used in production settings only after the code has been thoroughly tested with the checks enabled.

* Bounds checking for memory accesses: Bounds checking is on by default.  You can disable it by setting either the
system property ("arrow.enable_unsafe_memory_access") or the environment variable
("ARROW_ENABLE_UNSAFE_MEMORY_ACCESS") to "true". When both the system property and the environment
variable are set, the system property takes precedence.

* Null checking for gets: `ValueVector` get methods (but not `getObject` methods) verify by default that the slot is not null.  You can disable it by setting either the
system property ("arrow.enable_null_check_for_get") or the environment variable
("ARROW_ENABLE_NULL_CHECK_FOR_GET") to "false". When both the system property and the environment
variable are set, the system property takes precedence.

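
The precedence rule described above can be sketched in plain Java. This is only an illustration and not Arrow's actual implementation: the `FlagResolver` class and its `resolve` method are hypothetical, and parsing the values with `Boolean.parseBoolean` is an assumption. Only the property and variable names come from this README.

```java
/** Hypothetical helper for illustration; not part of Arrow. */
final class FlagResolver {
  private FlagResolver() {}

  /**
   * Resolves a boolean flag: the system property wins when it is set,
   * otherwise the environment variable is used, otherwise the default.
   * Parsing with Boolean.parseBoolean is an assumption of this sketch.
   */
  static boolean resolve(String systemPropertyValue, String envValue, boolean defaultValue) {
    if (systemPropertyValue != null) {
      return Boolean.parseBoolean(systemPropertyValue);
    }
    if (envValue != null) {
      return Boolean.parseBoolean(envValue);
    }
    return defaultValue;
  }

  public static void main(String[] args) {
    // Unsafe memory access is off by default, so bounds checking stays on.
    boolean unsafeAccess = resolve(
        System.getProperty("arrow.enable_unsafe_memory_access"),
        System.getenv("ARROW_ENABLE_UNSAFE_MEMORY_ACCESS"),
        false);
    // Null checking for gets is on by default.
    boolean nullCheckForGet = resolve(
        System.getProperty("arrow.enable_null_check_for_get"),
        System.getenv("ARROW_ENABLE_NULL_CHECK_FOR_GET"),
        true);
    System.out.println("unsafe memory access enabled: " + unsafeAccess);
    System.out.println("null check for get enabled: " + nullCheckForGet);
  }
}
```

Running the sketch with `-Darrow.enable_unsafe_memory_access=true` while `ARROW_ENABLE_UNSAFE_MEMORY_ACCESS=false` is exported reports the value from the system property, which mirrors the precedence stated above.
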
## Java Properties

 * For Java 9 or later, set "-Dio.netty.tryReflectionSetAccessible=true".
This fixes `java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available`, which is thrown by Netty.
 * To support duplicate fields in a `StructVector`, enable "-Darrow.struct.conflict.policy=CONFLICT_APPEND".
By default (`CONFLICT_REPLACE`), duplicate fields are not kept: a later field overwrites the earlier field with the same name. To support different policies for
conflicting or duplicate fields, set this JVM flag or use the appropriate static constructor methods for `StructVector`s.

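
As a rough sketch of how an application might detect the first situation at startup, the following plain-Java check warns when the JVM is Java 9 or later and the Netty flag is not set. The `NettyFlagCheck` class and its methods are hypothetical and are not part of Arrow or Netty. The sketch assumes the flag is passed as a JVM system property, as in the bullet above.

```java
/** Hypothetical startup check for illustration; not part of Arrow or Netty. */
final class NettyFlagCheck {
  private NettyFlagCheck() {}

  /** Parses a java.specification.version value: "1.8" gives 8, "11" gives 11. */
  static int majorVersion(String specVersion) {
    String v = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
    int dot = v.indexOf('.');
    return Integer.parseInt(dot < 0 ? v : v.substring(0, dot));
  }

  /** Returns true when running on Java 9 or later without the flag set to "true". */
  static boolean flagMissing(String specVersion, String flagValue) {
    return majorVersion(specVersion) >= 9 && !"true".equalsIgnoreCase(flagValue);
  }

  public static void main(String[] args) {
    String spec = System.getProperty("java.specification.version");
    String flag = System.getProperty("io.netty.tryReflectionSetAccessible");
    if (flagMissing(spec, flag)) {
      System.err.println(
          "Consider adding -Dio.netty.tryReflectionSetAccessible=true to the JVM arguments");
    }
  }
}
```
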
## Java Code Style Guide

Arrow Java follows the Google style guide [here][3] with the following
differences:

* Imports are grouped, from top to bottom, in this order: static imports,
standard Java, org.\*, com.\*
* Line length can be up to 120 characters
* Operators for line wrapping are at end-of-line
* Naming rules for methods, parameters, etc. have been relaxed
* Disabled `NoFinalizer`, `OverloadMethodsDeclarationOrder`, and
`VariableDeclarationUsageDistance` due to the existing code base. These rules
should be followed when possible.

Refer to `java/dev/checkstyle/checkstyle.xml` for rule specifics.

## Test Logging Configuration

When running tests, Arrow Java uses the Logback logger with SLF4J. By default,
it uses the logback.xml present in the corresponding module's src/test/resources
directory, which has the default log level set to INFO.
Arrow Java can be built with an alternate logback configuration file using the
following command run in the project root directory:

```bash
mvn -Dlogback.configurationFile=file:<path-of-logback-file>
```

See [Logback Configuration][1] for more details.

## Integration Tests

Integration tests which require more time or more memory can be run by activating
the `integration-tests` profile. This activates the [maven failsafe][4] plugin
and any class prefixed with `IT` will be run during the testing phase. The integration
tests currently require a larger amount of memory (>4GB) and time to complete. To activate
the profile:

```bash
mvn -Pintegration-tests <rest of mvn arguments>
```

[1]: https://logback.qos.ch/manual/configuration.html
[2]: https://github.com/apache/arrow/blob/master/cpp/README.md
[3]: http://google.github.io/styleguide/javaguide.html
[4]: https://maven.apache.org/surefire/maven-failsafe-plugin/