1.. Licensed to the Apache Software Foundation (ASF) under one
2.. or more contributor license agreements.  See the NOTICE file
3.. distributed with this work for additional information
4.. regarding copyright ownership.  The ASF licenses this file
5.. to you under the Apache License, Version 2.0 (the
6.. "License"); you may not use this file except in compliance
7.. with the License.  You may obtain a copy of the License at
8
9..   http://www.apache.org/licenses/LICENSE-2.0
10
11.. Unless required by applicable law or agreed to in writing,
12.. software distributed under the License is distributed on an
13.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14.. KIND, either express or implied.  See the License for the
15.. specific language governing permissions and limitations
16.. under the License.
17
18.. _format_columnar:
19
20*********************
21Arrow Columnar Format
22*********************
23
24*Version: 1.0*
25
26The "Arrow Columnar Format" includes a language-agnostic in-memory
27data structure specification, metadata serialization, and a protocol
28for serialization and generic data transport.
29
30This document is intended to provide adequate detail to create a new
31implementation of the columnar format without the aid of an existing
32implementation. We utilize Google's `Flatbuffers`_ project for
33metadata serialization, so it will be necessary to refer to the
34project's `Flatbuffers protocol definition files`_
35while reading this document.
36
37The columnar format has some key features:
38
39* Data adjacency for sequential access (scans)
40* O(1) (constant-time) random access
41* SIMD and vectorization-friendly
42* Relocatable without "pointer swizzling", allowing for true zero-copy
43  access in shared memory
44
45The Arrow columnar format provides analytical performance and data
46locality guarantees in exchange for comparatively more expensive
47mutation operations. This document is concerned only with in-memory
48data representation and serialization details; issues such as
49coordinating mutation of data structures are left to be handled by
50implementations.
51
52Terminology
53===========
54
55Since different projects have used different words to describe various
56concepts, here is a small glossary to help disambiguate.
57
58* **Array** or **Vector**: a sequence of values with known length all
59  having the same type. These terms are used interchangeably in
60  different Arrow implementations, but we use "array" in this
61  document.
62* **Slot**: a single logical value in an array of some particular data type
63* **Buffer** or **Contiguous memory region**: a sequential virtual
64  address space with a given length. Any byte can be reached via a
65  single pointer offset less than the region's length.
66* **Physical Layout**: The underlying memory layout for an array
67  without taking into account any value semantics. For example, a
68  32-bit signed integer array and 32-bit floating point array have the
69  same layout.
70* **Parent** and **child arrays**: names to express relationships
71  between physical value arrays in a nested type structure. For
72  example, a ``List<T>``-type parent array has a T-type array as its
73  child (see more on lists below).
74* **Primitive type**: a data type having no child types. This includes
75  such types as fixed bit-width, variable-size binary, and null types.
76* **Nested type**: a data type whose full structure depends on one or
77  more other child types. Two fully-specified nested types are equal
78  if and only if their child types are equal. For example, ``List<U>``
79  is distinct from ``List<V>`` iff U and V are different types.
80* **Logical type**: An application-facing semantic value type that is
81  implemented using some physical layout. For example, Decimal
82  values are stored as 16 bytes in a fixed-size binary
83  layout. Similarly, strings can be stored as ``List<1-byte>``. A
84  timestamp may be stored as 64-bit fixed-size layout.
85
86Physical Memory Layout
87======================
88
89Arrays are defined by a few pieces of metadata and data:
90
91* A logical data type.
92* A sequence of buffers.
93* A length as a 64-bit signed integer. Implementations are permitted
94  to be limited to 32-bit lengths, see more on this below.
95* A null count as a 64-bit signed integer.
96* An optional **dictionary**, for dictionary-encoded arrays.
97
98Nested arrays additionally have a sequence of one or more sets of
99these items, called the **child arrays**.
100
101Each logical data type has a well-defined physical layout. Here are
102the different physical layouts defined by Arrow:
103
104* **Primitive (fixed-size)**: a sequence of values each having the
105  same byte or bit width
106* **Variable-size Binary**: a sequence of values each having a variable
107  byte length. Two variants of this layout are supported using 32-bit
108  and 64-bit length encoding.
109* **Fixed-size List**: a nested layout where each value has the same
110  number of elements taken from a child data type.
111* **Variable-size List**: a nested layout where each value is a
112  variable-length sequence of values taken from a child data type. Two
113  variants of this layout are supported using 32-bit and 64-bit length
114  encoding.
115* **Struct**: a nested layout consisting of a collection of named
116  child **fields** each having the same length but possibly different
117  types.
118* **Sparse** and **Dense Union**: a nested layout representing a
119  sequence of values, each of which can have type chosen from a
120  collection of child array types.
121* **Null**: a sequence of all null values, having null logical type
122
123The Arrow columnar memory layout only applies to *data* and not
124*metadata*. Implementations are free to represent metadata in-memory
125in whichever form is convenient for them. We handle metadata
126**serialization** in an implementation-independent way using
127`Flatbuffers`_, detailed below.
128
129Buffer Alignment and Padding
130----------------------------
131
132Implementations are recommended to allocate memory on aligned
133addresses (multiple of 8- or 64-bytes) and pad (overallocate) to a
134length that is a multiple of 8 or 64 bytes. When serializing Arrow
135data for interprocess communication, these alignment and padding
136requirements are enforced. If possible, we suggest that you prefer
137using 64-byte alignment and padding. Unless otherwise noted, padded
138bytes do not need to have a specific value.
139
140The alignment requirement follows best practices for optimized memory
141access:
142
143* Elements in numeric arrays will be guaranteed to be retrieved via aligned access.
144* On some architectures alignment can help limit partially used cache lines.
145
146The recommendation for 64 byte alignment comes from the `Intel
147performance guide`_ that recommends alignment of memory to match SIMD
148register width.  The specific padding length was chosen because it
149matches the largest SIMD instruction registers available on widely
150deployed x86 architecture (Intel AVX-512).
151
152The recommended padding of 64 bytes allows for using `SIMD`_
153instructions consistently in loops without additional conditional
154checks.  This should allow for simpler, efficient and CPU
155cache-friendly code.  In other words, we can load the entire 64-byte
156buffer into a 512-bit wide SIMD register and get data-level
157parallelism on all the columnar values packed into the 64-byte
158buffer. Guaranteed padding can also allow certain compilers to
159generate more optimized code directly (e.g. One can safely use Intel's
160``-qopt-assume-safe-padding``).
161
162Array lengths
163-------------
164
165Array lengths are represented in the Arrow metadata as a 64-bit signed
166integer. An implementation of Arrow is considered valid even if it only
167supports lengths up to the maximum 32-bit signed integer, though. If using
168Arrow in a multi-language environment, we recommend limiting lengths to
1692 :sup:`31` - 1 elements or less. Larger data sets can be represented using
170multiple array chunks.
171
172Null count
173----------
174
175The number of null value slots is a property of the physical array and
176considered part of the data structure. The null count is represented
177in the Arrow metadata as a 64-bit signed integer, as it may be as
178large as the array length.
179
180Validity bitmaps
181----------------
182
183Any value in an array may be semantically null, whether primitive or nested
184type.
185
186All array types, with the exception of union types (more on these later),
187utilize a dedicated memory buffer, known as the validity (or "null") bitmap, to
188encode the nullness or non-nullness of each value slot. The validity bitmap
189must be large enough to have at least 1 bit for each array slot.
190
191Whether any array slot is valid (non-null) is encoded in the respective bits of
192this bitmap. A 1 (set bit) for index ``j`` indicates that the value is not null,
193while a 0 (bit not set) indicates that it is null. Bitmaps are to be
194initialized to be all unset at allocation time (this includes padding): ::
195
196    is_valid[j] -> bitmap[j / 8] & (1 << (j % 8))
197
198We use `least-significant bit (LSB) numbering`_ (also known as
199bit-endianness). This means that within a group of 8 bits, we read
200right-to-left: ::
201
202    values = [0, 1, null, 2, null, 3]
203
204    bitmap
205    j mod 8   7  6  5  4  3  2  1  0
206              0  0  1  0  1  0  1  1
207
208Arrays having a 0 null count may choose to not allocate the validity
209bitmap. Implementations may choose to always allocate one anyway as a
210matter of convenience, but this should be noted when memory is being
211shared.
212
213Nested type arrays except for union types have their own validity bitmap and
214null count regardless of the null count and valid bits of their child arrays.
215
216Array slots which are null are not required to have a particular
217value; any "masked" memory can have any value and need not be zeroed,
218though implementations frequently choose to zero memory for null
219values.
220
221Fixed-size Primitive Layout
222---------------------------
223
224A primitive value array represents an array of values each having the
225same physical slot width typically measured in bytes, though the spec
226also provides for bit-packed types (e.g. boolean values encoded in
227bits).
228
229Internally, the array contains a contiguous memory buffer whose total
230size is at least as large as the slot width multiplied by the array
231length. For bit-packed types, the size is rounded up to the nearest
232byte.
233
234The associated validity bitmap is contiguously allocated (as described
235above) but does not need to be adjacent in memory to the values
236buffer.
237
238**Example Layout: Int32 Array**
239
240For example a primitive array of int32s: ::
241
242    [1, null, 2, 4, 8]
243
244Would look like: ::
245
246    * Length: 5, Null count: 1
247    * Validity bitmap buffer:
248
249      |Byte 0 (validity bitmap) | Bytes 1-63            |
250      |-------------------------|-----------------------|
251      | 00011101                | 0 (padding)           |
252
253    * Value Buffer:
254
255      |Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
256      |------------|-------------|-------------|-------------|-------------|-------------|
257      | 1          | unspecified | 2           | 4           | 8           | unspecified |
258
259**Example Layout: Non-null int32 Array**
260
261``[1, 2, 3, 4, 8]`` has two possible layouts: ::
262
263    * Length: 5, Null count: 0
264    * Validity bitmap buffer:
265
266      | Byte 0 (validity bitmap) | Bytes 1-63            |
267      |--------------------------|-----------------------|
268      | 00011111                 | 0 (padding)           |
269
270    * Value Buffer:
271
272      |Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | bytes 12-15 | bytes 16-19 | Bytes 20-63 |
273      |------------|-------------|-------------|-------------|-------------|-------------|
274      | 1          | 2           | 3           | 4           | 8           | unspecified |
275
276or with the bitmap elided: ::
277
278    * Length 5, Null count: 0
279    * Validity bitmap buffer: Not required
280    * Value Buffer:
281
282      |Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | bytes 12-15 | bytes 16-19 | Bytes 20-63 |
283      |------------|-------------|-------------|-------------|-------------|-------------|
284      | 1          | 2           | 3           | 4           | 8           | unspecified |
285
286Variable-size Binary Layout
287---------------------------
288
289Each value in this layout consists of 0 or more bytes. While primitive
290arrays have a single values buffer, variable-size binary have an
291**offsets** buffer and **data** buffer.
292
293The offsets buffer contains `length + 1` signed integers (either
29432-bit or 64-bit, depending on the logical type), which encode the
295start position of each slot in the data buffer. The length of the
296value in each slot is computed using the difference between the offset
297at that slot's index and the subsequent offset. For example, the
298position and length of slot j is computed as:
299
300::
301
302    slot_position = offsets[j]
303    slot_length = offsets[j + 1] - offsets[j]  // (for 0 <= j < length)
304
305It should be noted that a null value may have a positive slot length.
306That is, a null value may occupy a **non-empty** memory space in the data
307buffer. When this is true, the content of the corresponding memory space
308is undefined.
309
310Generally the first value in the offsets array is 0, and the last slot
311is the length of the values array. When serializing this layout, we
312recommend normalizing the offsets to start at 0.
313
314Variable-size List Layout
315-------------------------
316
317List is a nested type which is semantically similar to variable-size
318binary. It is defined by two buffers, a validity bitmap and an offsets
319buffer, and a child array. The offsets are the same as in the
320variable-size binary case, and both 32-bit and 64-bit signed integer
321offsets are supported options for the offsets. Rather than referencing
322an additional data buffer, instead these offsets reference the child
323array.
324
325Similar to the layout of variable-size binary, a null value may
326correspond to a **non-empty** segment in the child array. When this is
327true, the content of the corresponding segment can be arbitrary.
328
329A list type is specified like ``List<T>``, where ``T`` is any type
330(primitive or nested). In these examples we use 32-bit offsets where
331the 64-bit offset version would be denoted by ``LargeList<T>``.
332
333**Example Layout: ``List<Int8>`` Array**
334
335We illustrate an example of ``List<Int8>`` with length 4 having values::
336
337    [[12, -7, 25], null, [0, -127, 127, 50], []]
338
339will have the following representation: ::
340
341    * Length: 4, Null count: 1
342    * Validity bitmap buffer:
343
344      | Byte 0 (validity bitmap) | Bytes 1-63            |
345      |--------------------------|-----------------------|
346      | 00001101                 | 0 (padding)           |
347
348    * Offsets buffer (int32)
349
350      | Bytes 0-3  | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
351      |------------|-------------|-------------|-------------|-------------|-------------|
352      | 0          | 3           | 3           | 7           | 7           | unspecified |
353
354    * Values array (Int8array):
355      * Length: 7,  Null count: 0
356      * Validity bitmap buffer: Not required
357      * Values buffer (int8)
358
359        | Bytes 0-6                    | Bytes 7-63  |
360        |------------------------------|-------------|
361        | 12, -7, 25, 0, -127, 127, 50 | unspecified |
362
363**Example Layout: ``List<List<Int8>>``**
364
365``[[[1, 2], [3, 4]], [[5, 6, 7], null, [8]], [[9, 10]]]``
366
367will be represented as follows: ::
368
369    * Length 3
370    * Nulls count: 0
371    * Validity bitmap buffer: Not required
372    * Offsets buffer (int32)
373
374      | Bytes 0-3  | Bytes 4-7  | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
375      |------------|------------|------------|-------------|-------------|
376      | 0          |  2         |  5         |  6          | unspecified |
377
378    * Values array (`List<Int8>`)
379      * Length: 6, Null count: 1
380      * Validity bitmap buffer:
381
382        | Byte 0 (validity bitmap) | Bytes 1-63  |
383        |--------------------------|-------------|
384        | 00110111                 | 0 (padding) |
385
386      * Offsets buffer (int32)
387
388        | Bytes 0-27           | Bytes 28-63 |
389        |----------------------|-------------|
390        | 0, 2, 4, 7, 7, 8, 10 | unspecified |
391
392      * Values array (Int8):
393        * Length: 10, Null count: 0
394        * Validity bitmap buffer: Not required
395
396          | Bytes 0-9                     | Bytes 10-63 |
397          |-------------------------------|-------------|
398          | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | unspecified |
399
400Fixed-Size List Layout
401----------------------
402
403Fixed-Size List is a nested type in which each array slot contains a
404fixed-size sequence of values all having the same type.
405
406A fixed size list type is specified like ``FixedSizeList<T>[N]``,
407where ``T`` is any type (primitive or nested) and ``N`` is a 32-bit
408signed integer representing the length of the lists.
409
410A fixed size list array is represented by a values array, which is a
411child array of type T. T may also be a nested type. The value in slot
412``j`` of a fixed size list array is stored in an ``N``-long slice of
413the values array, starting at an offset of ``j * N``.
414
415**Example Layout: ``FixedSizeList<byte>[4]`` Array**
416
417Here we illustrate ``FixedSizeList<byte>[4]``.
418
419For an array of length 4 with respective values: ::
420
421    [[192, 168, 0, 12], null, [192, 168, 0, 25], [192, 168, 0, 1]]
422
423will have the following representation: ::
424
425    * Length: 4, Null count: 1
426    * Validity bitmap buffer:
427
428      | Byte 0 (validity bitmap) | Bytes 1-63            |
429      |--------------------------|-----------------------|
430      | 00001101                 | 0 (padding)           |
431
432    * Values array (byte array):
433      * Length: 16,  Null count: 0
434      * validity bitmap buffer: Not required
435
436        | Bytes 0-3       | Bytes 4-7   | Bytes 8-15                      |
437        |-----------------|-------------|---------------------------------|
438        | 192, 168, 0, 12 | unspecified | 192, 168, 0, 25, 192, 168, 0, 1 |
439
440
441Struct Layout
442-------------
443
444A struct is a nested type parameterized by an ordered sequence of
445types (which can all be distinct), called its fields. Each field must
446have a UTF8-encoded name, and these field names are part of the type
447metadata.
448
449A struct array does not have any additional allocated physical storage
450for its values.  A struct array must still have an allocated validity
451bitmap, if it has one or more null values.
452
453Physically, a struct array has one child array for each field. The
454child arrays are independent and need not be adjacent to each other in
455memory.
456
457For example, the struct (field names shown here as strings for illustration
458purposes)::
459
460    Struct <
461      name: VarBinary
462      age: Int32
463    >
464
465has two child arrays, one ``VarBinary`` array (using variable-size binary
466layout) and one 4-byte primitive value array having ``Int32`` logical
467type.
468
469**Example Layout: ``Struct<VarBinary, Int32>``**
470
471The layout for ``[{'joe', 1}, {null, 2}, null, {'mark', 4}]`` would be: ::
472
473    * Length: 4, Null count: 1
474    * Validity bitmap buffer:
475
476      |Byte 0 (validity bitmap) | Bytes 1-63            |
477      |-------------------------|-----------------------|
478      | 00001011                | 0 (padding)           |
479
480    * Children arrays:
481      * field-0 array (`VarBinary`):
482        * Length: 4, Null count: 2
483        * Validity bitmap buffer:
484
485          | Byte 0 (validity bitmap) | Bytes 1-63            |
486          |--------------------------|-----------------------|
487          | 00001001                 | 0 (padding)           |
488
489        * Offsets buffer:
490
491          | Bytes 0-19     |
492          |----------------|
493          | 0, 3, 3, 3, 7  |
494
495         * Values array:
496            * Length: 7, Null count: 0
497            * Validity bitmap buffer: Not required
498
499            * Value buffer:
500
501              | Bytes 0-6      |
502              |----------------|
503              | joemark        |
504
505      * field-1 array (int32 array):
506        * Length: 4, Null count: 1
507        * Validity bitmap buffer:
508
509          | Byte 0 (validity bitmap) | Bytes 1-63            |
510          |--------------------------|-----------------------|
511          | 00001011                 | 0 (padding)           |
512
513        * Value Buffer:
514
515          |Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-63 |
516          |------------|-------------|-------------|-------------|-------------|
517          | 1          | 2           | unspecified | 4           | unspecified |
518
519While a struct does not have physical storage for each of its semantic
520slots (i.e. each scalar C-like struct), an entire struct slot can be
521set to null via the validity bitmap. Any of the child field arrays can
522have null values according to their respective independent validity
523bitmaps. This implies that for a particular struct slot the validity
524bitmap for the struct array might indicate a null slot when one or
525more of its child arrays has a non-null value in their corresponding
526slot.  When reading the struct array the parent validity bitmap takes
527priority.  This is illustrated in the example above, the child arrays
528have valid entries for the null struct but are 'hidden' from the
529consumer by the parent array's validity bitmap.  However, when treated
530independently corresponding values of the children array will be
531non-null.
532
533Union Layout
534------------
535
536A union is defined by an ordered sequence of types; each slot in the
537union can have a value chosen from these types. The types are named
538like a struct's fields, and the names are part of the type metadata.
539
540Unlike other data types, unions do not have their own validity bitmap. Instead,
541the nullness of each slot is determined exclusively by the child arrays which
542are composed to create the union.
543
544We define two distinct union types, "dense" and "sparse", that are
545optimized for different use cases.
546
547Dense Union
548~~~~~~~~~~~
549
550Dense union represents a mixed-type array with 5 bytes of overhead for
551each value. Its physical layout is as follows:
552
553* One child array for each type
554* Types buffer: A buffer of 8-bit signed integers. Each type in the
555  union has a corresponding type id whose values are found in this
556  buffer. A union with more than 127 possible types can be modeled as
557  a union of unions.
558* Offsets buffer: A buffer of signed int32 values indicating the
559  relative offset into the respective child array for the type in a
560  given slot. The respective offsets for each child value array must
561  be in order / increasing.
562
563Critically, the dense union allows for minimal overhead in the ubiquitous
564union-of-structs with non-overlapping-fields use case (``Union<s1: Struct1, s2:
565Struct2, s3: Struct3, ...>``)
566
567**Example Layout: Dense union**
568
569An example layout for logical union of: ``Union<f: float, i: int32>``
570having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]``
571
572::
573
574    * Length: 4, Null count: 0
575    * Types buffer:
576
577      |Byte 0   | Byte 1      | Byte 2   | Byte 3   | Bytes 4-63  |
578      |---------|-------------|----------|----------|-------------|
579      | 0       | 0           | 0        | 1        | unspecified |
580
581    * Offset buffer:
582
583      |Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
584      |----------|-------------|------------|-------------|-------------|
585      | 0        | 1           | 2          | 0           | unspecified |
586
587    * Children arrays:
588      * Field-0 array (f: float):
589        * Length: 2, Null count: 1
590        * Validity bitmap buffer: 00000101
591
592        * Value Buffer:
593
594          | Bytes 0-11     | Bytes 12-63  |
595          |----------------|-------------|
596          | 1.2, null, 3.4 | unspecified |
597
598
599      * Field-1 array (i: int32):
600        * Length: 1, Null count: 0
601        * Validity bitmap buffer: Not required
602
603        * Value Buffer:
604
605          | Bytes 0-3 | Bytes 4-63  |
606          |-----------|-------------|
607          | 5         | unspecified |
608
609Sparse Union
610~~~~~~~~~~~~
611
612A sparse union has the same structure as a dense union, with the omission of
613the offsets array. In this case, the child arrays are each equal in length to
614the length of the union.
615
616While a sparse union may use significantly more space compared with a
617dense union, it has some advantages that may be desirable in certain
618use cases:
619
620* A sparse union is more amenable to vectorized expression evaluation in some use cases.
621* Equal-length arrays can be interpreted as a union by only defining the types array.
622
623**Example layout: ``SparseUnion<u0: Int32, u1: Float, u2: VarBinary>``**
624
625For the union array: ::
626
627    [{u0=5}, {u1=1.2}, {u2='joe'}, {u1=3.4}, {u0=4}, {u2='mark'}]
628
629will have the following layout: ::
630
631    * Length: 6, Null count: 0
632    * Types buffer:
633
634     | Byte 0     | Byte 1      | Byte 2      | Byte 3      | Byte 4      | Byte 5       | Bytes  6-63           |
635     |------------|-------------|-------------|-------------|-------------|--------------|-----------------------|
636     | 0          | 1           | 2           | 1           | 0           | 2            | unspecified (padding) |
637
638    * Children arrays:
639
640      * u0 (Int32):
641        * Length: 6, Null count: 4
642        * Validity bitmap buffer:
643
644          |Byte 0 (validity bitmap) | Bytes 1-63            |
645          |-------------------------|-----------------------|
646          |00010001                 | 0 (padding)           |
647
648        * Value buffer:
649
650          |Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-23  | Bytes 24-63           |
651          |------------|-------------|-------------|-------------|-------------|--------------|-----------------------|
652          | 5          | unspecified | unspecified | unspecified | 4           |  unspecified | unspecified (padding) |
653
654      * u1 (float):
655        * Length: 6, Null count: 4
656        * Validity bitmap buffer:
657
658          |Byte 0 (validity bitmap) | Bytes 1-63            |
659          |-------------------------|-----------------------|
660          | 00001010                | 0 (padding)           |
661
662        * Value buffer:
663
664          |Bytes 0-3    | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-23  | Bytes 24-63           |
665          |-------------|-------------|-------------|-------------|-------------|--------------|-----------------------|
666          | unspecified |  1.2        | unspecified | 3.4         | unspecified |  unspecified | unspecified (padding) |
667
668      * u2 (`VarBinary`)
669        * Length: 6, Null count: 4
670        * Validity bitmap buffer:
671
672          | Byte 0 (validity bitmap) | Bytes 1-63            |
673          |--------------------------|-----------------------|
674          | 00100100                 | 0 (padding)           |
675
676        * Offsets buffer (int32)
677
678          | Bytes 0-3  | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-27 | Bytes 28-63 |
679          |------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
680          | 0          | 0           | 0           | 3           | 3           | 3           | 7           | unspecified |
681
682        * Values array (VarBinary):
683          * Length: 7,  Null count: 0
684          * Validity bitmap buffer: Not required
685
686            | Bytes 0-6  | Bytes 7-63            |
687            |------------|-----------------------|
688            | joemark    | unspecified (padding) |
689
690Only the slot in the array corresponding to the type index is considered. All
691"unselected" values are ignored and could be any semantically correct array
692value.
693
694Null Layout
695-----------
696
697We provide a simplified memory-efficient layout for the Null data type
698where all values are null. In this case no memory buffers are
699allocated.
700
701.. _dictionary-encoded-layout:
702
703Dictionary-encoded Layout
704-------------------------
705
706Dictionary encoding is a data representation technique to represent
707values by integers referencing a **dictionary** usually consisting of
708unique values. It can be effective when you have data with many
709repeated values.
710
711Any array can be dictionary-encoded. The dictionary is stored as an optional
712property of an array. When a field is dictionary encoded, the values are
713represented by an array of non-negative integers representing the index of the
714value in the dictionary. The memory layout for a dictionary-encoded array is
715the same as that of a primitive integer layout. The dictionary is handled as a
716separate columnar array with its own respective layout.
717
718As an example, you could have the following data: ::
719
720    type: VarBinary
721
722    ['foo', 'bar', 'foo', 'bar', null, 'baz']
723
724In dictionary-encoded form, this could appear as:
725
726::
727
728    data VarBinary (dictionary-encoded)
729       index_type: Int32
730       values: [0, 1, 0, 1, null, 2]
731
732    dictionary
733       type: VarBinary
734       values: ['foo', 'bar', 'baz']
735
736Note that a dictionary is permitted to contain duplicate values or
737nulls:
738
739::
740
741    data VarBinary (dictionary-encoded)
742       index_type: Int32
743       values: [0, 1, 3, 1, 4, 2]
744
745    dictionary
746       type: VarBinary
747       values: ['foo', 'bar', 'baz', 'foo', null]
748
749The null count of such arrays is dictated only by the validity bitmap
750of its indices, irrespective of any null values in the dictionary.
751
752Since unsigned integers can be more difficult to work with in some cases
753(e.g. in the JVM), we recommend preferring signed integers over unsigned
754integers for representing dictionary indices. Additionally, we recommend
755avoiding using 64-bit unsigned integer indices unless they are required by an
756application.
757
758We discuss dictionary encoding as it relates to serialization further
759below.
760
761Buffer Listing for Each Layout
762------------------------------
763
764For the avoidance of ambiguity, we provide listing the order and type
765of memory buffers for each layout.
766
767.. csv-table:: Buffer Layouts
768   :header: "Layout Type", "Buffer 0", "Buffer 1", "Buffer 2"
769   :widths: 30, 20, 20, 20
770
771   "Primitive",validity,data,
772   "Variable Binary",validity,offsets,data
773   "List",validity,offsets,
774   "Fixed-size List",validity,,
775   "Struct",validity,,
776   "Sparse Union",type ids,,
777   "Dense Union",type ids,offsets,
778   "Null",,,
779   "Dictionary-encoded",validity,data (indices),
780
781Logical Types
782=============
783
784The `Schema.fbs`_ defines built-in logical types supported by the
785Arrow columnar format. Each logical type uses one of the above
786physical layouts. Nested logical types may have different physical
787layouts depending on the particular realization of the type.
788
789We do not go into detail about the logical types definitions in this
790document as we consider `Schema.fbs`_ to be authoritative.
791
792.. _format-ipc:
793
794Serialization and Interprocess Communication (IPC)
795==================================================
796
797The primitive unit of serialized data in the columnar format is the
798"record batch". Semantically, a record batch is an ordered collection
799of arrays, known as its **fields**, each having the same length as one
800another but potentially different data types. A record batch's field
801names and types collectively form the batch's **schema**.
802
803In this section we define a protocol for serializing record batches
804into a stream of binary payloads and reconstructing record batches
805from these payloads without need for memory copying.
806
807The columnar IPC protocol utilizes a one-way stream of binary messages
808of these types:
809
810* Schema
811* RecordBatch
812* DictionaryBatch
813
814We specify a so-called *encapsulated IPC message* format which
815includes a serialized Flatbuffer type along with an optional message
816body. We define this message format before describing how to serialize
817each constituent IPC message type.
818
819Encapsulated message format
820---------------------------
821
822For simple streaming and file-based serialization, we define a
823"encapsulated" message format for interprocess communication. Such
824messages can be "deserialized" into in-memory Arrow array objects by
825examining only the message metadata without any need to copy or move
826any of the actual data.
827
828The encapsulated binary message format is as follows:
829
830* A 32-bit continuation indicator. The value ``0xFFFFFFFF`` indicates
831  a valid message. This component was introduced in version 0.15.0 in
832  part to address the 8-byte alignment requirement of Flatbuffers
833* A 32-bit little-endian length prefix indicating the metadata size
834* The message metadata as using the ``Message`` type defined in
835  `Message.fbs`_
836* Padding bytes to an 8-byte boundary
837* The message body, whose length must be a multiple of 8 bytes
838
839Schematically, we have: ::
840
841    <continuation: 0xFFFFFFFF>
842    <metadata_size: int32>
843    <metadata_flatbuffer: bytes>
844    <padding>
845    <message body>
846
847The complete serialized message must be a multiple of 8 bytes so that messages
848can be relocated between streams. Otherwise the amount of padding between the
849metadata and the message body could be non-deterministic.
850
851The ``metadata_size`` includes the size of the ``Message`` plus
852padding. The ``metadata_flatbuffer`` contains a serialized ``Message``
853Flatbuffer value, which internally includes:
854
855* A version number
856* A particular message value (one of ``Schema``, ``RecordBatch``, or
857  ``DictionaryBatch``)
858* The size of the message body
859* A ``custom_metadata`` field for any application-supplied metadata
860
861When read from an input stream, generally the ``Message`` metadata is
862initially parsed and validated to obtain the body size. Then the body
863can be read.
864
865Schema message
866--------------
867
868The Flatbuffers files `Schema.fbs`_ contains the definitions for all
869built-in logical data types and the ``Schema`` metadata type which
870represents the schema of a given record batch. A schema consists of
871an ordered sequence of fields, each having a name and type. A
872serialized ``Schema`` does not contain any data buffers, only type
873metadata.
874
875The ``Field`` Flatbuffers type contains the metadata for a single
876array. This includes:
877
878* The field's name
879* The field's logical type
880* Whether the field is semantically nullable. While this has no
881  bearing on the array's physical layout, many systems distinguish
882  nullable and non-nullable fields and we want to allow them to
883  preserve this metadata to enable faithful schema round trips.
884* A collection of child ``Field`` values, for nested types
885* A ``dictionary`` property indicating whether the field is
886  dictionary-encoded or not. If it is, a dictionary "id" is assigned
887  to allow matching a subsequent dictionary IPC message with the
888  appropriate field.
889
890We additionally provide both schema-level and field-level
891``custom_metadata`` attributes allowing for systems to insert their
892own application defined metadata to customize behavior.
893
894RecordBatch message
895-------------------
896
897A RecordBatch message contains the actual data buffers corresponding
898to the physical memory layout determined by a schema. The metadata for
899this message provides the location and size of each buffer, permitting
900Array data structures to be reconstructed using pointer arithmetic and
901thus no memory copying.
902
903The serialized form of the record batch is the following:
904
905* The ``data header``, defined as the ``RecordBatch`` type in
906  `Message.fbs`_.
907* The ``body``, a flat sequence of memory buffers written end-to-end
908  with appropriate padding to ensure a minimum of 8-byte alignment
909
910The data header contains the following:
911
912* The length and null count for each flattened field in the record
913  batch
914* The memory offset and length of each constituent ``Buffer`` in the
915  record batch's body
916
917Fields and buffers are flattened by a pre-order depth-first traversal
918of the fields in the record batch. For example, let's consider the
919schema ::
920
921    col1: Struct<a: Int32, b: List<item: Int64>, c: Float64>
922    col2: Utf8
923
924The flattened version of this is: ::
925
926    FieldNode 0: Struct name='col1'
927    FieldNode 1: Int32 name='a'
928    FieldNode 2: List name='b'
929    FieldNode 3: Int64 name='item'
930    FieldNode 4: Float64 name='c'
931    FieldNode 5: Utf8 name='col2'
932
933For the buffers produced, we would have the following (refer to the
934table above): ::
935
936    buffer 0: field 0 validity
937    buffer 1: field 1 validity
938    buffer 2: field 1 values
939    buffer 3: field 2 validity
940    buffer 4: field 2 offsets
941    buffer 5: field 3 validity
942    buffer 6: field 3 values
943    buffer 7: field 4 validity
944    buffer 8: field 4 values
945    buffer 9: field 5 validity
946    buffer 10: field 5 offsets
947    buffer 11: field 5 data
948
949The ``Buffer`` Flatbuffers value describes the location and size of a
950piece of memory. Generally these are interpreted relative to the
951**encapsulated message format** defined below.
952
953The ``size`` field of ``Buffer`` is not required to account for padding
954bytes. Since this metadata can be used to communicate in-memory pointer
955addresses between libraries, it is recommended to set ``size`` to the actual
956memory size rather than the padded size.
957
958Byte Order (`Endianness`_)
959---------------------------
960
961The Arrow format is little endian by default.
962
963Serialized Schema metadata has an endianness field indicating
964endianness of RecordBatches. Typically this is the endianness of the
965system where the RecordBatch was generated. The main use case is
966exchanging RecordBatches between systems with the same Endianness.  At
967first we will return an error when trying to read a Schema with an
968endianness that does not match the underlying system. The reference
969implementation is focused on Little Endian and provides tests for
970it. Eventually we may provide automatic conversion via byte swapping.
971
972IPC Streaming Format
973--------------------
974
975We provide a streaming protocol or "format" for record batches. It is
976presented as a sequence of encapsulated messages, each of which
977follows the format above. The schema comes first in the stream, and it
978is the same for all of the record batches that follow. If any fields
979in the schema are dictionary-encoded, one or more ``DictionaryBatch``
980messages will be included. ``DictionaryBatch`` and ``RecordBatch``
981messages may be interleaved, but before any dictionary key is used in
982a ``RecordBatch`` it should be defined in a ``DictionaryBatch``. ::
983
984    <SCHEMA>
985    <DICTIONARY 0>
986    ...
987    <DICTIONARY k - 1>
988    <RECORD BATCH 0>
989    ...
990    <DICTIONARY x DELTA>
991    ...
992    <DICTIONARY y DELTA>
993    ...
994    <RECORD BATCH n - 1>
995    <EOS [optional]: 0xFFFFFFFF 0x00000000>
996
997.. note:: An edge-case for interleaved dictionary and record batches occurs
998   when the record batches contain dictionary encoded arrays that are
999   completely null. In this case, the dictionary for the encoded column might
1000   appear after the first record batch.
1001
1002When a stream reader implementation is reading a stream, after each
1003message, it may read the next 8 bytes to determine both if the stream
1004continues and the size of the message metadata that follows. Once the
1005message flatbuffer is read, you can then read the message body.
1006
1007The stream writer can signal end-of-stream (EOS) either by writing 8 bytes
1008containing the 4-byte continuation indicator (``0xFFFFFFFF``) followed by 0
1009metadata length (``0x00000000``) or closing the stream interface.
1010
1011IPC File Format
1012---------------
1013
1014We define a "file format" supporting random access that is build with
1015the stream format. The file starts and ends with a magic string
1016``ARROW1`` (plus padding). What follows in the file is identical to
1017the stream format. At the end of the file, we write a *footer*
1018containing a redundant copy of the schema (which is a part of the
1019streaming format) plus memory offsets and sizes for each of the data
1020blocks in the file. This enables random access any record batch in the
1021file. See ``File.fbs`` for the precise details of the file footer.
1022
1023Schematically we have: ::
1024
1025    <magic number "ARROW1">
1026    <empty padding bytes [to 8 byte boundary]>
1027    <STREAMING FORMAT with EOS>
1028    <FOOTER>
1029    <FOOTER SIZE: int32>
1030    <magic number "ARROW1">
1031
1032In the file format, there is no requirement that dictionary keys
1033should be defined in a ``DictionaryBatch`` before they are used in a
1034``RecordBatch``, as long as the keys are defined somewhere in the
1035file. Further more, it is invalid to have more than one **non-delta**
1036dictionary batch per dictionary ID (i.e. dictionary replacement is not
1037supported).  Delta dictionaries are applied in the order they appear in
1038the file footer.
1039
1040Dictionary Messages
1041-------------------
1042
1043Dictionaries are written in the stream and file formats as a sequence of record
1044batches, each having a single field. The complete semantic schema for a
1045sequence of record batches, therefore, consists of the schema along with all of
1046the dictionaries. The dictionary types are found in the schema, so it is
1047necessary to read the schema to first determine the dictionary types so that
1048the dictionaries can be properly interpreted: ::
1049
1050    table DictionaryBatch {
1051      id: long;
1052      data: RecordBatch;
1053      isDelta: boolean = false;
1054    }
1055
1056The dictionary ``id`` in the message metadata can be referenced one or more times
1057in the schema, so that dictionaries can even be used for multiple fields. See
1058the :ref:`dictionary-encoded-layout` section for more about the semantics of
1059dictionary-encoded data.
1060
1061The dictionary ``isDelta`` flag allows existing dictionaries to be
1062expanded for future record batch materializations. A dictionary batch
1063with ``isDelta`` set indicates that its vector should be concatenated
1064with those of any previous batches with the same ``id``. In a stream
1065which encodes one column, the list of strings ``["A", "B", "C", "B",
1066"D", "C", "E", "A"]``, with a delta dictionary batch could take the
1067form: ::
1068
1069    <SCHEMA>
1070    <DICTIONARY 0>
1071    (0) "A"
1072    (1) "B"
1073    (2) "C"
1074
1075    <RECORD BATCH 0>
1076    0
1077    1
1078    2
1079    1
1080
1081    <DICTIONARY 0 DELTA>
1082    (3) "D"
1083    (4) "E"
1084
1085    <RECORD BATCH 1>
1086    3
1087    2
1088    4
1089    0
1090    EOS
1091
1092Alternatively, if ``isDelta`` is set to false, then the dictionary
1093replaces the existing dictionary for the same ID.  Using the same
1094example as above, an alternate encoding could be: ::
1095
1096
1097    <SCHEMA>
1098    <DICTIONARY 0>
1099    (0) "A"
1100    (1) "B"
1101    (2) "C"
1102
1103    <RECORD BATCH 0>
1104    0
1105    1
1106    2
1107    1
1108
1109    <DICTIONARY 0>
1110    (0) "A"
1111    (1) "C"
1112    (2) "D"
1113    (3) "E"
1114
1115    <RECORD BATCH 1>
1116    2
1117    1
1118    3
1119    0
1120    EOS
1121
1122
1123Custom Application Metadata
1124---------------------------
1125
1126We provide a ``custom_metadata`` field at three levels to provide a
1127mechanism for developers to pass application-specific metadata in
1128Arrow protocol messages. This includes ``Field``, ``Schema``, and
1129``Message``.
1130
1131The colon symbol ``:`` is to be used as a namespace separator. It can
1132be used multiple times in a key.
1133
1134The ``ARROW`` pattern is a reserved namespace for internal Arrow use
1135in the ``custom_metadata`` fields. For example,
1136``ARROW:extension:name``.
1137
1138.. _format_metadata_extension_types:
1139
1140Extension Types
1141---------------
1142
1143User-defined "extension" types can be defined setting certain
1144``KeyValue`` pairs in ``custom_metadata`` in the ``Field`` metadata
1145structure. These extension keys are:
1146
1147* ``'ARROW:extension:name'`` for the string name identifying the
1148  custom data type. We recommend that you use a "namespace"-style
1149  prefix for extension type names to minimize the possibility of
1150  conflicts with multiple Arrow readers and writers in the same
1151  application. For example, use ``myorg.name_of_type`` instead of
1152  simply ``name_of_type``
1153* ``'ARROW:extension:metadata'`` for a serialized representation
1154  of the ``ExtensionType`` necessary to reconstruct the custom type
1155
1156This extension metadata can annotate any of the built-in Arrow logical
1157types. The intent is that an implementation that does not support an
1158extension type can still handle the underlying data. For example a
115916-byte UUID value could be embedded in ``FixedSizeBinary(16)``, and
1160implementations that do not have this extension type can still work
1161with the underlying binary values and pass along the
1162``custom_metadata`` in subsequent Arrow protocol messages.
1163
1164Extension types may or may not use the
1165``'ARROW:extension:metadata'`` field. Let's consider some example
1166extension types:
1167
1168* ``uuid`` represented as ``FixedSizeBinary(16)`` with empty metadata
1169* ``latitude-longitude`` represented as ``struct<latitude: double,
1170  longitude: double>``, and empty metadata
1171* ``tensor`` (multidimensional array) stored as ``Binary`` values and
1172  having serialized metadata indicating the data type and shape of
1173  each value. This could be JSON like ``{'type': 'int8', 'shape': [4,
1174  5]}`` for a 4x5 cell tensor.
1175* ``trading-time`` represented as ``Timestamp`` with serialized
1176  metadata indicating the market trading calendar the data corresponds
1177  to
1178
1179Implementation guidelines
1180=========================
1181
1182An execution engine (or framework, or UDF executor, or storage engine,
1183etc) can implement only a subset of the Arrow spec and/or extend it
1184given the following constraints:
1185
1186Implementing a subset the spec
1187------------------------------
1188
1189* **If only producing (and not consuming) arrow vectors**: Any subset
1190  of the vector spec and the corresponding metadata can be implemented.
1191* **If consuming and producing vectors**: There is a minimal subset of
1192  vectors to be supported.  Production of a subset of vectors and
1193  their corresponding metadata is always fine.  Consumption of vectors
1194  should at least convert the unsupported input vectors to the
1195  supported subset (for example Timestamp.millis to timestamp.micros
1196  or int32 to int64).
1197
1198Extensibility
1199-------------
1200
1201An execution engine implementor can also extend their memory
1202representation with their own vectors internally as long as they are
1203never exposed. Before sending data to another system expecting Arrow
1204data, these custom vectors should be converted to a type that exist in
1205the Arrow spec.
1206
1207References
1208----------
1209* Apache Drill Documentation - `Value Vectors`_
1210
1211.. _Flatbuffers: http://github.com/google/flatbuffers
1212.. _Flatbuffers protocol definition files: https://github.com/apache/arrow/tree/master/format
1213.. _Schema.fbs: https://github.com/apache/arrow/blob/master/format/Schema.fbs
1214.. _Message.fbs: https://github.com/apache/arrow/blob/master/format/Message.fbs
1215.. _least-significant bit (LSB) numbering: https://en.wikipedia.org/wiki/Bit_numbering
1216.. _Intel performance guide: https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors
1217.. _Endianness: https://en.wikipedia.org/wiki/Endianness
1218.. _SIMD: https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-introduction-to-the-simd-data-layout-templates
1219.. _Parquet: https://parquet.apache.org/documentation/latest/
1220.. _Value Vectors: https://drill.apache.org/docs/value-vectors/
1221