1.. Licensed to the Apache Software Foundation (ASF) under one 2.. or more contributor license agreements. See the NOTICE file 3.. distributed with this work for additional information 4.. regarding copyright ownership. The ASF licenses this file 5.. to you under the Apache License, Version 2.0 (the 6.. "License"); you may not use this file except in compliance 7.. with the License. You may obtain a copy of the License at 8 9.. http://www.apache.org/licenses/LICENSE-2.0 10 11.. Unless required by applicable law or agreed to in writing, 12.. software distributed under the License is distributed on an 13.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 14.. KIND, either express or implied. See the License for the 15.. specific language governing permissions and limitations 16.. under the License. 17 18.. _format_columnar: 19 20********************* 21Arrow Columnar Format 22********************* 23 24*Version: 1.0* 25 26The "Arrow Columnar Format" includes a language-agnostic in-memory 27data structure specification, metadata serialization, and a protocol 28for serialization and generic data transport. 29 30This document is intended to provide adequate detail to create a new 31implementation of the columnar format without the aid of an existing 32implementation. We utilize Google's `Flatbuffers`_ project for 33metadata serialization, so it will be necessary to refer to the 34project's `Flatbuffers protocol definition files`_ 35while reading this document. 36 37The columnar format has some key features: 38 39* Data adjacency for sequential access (scans) 40* O(1) (constant-time) random access 41* SIMD and vectorization-friendly 42* Relocatable without "pointer swizzling", allowing for true zero-copy 43 access in shared memory 44 45The Arrow columnar format provides analytical performance and data 46locality guarantees in exchange for comparatively more expensive 47mutation operations. This document is concerned only with in-memory 48data representation and serialization details; issues such as 49coordinating mutation of data structures are left to be handled by 50implementations. 51 52Terminology 53=========== 54 55Since different projects have used different words to describe various 56concepts, here is a small glossary to help disambiguate. 57 58* **Array** or **Vector**: a sequence of values with known length all 59 having the same type. These terms are used interchangeably in 60 different Arrow implementations, but we use "array" in this 61 document. 62* **Slot**: a single logical value in an array of some particular data type 63* **Buffer** or **Contiguous memory region**: a sequential virtual 64 address space with a given length. Any byte can be reached via a 65 single pointer offset less than the region's length. 66* **Physical Layout**: The underlying memory layout for an array 67 without taking into account any value semantics. For example, a 68 32-bit signed integer array and 32-bit floating point array have the 69 same layout. 70* **Parent** and **child arrays**: names to express relationships 71 between physical value arrays in a nested type structure. For 72 example, a ``List<T>``-type parent array has a T-type array as its 73 child (see more on lists below). 74* **Primitive type**: a data type having no child types. This includes 75 such types as fixed bit-width, variable-size binary, and null types. 76* **Nested type**: a data type whose full structure depends on one or 77 more other child types. Two fully-specified nested types are equal 78 if and only if their child types are equal. For example, ``List<U>`` 79 is distinct from ``List<V>`` iff U and V are different types. 80* **Logical type**: An application-facing semantic value type that is 81 implemented using some physical layout. For example, Decimal 82 values are stored as 16 bytes in a fixed-size binary 83 layout. Similarly, strings can be stored as ``List<1-byte>``. A 84 timestamp may be stored as 64-bit fixed-size layout. 85 86Physical Memory Layout 87====================== 88 89Arrays are defined by a few pieces of metadata and data: 90 91* A logical data type. 92* A sequence of buffers. 93* A length as a 64-bit signed integer. Implementations are permitted 94 to be limited to 32-bit lengths, see more on this below. 95* A null count as a 64-bit signed integer. 96* An optional **dictionary**, for dictionary-encoded arrays. 97 98Nested arrays additionally have a sequence of one or more sets of 99these items, called the **child arrays**. 100 101Each logical data type has a well-defined physical layout. Here are 102the different physical layouts defined by Arrow: 103 104* **Primitive (fixed-size)**: a sequence of values each having the 105 same byte or bit width 106* **Variable-size Binary**: a sequence of values each having a variable 107 byte length. Two variants of this layout are supported using 32-bit 108 and 64-bit length encoding. 109* **Fixed-size List**: a nested layout where each value has the same 110 number of elements taken from a child data type. 111* **Variable-size List**: a nested layout where each value is a 112 variable-length sequence of values taken from a child data type. Two 113 variants of this layout are supported using 32-bit and 64-bit length 114 encoding. 115* **Struct**: a nested layout consisting of a collection of named 116 child **fields** each having the same length but possibly different 117 types. 118* **Sparse** and **Dense Union**: a nested layout representing a 119 sequence of values, each of which can have type chosen from a 120 collection of child array types. 121* **Null**: a sequence of all null values, having null logical type 122 123The Arrow columnar memory layout only applies to *data* and not 124*metadata*. Implementations are free to represent metadata in-memory 125in whichever form is convenient for them. We handle metadata 126**serialization** in an implementation-independent way using 127`Flatbuffers`_, detailed below. 128 129Buffer Alignment and Padding 130---------------------------- 131 132Implementations are recommended to allocate memory on aligned 133addresses (multiple of 8- or 64-bytes) and pad (overallocate) to a 134length that is a multiple of 8 or 64 bytes. When serializing Arrow 135data for interprocess communication, these alignment and padding 136requirements are enforced. If possible, we suggest that you prefer 137using 64-byte alignment and padding. Unless otherwise noted, padded 138bytes do not need to have a specific value. 139 140The alignment requirement follows best practices for optimized memory 141access: 142 143* Elements in numeric arrays will be guaranteed to be retrieved via aligned access. 144* On some architectures alignment can help limit partially used cache lines. 145 146The recommendation for 64 byte alignment comes from the `Intel 147performance guide`_ that recommends alignment of memory to match SIMD 148register width. The specific padding length was chosen because it 149matches the largest SIMD instruction registers available on widely 150deployed x86 architecture (Intel AVX-512). 151 152The recommended padding of 64 bytes allows for using `SIMD`_ 153instructions consistently in loops without additional conditional 154checks. This should allow for simpler, efficient and CPU 155cache-friendly code. In other words, we can load the entire 64-byte 156buffer into a 512-bit wide SIMD register and get data-level 157parallelism on all the columnar values packed into the 64-byte 158buffer. Guaranteed padding can also allow certain compilers to 159generate more optimized code directly (e.g. One can safely use Intel's 160``-qopt-assume-safe-padding``). 161 162Array lengths 163------------- 164 165Array lengths are represented in the Arrow metadata as a 64-bit signed 166integer. An implementation of Arrow is considered valid even if it only 167supports lengths up to the maximum 32-bit signed integer, though. If using 168Arrow in a multi-language environment, we recommend limiting lengths to 1692 :sup:`31` - 1 elements or less. Larger data sets can be represented using 170multiple array chunks. 171 172Null count 173---------- 174 175The number of null value slots is a property of the physical array and 176considered part of the data structure. The null count is represented 177in the Arrow metadata as a 64-bit signed integer, as it may be as 178large as the array length. 179 180Validity bitmaps 181---------------- 182 183Any value in an array may be semantically null, whether primitive or nested 184type. 185 186All array types, with the exception of union types (more on these later), 187utilize a dedicated memory buffer, known as the validity (or "null") bitmap, to 188encode the nullness or non-nullness of each value slot. The validity bitmap 189must be large enough to have at least 1 bit for each array slot. 190 191Whether any array slot is valid (non-null) is encoded in the respective bits of 192this bitmap. A 1 (set bit) for index ``j`` indicates that the value is not null, 193while a 0 (bit not set) indicates that it is null. Bitmaps are to be 194initialized to be all unset at allocation time (this includes padding): :: 195 196 is_valid[j] -> bitmap[j / 8] & (1 << (j % 8)) 197 198We use `least-significant bit (LSB) numbering`_ (also known as 199bit-endianness). This means that within a group of 8 bits, we read 200right-to-left: :: 201 202 values = [0, 1, null, 2, null, 3] 203 204 bitmap 205 j mod 8 7 6 5 4 3 2 1 0 206 0 0 1 0 1 0 1 1 207 208Arrays having a 0 null count may choose to not allocate the validity 209bitmap. Implementations may choose to always allocate one anyway as a 210matter of convenience, but this should be noted when memory is being 211shared. 212 213Nested type arrays except for union types have their own validity bitmap and 214null count regardless of the null count and valid bits of their child arrays. 215 216Array slots which are null are not required to have a particular 217value; any "masked" memory can have any value and need not be zeroed, 218though implementations frequently choose to zero memory for null 219values. 220 221Fixed-size Primitive Layout 222--------------------------- 223 224A primitive value array represents an array of values each having the 225same physical slot width typically measured in bytes, though the spec 226also provides for bit-packed types (e.g. boolean values encoded in 227bits). 228 229Internally, the array contains a contiguous memory buffer whose total 230size is at least as large as the slot width multiplied by the array 231length. For bit-packed types, the size is rounded up to the nearest 232byte. 233 234The associated validity bitmap is contiguously allocated (as described 235above) but does not need to be adjacent in memory to the values 236buffer. 237 238**Example Layout: Int32 Array** 239 240For example a primitive array of int32s: :: 241 242 [1, null, 2, 4, 8] 243 244Would look like: :: 245 246 * Length: 5, Null count: 1 247 * Validity bitmap buffer: 248 249 |Byte 0 (validity bitmap) | Bytes 1-63 | 250 |-------------------------|-----------------------| 251 | 00011101 | 0 (padding) | 252 253 * Value Buffer: 254 255 |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 | 256 |------------|-------------|-------------|-------------|-------------|-------------| 257 | 1 | unspecified | 2 | 4 | 8 | unspecified | 258 259**Example Layout: Non-null int32 Array** 260 261``[1, 2, 3, 4, 8]`` has two possible layouts: :: 262 263 * Length: 5, Null count: 0 264 * Validity bitmap buffer: 265 266 | Byte 0 (validity bitmap) | Bytes 1-63 | 267 |--------------------------|-----------------------| 268 | 00011111 | 0 (padding) | 269 270 * Value Buffer: 271 272 |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | bytes 12-15 | bytes 16-19 | Bytes 20-63 | 273 |------------|-------------|-------------|-------------|-------------|-------------| 274 | 1 | 2 | 3 | 4 | 8 | unspecified | 275 276or with the bitmap elided: :: 277 278 * Length 5, Null count: 0 279 * Validity bitmap buffer: Not required 280 * Value Buffer: 281 282 |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | bytes 12-15 | bytes 16-19 | Bytes 20-63 | 283 |------------|-------------|-------------|-------------|-------------|-------------| 284 | 1 | 2 | 3 | 4 | 8 | unspecified | 285 286Variable-size Binary Layout 287--------------------------- 288 289Each value in this layout consists of 0 or more bytes. While primitive 290arrays have a single values buffer, variable-size binary have an 291**offsets** buffer and **data** buffer. 292 293The offsets buffer contains `length + 1` signed integers (either 29432-bit or 64-bit, depending on the logical type), which encode the 295start position of each slot in the data buffer. The length of the 296value in each slot is computed using the difference between the offset 297at that slot's index and the subsequent offset. For example, the 298position and length of slot j is computed as: 299 300:: 301 302 slot_position = offsets[j] 303 slot_length = offsets[j + 1] - offsets[j] // (for 0 <= j < length) 304 305It should be noted that a null value may have a positive slot length. 306That is, a null value may occupy a **non-empty** memory space in the data 307buffer. When this is true, the content of the corresponding memory space 308is undefined. 309 310Generally the first value in the offsets array is 0, and the last slot 311is the length of the values array. When serializing this layout, we 312recommend normalizing the offsets to start at 0. 313 314Variable-size List Layout 315------------------------- 316 317List is a nested type which is semantically similar to variable-size 318binary. It is defined by two buffers, a validity bitmap and an offsets 319buffer, and a child array. The offsets are the same as in the 320variable-size binary case, and both 32-bit and 64-bit signed integer 321offsets are supported options for the offsets. Rather than referencing 322an additional data buffer, instead these offsets reference the child 323array. 324 325Similar to the layout of variable-size binary, a null value may 326correspond to a **non-empty** segment in the child array. When this is 327true, the content of the corresponding segment can be arbitrary. 328 329A list type is specified like ``List<T>``, where ``T`` is any type 330(primitive or nested). In these examples we use 32-bit offsets where 331the 64-bit offset version would be denoted by ``LargeList<T>``. 332 333**Example Layout: ``List<Int8>`` Array** 334 335We illustrate an example of ``List<Int8>`` with length 4 having values:: 336 337 [[12, -7, 25], null, [0, -127, 127, 50], []] 338 339will have the following representation: :: 340 341 * Length: 4, Null count: 1 342 * Validity bitmap buffer: 343 344 | Byte 0 (validity bitmap) | Bytes 1-63 | 345 |--------------------------|-----------------------| 346 | 00001101 | 0 (padding) | 347 348 * Offsets buffer (int32) 349 350 | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 | 351 |------------|-------------|-------------|-------------|-------------|-------------| 352 | 0 | 3 | 3 | 7 | 7 | unspecified | 353 354 * Values array (Int8array): 355 * Length: 7, Null count: 0 356 * Validity bitmap buffer: Not required 357 * Values buffer (int8) 358 359 | Bytes 0-6 | Bytes 7-63 | 360 |------------------------------|-------------| 361 | 12, -7, 25, 0, -127, 127, 50 | unspecified | 362 363**Example Layout: ``List<List<Int8>>``** 364 365``[[[1, 2], [3, 4]], [[5, 6, 7], null, [8]], [[9, 10]]]`` 366 367will be represented as follows: :: 368 369 * Length 3 370 * Nulls count: 0 371 * Validity bitmap buffer: Not required 372 * Offsets buffer (int32) 373 374 | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 | 375 |------------|------------|------------|-------------|-------------| 376 | 0 | 2 | 5 | 6 | unspecified | 377 378 * Values array (`List<Int8>`) 379 * Length: 6, Null count: 1 380 * Validity bitmap buffer: 381 382 | Byte 0 (validity bitmap) | Bytes 1-63 | 383 |--------------------------|-------------| 384 | 00110111 | 0 (padding) | 385 386 * Offsets buffer (int32) 387 388 | Bytes 0-27 | Bytes 28-63 | 389 |----------------------|-------------| 390 | 0, 2, 4, 7, 7, 8, 10 | unspecified | 391 392 * Values array (Int8): 393 * Length: 10, Null count: 0 394 * Validity bitmap buffer: Not required 395 396 | Bytes 0-9 | Bytes 10-63 | 397 |-------------------------------|-------------| 398 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | unspecified | 399 400Fixed-Size List Layout 401---------------------- 402 403Fixed-Size List is a nested type in which each array slot contains a 404fixed-size sequence of values all having the same type. 405 406A fixed size list type is specified like ``FixedSizeList<T>[N]``, 407where ``T`` is any type (primitive or nested) and ``N`` is a 32-bit 408signed integer representing the length of the lists. 409 410A fixed size list array is represented by a values array, which is a 411child array of type T. T may also be a nested type. The value in slot 412``j`` of a fixed size list array is stored in an ``N``-long slice of 413the values array, starting at an offset of ``j * N``. 414 415**Example Layout: ``FixedSizeList<byte>[4]`` Array** 416 417Here we illustrate ``FixedSizeList<byte>[4]``. 418 419For an array of length 4 with respective values: :: 420 421 [[192, 168, 0, 12], null, [192, 168, 0, 25], [192, 168, 0, 1]] 422 423will have the following representation: :: 424 425 * Length: 4, Null count: 1 426 * Validity bitmap buffer: 427 428 | Byte 0 (validity bitmap) | Bytes 1-63 | 429 |--------------------------|-----------------------| 430 | 00001101 | 0 (padding) | 431 432 * Values array (byte array): 433 * Length: 16, Null count: 0 434 * validity bitmap buffer: Not required 435 436 | Bytes 0-3 | Bytes 4-7 | Bytes 8-15 | 437 |-----------------|-------------|---------------------------------| 438 | 192, 168, 0, 12 | unspecified | 192, 168, 0, 25, 192, 168, 0, 1 | 439 440 441Struct Layout 442------------- 443 444A struct is a nested type parameterized by an ordered sequence of 445types (which can all be distinct), called its fields. Each field must 446have a UTF8-encoded name, and these field names are part of the type 447metadata. 448 449A struct array does not have any additional allocated physical storage 450for its values. A struct array must still have an allocated validity 451bitmap, if it has one or more null values. 452 453Physically, a struct array has one child array for each field. The 454child arrays are independent and need not be adjacent to each other in 455memory. 456 457For example, the struct (field names shown here as strings for illustration 458purposes):: 459 460 Struct < 461 name: VarBinary 462 age: Int32 463 > 464 465has two child arrays, one ``VarBinary`` array (using variable-size binary 466layout) and one 4-byte primitive value array having ``Int32`` logical 467type. 468 469**Example Layout: ``Struct<VarBinary, Int32>``** 470 471The layout for ``[{'joe', 1}, {null, 2}, null, {'mark', 4}]`` would be: :: 472 473 * Length: 4, Null count: 1 474 * Validity bitmap buffer: 475 476 |Byte 0 (validity bitmap) | Bytes 1-63 | 477 |-------------------------|-----------------------| 478 | 00001011 | 0 (padding) | 479 480 * Children arrays: 481 * field-0 array (`VarBinary`): 482 * Length: 4, Null count: 2 483 * Validity bitmap buffer: 484 485 | Byte 0 (validity bitmap) | Bytes 1-63 | 486 |--------------------------|-----------------------| 487 | 00001001 | 0 (padding) | 488 489 * Offsets buffer: 490 491 | Bytes 0-19 | 492 |----------------| 493 | 0, 3, 3, 3, 7 | 494 495 * Values array: 496 * Length: 7, Null count: 0 497 * Validity bitmap buffer: Not required 498 499 * Value buffer: 500 501 | Bytes 0-6 | 502 |----------------| 503 | joemark | 504 505 * field-1 array (int32 array): 506 * Length: 4, Null count: 1 507 * Validity bitmap buffer: 508 509 | Byte 0 (validity bitmap) | Bytes 1-63 | 510 |--------------------------|-----------------------| 511 | 00001011 | 0 (padding) | 512 513 * Value Buffer: 514 515 |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 | 516 |------------|-------------|-------------|-------------|-------------| 517 | 1 | 2 | unspecified | 4 | unspecified | 518 519While a struct does not have physical storage for each of its semantic 520slots (i.e. each scalar C-like struct), an entire struct slot can be 521set to null via the validity bitmap. Any of the child field arrays can 522have null values according to their respective independent validity 523bitmaps. This implies that for a particular struct slot the validity 524bitmap for the struct array might indicate a null slot when one or 525more of its child arrays has a non-null value in their corresponding 526slot. When reading the struct array the parent validity bitmap takes 527priority. This is illustrated in the example above, the child arrays 528have valid entries for the null struct but are 'hidden' from the 529consumer by the parent array's validity bitmap. However, when treated 530independently corresponding values of the children array will be 531non-null. 532 533Union Layout 534------------ 535 536A union is defined by an ordered sequence of types; each slot in the 537union can have a value chosen from these types. The types are named 538like a struct's fields, and the names are part of the type metadata. 539 540Unlike other data types, unions do not have their own validity bitmap. Instead, 541the nullness of each slot is determined exclusively by the child arrays which 542are composed to create the union. 543 544We define two distinct union types, "dense" and "sparse", that are 545optimized for different use cases. 546 547Dense Union 548~~~~~~~~~~~ 549 550Dense union represents a mixed-type array with 5 bytes of overhead for 551each value. Its physical layout is as follows: 552 553* One child array for each type 554* Types buffer: A buffer of 8-bit signed integers. Each type in the 555 union has a corresponding type id whose values are found in this 556 buffer. A union with more than 127 possible types can be modeled as 557 a union of unions. 558* Offsets buffer: A buffer of signed int32 values indicating the 559 relative offset into the respective child array for the type in a 560 given slot. The respective offsets for each child value array must 561 be in order / increasing. 562 563Critically, the dense union allows for minimal overhead in the ubiquitous 564union-of-structs with non-overlapping-fields use case (``Union<s1: Struct1, s2: 565Struct2, s3: Struct3, ...>``) 566 567**Example Layout: Dense union** 568 569An example layout for logical union of: ``Union<f: float, i: int32>`` 570having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]`` 571 572:: 573 574 * Length: 4, Null count: 0 575 * Types buffer: 576 577 |Byte 0 | Byte 1 | Byte 2 | Byte 3 | Bytes 4-63 | 578 |---------|-------------|----------|----------|-------------| 579 | 0 | 0 | 0 | 1 | unspecified | 580 581 * Offset buffer: 582 583 |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 | 584 |----------|-------------|------------|-------------|-------------| 585 | 0 | 1 | 2 | 0 | unspecified | 586 587 * Children arrays: 588 * Field-0 array (f: float): 589 * Length: 2, Null count: 1 590 * Validity bitmap buffer: 00000101 591 592 * Value Buffer: 593 594 | Bytes 0-11 | Bytes 12-63 | 595 |----------------|-------------| 596 | 1.2, null, 3.4 | unspecified | 597 598 599 * Field-1 array (i: int32): 600 * Length: 1, Null count: 0 601 * Validity bitmap buffer: Not required 602 603 * Value Buffer: 604 605 | Bytes 0-3 | Bytes 4-63 | 606 |-----------|-------------| 607 | 5 | unspecified | 608 609Sparse Union 610~~~~~~~~~~~~ 611 612A sparse union has the same structure as a dense union, with the omission of 613the offsets array. In this case, the child arrays are each equal in length to 614the length of the union. 615 616While a sparse union may use significantly more space compared with a 617dense union, it has some advantages that may be desirable in certain 618use cases: 619 620* A sparse union is more amenable to vectorized expression evaluation in some use cases. 621* Equal-length arrays can be interpreted as a union by only defining the types array. 622 623**Example layout: ``SparseUnion<u0: Int32, u1: Float, u2: VarBinary>``** 624 625For the union array: :: 626 627 [{u0=5}, {u1=1.2}, {u2='joe'}, {u1=3.4}, {u0=4}, {u2='mark'}] 628 629will have the following layout: :: 630 631 * Length: 6, Null count: 0 632 * Types buffer: 633 634 | Byte 0 | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Bytes 6-63 | 635 |------------|-------------|-------------|-------------|-------------|--------------|-----------------------| 636 | 0 | 1 | 2 | 1 | 0 | 2 | unspecified (padding) | 637 638 * Children arrays: 639 640 * u0 (Int32): 641 * Length: 6, Null count: 4 642 * Validity bitmap buffer: 643 644 |Byte 0 (validity bitmap) | Bytes 1-63 | 645 |-------------------------|-----------------------| 646 |00010001 | 0 (padding) | 647 648 * Value buffer: 649 650 |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-63 | 651 |------------|-------------|-------------|-------------|-------------|--------------|-----------------------| 652 | 5 | unspecified | unspecified | unspecified | 4 | unspecified | unspecified (padding) | 653 654 * u1 (float): 655 * Length: 6, Null count: 4 656 * Validity bitmap buffer: 657 658 |Byte 0 (validity bitmap) | Bytes 1-63 | 659 |-------------------------|-----------------------| 660 | 00001010 | 0 (padding) | 661 662 * Value buffer: 663 664 |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-63 | 665 |-------------|-------------|-------------|-------------|-------------|--------------|-----------------------| 666 | unspecified | 1.2 | unspecified | 3.4 | unspecified | unspecified | unspecified (padding) | 667 668 * u2 (`VarBinary`) 669 * Length: 6, Null count: 4 670 * Validity bitmap buffer: 671 672 | Byte 0 (validity bitmap) | Bytes 1-63 | 673 |--------------------------|-----------------------| 674 | 00100100 | 0 (padding) | 675 676 * Offsets buffer (int32) 677 678 | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-27 | Bytes 28-63 | 679 |------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------| 680 | 0 | 0 | 0 | 3 | 3 | 3 | 7 | unspecified | 681 682 * Values array (VarBinary): 683 * Length: 7, Null count: 0 684 * Validity bitmap buffer: Not required 685 686 | Bytes 0-6 | Bytes 7-63 | 687 |------------|-----------------------| 688 | joemark | unspecified (padding) | 689 690Only the slot in the array corresponding to the type index is considered. All 691"unselected" values are ignored and could be any semantically correct array 692value. 693 694Null Layout 695----------- 696 697We provide a simplified memory-efficient layout for the Null data type 698where all values are null. In this case no memory buffers are 699allocated. 700 701.. _dictionary-encoded-layout: 702 703Dictionary-encoded Layout 704------------------------- 705 706Dictionary encoding is a data representation technique to represent 707values by integers referencing a **dictionary** usually consisting of 708unique values. It can be effective when you have data with many 709repeated values. 710 711Any array can be dictionary-encoded. The dictionary is stored as an optional 712property of an array. When a field is dictionary encoded, the values are 713represented by an array of non-negative integers representing the index of the 714value in the dictionary. The memory layout for a dictionary-encoded array is 715the same as that of a primitive integer layout. The dictionary is handled as a 716separate columnar array with its own respective layout. 717 718As an example, you could have the following data: :: 719 720 type: VarBinary 721 722 ['foo', 'bar', 'foo', 'bar', null, 'baz'] 723 724In dictionary-encoded form, this could appear as: 725 726:: 727 728 data VarBinary (dictionary-encoded) 729 index_type: Int32 730 values: [0, 1, 0, 1, null, 2] 731 732 dictionary 733 type: VarBinary 734 values: ['foo', 'bar', 'baz'] 735 736Note that a dictionary is permitted to contain duplicate values or 737nulls: 738 739:: 740 741 data VarBinary (dictionary-encoded) 742 index_type: Int32 743 values: [0, 1, 3, 1, 4, 2] 744 745 dictionary 746 type: VarBinary 747 values: ['foo', 'bar', 'baz', 'foo', null] 748 749The null count of such arrays is dictated only by the validity bitmap 750of its indices, irrespective of any null values in the dictionary. 751 752Since unsigned integers can be more difficult to work with in some cases 753(e.g. in the JVM), we recommend preferring signed integers over unsigned 754integers for representing dictionary indices. Additionally, we recommend 755avoiding using 64-bit unsigned integer indices unless they are required by an 756application. 757 758We discuss dictionary encoding as it relates to serialization further 759below. 760 761Buffer Listing for Each Layout 762------------------------------ 763 764For the avoidance of ambiguity, we provide listing the order and type 765of memory buffers for each layout. 766 767.. csv-table:: Buffer Layouts 768 :header: "Layout Type", "Buffer 0", "Buffer 1", "Buffer 2" 769 :widths: 30, 20, 20, 20 770 771 "Primitive",validity,data, 772 "Variable Binary",validity,offsets,data 773 "List",validity,offsets, 774 "Fixed-size List",validity,, 775 "Struct",validity,, 776 "Sparse Union",type ids,, 777 "Dense Union",type ids,offsets, 778 "Null",,, 779 "Dictionary-encoded",validity,data (indices), 780 781Logical Types 782============= 783 784The `Schema.fbs`_ defines built-in logical types supported by the 785Arrow columnar format. Each logical type uses one of the above 786physical layouts. Nested logical types may have different physical 787layouts depending on the particular realization of the type. 788 789We do not go into detail about the logical types definitions in this 790document as we consider `Schema.fbs`_ to be authoritative. 791 792.. _format-ipc: 793 794Serialization and Interprocess Communication (IPC) 795================================================== 796 797The primitive unit of serialized data in the columnar format is the 798"record batch". Semantically, a record batch is an ordered collection 799of arrays, known as its **fields**, each having the same length as one 800another but potentially different data types. A record batch's field 801names and types collectively form the batch's **schema**. 802 803In this section we define a protocol for serializing record batches 804into a stream of binary payloads and reconstructing record batches 805from these payloads without need for memory copying. 806 807The columnar IPC protocol utilizes a one-way stream of binary messages 808of these types: 809 810* Schema 811* RecordBatch 812* DictionaryBatch 813 814We specify a so-called *encapsulated IPC message* format which 815includes a serialized Flatbuffer type along with an optional message 816body. We define this message format before describing how to serialize 817each constituent IPC message type. 818 819Encapsulated message format 820--------------------------- 821 822For simple streaming and file-based serialization, we define a 823"encapsulated" message format for interprocess communication. Such 824messages can be "deserialized" into in-memory Arrow array objects by 825examining only the message metadata without any need to copy or move 826any of the actual data. 827 828The encapsulated binary message format is as follows: 829 830* A 32-bit continuation indicator. The value ``0xFFFFFFFF`` indicates 831 a valid message. This component was introduced in version 0.15.0 in 832 part to address the 8-byte alignment requirement of Flatbuffers 833* A 32-bit little-endian length prefix indicating the metadata size 834* The message metadata as using the ``Message`` type defined in 835 `Message.fbs`_ 836* Padding bytes to an 8-byte boundary 837* The message body, whose length must be a multiple of 8 bytes 838 839Schematically, we have: :: 840 841 <continuation: 0xFFFFFFFF> 842 <metadata_size: int32> 843 <metadata_flatbuffer: bytes> 844 <padding> 845 <message body> 846 847The complete serialized message must be a multiple of 8 bytes so that messages 848can be relocated between streams. Otherwise the amount of padding between the 849metadata and the message body could be non-deterministic. 850 851The ``metadata_size`` includes the size of the ``Message`` plus 852padding. The ``metadata_flatbuffer`` contains a serialized ``Message`` 853Flatbuffer value, which internally includes: 854 855* A version number 856* A particular message value (one of ``Schema``, ``RecordBatch``, or 857 ``DictionaryBatch``) 858* The size of the message body 859* A ``custom_metadata`` field for any application-supplied metadata 860 861When read from an input stream, generally the ``Message`` metadata is 862initially parsed and validated to obtain the body size. Then the body 863can be read. 864 865Schema message 866-------------- 867 868The Flatbuffers files `Schema.fbs`_ contains the definitions for all 869built-in logical data types and the ``Schema`` metadata type which 870represents the schema of a given record batch. A schema consists of 871an ordered sequence of fields, each having a name and type. A 872serialized ``Schema`` does not contain any data buffers, only type 873metadata. 874 875The ``Field`` Flatbuffers type contains the metadata for a single 876array. This includes: 877 878* The field's name 879* The field's logical type 880* Whether the field is semantically nullable. While this has no 881 bearing on the array's physical layout, many systems distinguish 882 nullable and non-nullable fields and we want to allow them to 883 preserve this metadata to enable faithful schema round trips. 884* A collection of child ``Field`` values, for nested types 885* A ``dictionary`` property indicating whether the field is 886 dictionary-encoded or not. If it is, a dictionary "id" is assigned 887 to allow matching a subsequent dictionary IPC message with the 888 appropriate field. 889 890We additionally provide both schema-level and field-level 891``custom_metadata`` attributes allowing for systems to insert their 892own application defined metadata to customize behavior. 893 894RecordBatch message 895------------------- 896 897A RecordBatch message contains the actual data buffers corresponding 898to the physical memory layout determined by a schema. The metadata for 899this message provides the location and size of each buffer, permitting 900Array data structures to be reconstructed using pointer arithmetic and 901thus no memory copying. 902 903The serialized form of the record batch is the following: 904 905* The ``data header``, defined as the ``RecordBatch`` type in 906 `Message.fbs`_. 907* The ``body``, a flat sequence of memory buffers written end-to-end 908 with appropriate padding to ensure a minimum of 8-byte alignment 909 910The data header contains the following: 911 912* The length and null count for each flattened field in the record 913 batch 914* The memory offset and length of each constituent ``Buffer`` in the 915 record batch's body 916 917Fields and buffers are flattened by a pre-order depth-first traversal 918of the fields in the record batch. For example, let's consider the 919schema :: 920 921 col1: Struct<a: Int32, b: List<item: Int64>, c: Float64> 922 col2: Utf8 923 924The flattened version of this is: :: 925 926 FieldNode 0: Struct name='col1' 927 FieldNode 1: Int32 name='a' 928 FieldNode 2: List name='b' 929 FieldNode 3: Int64 name='item' 930 FieldNode 4: Float64 name='c' 931 FieldNode 5: Utf8 name='col2' 932 933For the buffers produced, we would have the following (refer to the 934table above): :: 935 936 buffer 0: field 0 validity 937 buffer 1: field 1 validity 938 buffer 2: field 1 values 939 buffer 3: field 2 validity 940 buffer 4: field 2 offsets 941 buffer 5: field 3 validity 942 buffer 6: field 3 values 943 buffer 7: field 4 validity 944 buffer 8: field 4 values 945 buffer 9: field 5 validity 946 buffer 10: field 5 offsets 947 buffer 11: field 5 data 948 949The ``Buffer`` Flatbuffers value describes the location and size of a 950piece of memory. Generally these are interpreted relative to the 951**encapsulated message format** defined below. 952 953The ``size`` field of ``Buffer`` is not required to account for padding 954bytes. Since this metadata can be used to communicate in-memory pointer 955addresses between libraries, it is recommended to set ``size`` to the actual 956memory size rather than the padded size. 957 958Byte Order (`Endianness`_) 959--------------------------- 960 961The Arrow format is little endian by default. 962 963Serialized Schema metadata has an endianness field indicating 964endianness of RecordBatches. Typically this is the endianness of the 965system where the RecordBatch was generated. The main use case is 966exchanging RecordBatches between systems with the same Endianness. At 967first we will return an error when trying to read a Schema with an 968endianness that does not match the underlying system. The reference 969implementation is focused on Little Endian and provides tests for 970it. Eventually we may provide automatic conversion via byte swapping. 971 972IPC Streaming Format 973-------------------- 974 975We provide a streaming protocol or "format" for record batches. It is 976presented as a sequence of encapsulated messages, each of which 977follows the format above. The schema comes first in the stream, and it 978is the same for all of the record batches that follow. If any fields 979in the schema are dictionary-encoded, one or more ``DictionaryBatch`` 980messages will be included. ``DictionaryBatch`` and ``RecordBatch`` 981messages may be interleaved, but before any dictionary key is used in 982a ``RecordBatch`` it should be defined in a ``DictionaryBatch``. :: 983 984 <SCHEMA> 985 <DICTIONARY 0> 986 ... 987 <DICTIONARY k - 1> 988 <RECORD BATCH 0> 989 ... 990 <DICTIONARY x DELTA> 991 ... 992 <DICTIONARY y DELTA> 993 ... 994 <RECORD BATCH n - 1> 995 <EOS [optional]: 0xFFFFFFFF 0x00000000> 996 997.. note:: An edge-case for interleaved dictionary and record batches occurs 998 when the record batches contain dictionary encoded arrays that are 999 completely null. In this case, the dictionary for the encoded column might 1000 appear after the first record batch. 1001 1002When a stream reader implementation is reading a stream, after each 1003message, it may read the next 8 bytes to determine both if the stream 1004continues and the size of the message metadata that follows. Once the 1005message flatbuffer is read, you can then read the message body. 1006 1007The stream writer can signal end-of-stream (EOS) either by writing 8 bytes 1008containing the 4-byte continuation indicator (``0xFFFFFFFF``) followed by 0 1009metadata length (``0x00000000``) or closing the stream interface. 1010 1011IPC File Format 1012--------------- 1013 1014We define a "file format" supporting random access that is build with 1015the stream format. The file starts and ends with a magic string 1016``ARROW1`` (plus padding). What follows in the file is identical to 1017the stream format. At the end of the file, we write a *footer* 1018containing a redundant copy of the schema (which is a part of the 1019streaming format) plus memory offsets and sizes for each of the data 1020blocks in the file. This enables random access any record batch in the 1021file. See ``File.fbs`` for the precise details of the file footer. 1022 1023Schematically we have: :: 1024 1025 <magic number "ARROW1"> 1026 <empty padding bytes [to 8 byte boundary]> 1027 <STREAMING FORMAT with EOS> 1028 <FOOTER> 1029 <FOOTER SIZE: int32> 1030 <magic number "ARROW1"> 1031 1032In the file format, there is no requirement that dictionary keys 1033should be defined in a ``DictionaryBatch`` before they are used in a 1034``RecordBatch``, as long as the keys are defined somewhere in the 1035file. Further more, it is invalid to have more than one **non-delta** 1036dictionary batch per dictionary ID (i.e. dictionary replacement is not 1037supported). Delta dictionaries are applied in the order they appear in 1038the file footer. 1039 1040Dictionary Messages 1041------------------- 1042 1043Dictionaries are written in the stream and file formats as a sequence of record 1044batches, each having a single field. The complete semantic schema for a 1045sequence of record batches, therefore, consists of the schema along with all of 1046the dictionaries. The dictionary types are found in the schema, so it is 1047necessary to read the schema to first determine the dictionary types so that 1048the dictionaries can be properly interpreted: :: 1049 1050 table DictionaryBatch { 1051 id: long; 1052 data: RecordBatch; 1053 isDelta: boolean = false; 1054 } 1055 1056The dictionary ``id`` in the message metadata can be referenced one or more times 1057in the schema, so that dictionaries can even be used for multiple fields. See 1058the :ref:`dictionary-encoded-layout` section for more about the semantics of 1059dictionary-encoded data. 1060 1061The dictionary ``isDelta`` flag allows existing dictionaries to be 1062expanded for future record batch materializations. A dictionary batch 1063with ``isDelta`` set indicates that its vector should be concatenated 1064with those of any previous batches with the same ``id``. In a stream 1065which encodes one column, the list of strings ``["A", "B", "C", "B", 1066"D", "C", "E", "A"]``, with a delta dictionary batch could take the 1067form: :: 1068 1069 <SCHEMA> 1070 <DICTIONARY 0> 1071 (0) "A" 1072 (1) "B" 1073 (2) "C" 1074 1075 <RECORD BATCH 0> 1076 0 1077 1 1078 2 1079 1 1080 1081 <DICTIONARY 0 DELTA> 1082 (3) "D" 1083 (4) "E" 1084 1085 <RECORD BATCH 1> 1086 3 1087 2 1088 4 1089 0 1090 EOS 1091 1092Alternatively, if ``isDelta`` is set to false, then the dictionary 1093replaces the existing dictionary for the same ID. Using the same 1094example as above, an alternate encoding could be: :: 1095 1096 1097 <SCHEMA> 1098 <DICTIONARY 0> 1099 (0) "A" 1100 (1) "B" 1101 (2) "C" 1102 1103 <RECORD BATCH 0> 1104 0 1105 1 1106 2 1107 1 1108 1109 <DICTIONARY 0> 1110 (0) "A" 1111 (1) "C" 1112 (2) "D" 1113 (3) "E" 1114 1115 <RECORD BATCH 1> 1116 2 1117 1 1118 3 1119 0 1120 EOS 1121 1122 1123Custom Application Metadata 1124--------------------------- 1125 1126We provide a ``custom_metadata`` field at three levels to provide a 1127mechanism for developers to pass application-specific metadata in 1128Arrow protocol messages. This includes ``Field``, ``Schema``, and 1129``Message``. 1130 1131The colon symbol ``:`` is to be used as a namespace separator. It can 1132be used multiple times in a key. 1133 1134The ``ARROW`` pattern is a reserved namespace for internal Arrow use 1135in the ``custom_metadata`` fields. For example, 1136``ARROW:extension:name``. 1137 1138.. _format_metadata_extension_types: 1139 1140Extension Types 1141--------------- 1142 1143User-defined "extension" types can be defined setting certain 1144``KeyValue`` pairs in ``custom_metadata`` in the ``Field`` metadata 1145structure. These extension keys are: 1146 1147* ``'ARROW:extension:name'`` for the string name identifying the 1148 custom data type. We recommend that you use a "namespace"-style 1149 prefix for extension type names to minimize the possibility of 1150 conflicts with multiple Arrow readers and writers in the same 1151 application. For example, use ``myorg.name_of_type`` instead of 1152 simply ``name_of_type`` 1153* ``'ARROW:extension:metadata'`` for a serialized representation 1154 of the ``ExtensionType`` necessary to reconstruct the custom type 1155 1156This extension metadata can annotate any of the built-in Arrow logical 1157types. The intent is that an implementation that does not support an 1158extension type can still handle the underlying data. For example a 115916-byte UUID value could be embedded in ``FixedSizeBinary(16)``, and 1160implementations that do not have this extension type can still work 1161with the underlying binary values and pass along the 1162``custom_metadata`` in subsequent Arrow protocol messages. 1163 1164Extension types may or may not use the 1165``'ARROW:extension:metadata'`` field. Let's consider some example 1166extension types: 1167 1168* ``uuid`` represented as ``FixedSizeBinary(16)`` with empty metadata 1169* ``latitude-longitude`` represented as ``struct<latitude: double, 1170 longitude: double>``, and empty metadata 1171* ``tensor`` (multidimensional array) stored as ``Binary`` values and 1172 having serialized metadata indicating the data type and shape of 1173 each value. This could be JSON like ``{'type': 'int8', 'shape': [4, 1174 5]}`` for a 4x5 cell tensor. 1175* ``trading-time`` represented as ``Timestamp`` with serialized 1176 metadata indicating the market trading calendar the data corresponds 1177 to 1178 1179Implementation guidelines 1180========================= 1181 1182An execution engine (or framework, or UDF executor, or storage engine, 1183etc) can implement only a subset of the Arrow spec and/or extend it 1184given the following constraints: 1185 1186Implementing a subset the spec 1187------------------------------ 1188 1189* **If only producing (and not consuming) arrow vectors**: Any subset 1190 of the vector spec and the corresponding metadata can be implemented. 1191* **If consuming and producing vectors**: There is a minimal subset of 1192 vectors to be supported. Production of a subset of vectors and 1193 their corresponding metadata is always fine. Consumption of vectors 1194 should at least convert the unsupported input vectors to the 1195 supported subset (for example Timestamp.millis to timestamp.micros 1196 or int32 to int64). 1197 1198Extensibility 1199------------- 1200 1201An execution engine implementor can also extend their memory 1202representation with their own vectors internally as long as they are 1203never exposed. Before sending data to another system expecting Arrow 1204data, these custom vectors should be converted to a type that exist in 1205the Arrow spec. 1206 1207References 1208---------- 1209* Apache Drill Documentation - `Value Vectors`_ 1210 1211.. _Flatbuffers: http://github.com/google/flatbuffers 1212.. _Flatbuffers protocol definition files: https://github.com/apache/arrow/tree/master/format 1213.. _Schema.fbs: https://github.com/apache/arrow/blob/master/format/Schema.fbs 1214.. _Message.fbs: https://github.com/apache/arrow/blob/master/format/Message.fbs 1215.. _least-significant bit (LSB) numbering: https://en.wikipedia.org/wiki/Bit_numbering 1216.. _Intel performance guide: https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors 1217.. _Endianness: https://en.wikipedia.org/wiki/Endianness 1218.. _SIMD: https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-introduction-to-the-simd-data-layout-templates 1219.. _Parquet: https://parquet.apache.org/documentation/latest/ 1220.. _Value Vectors: https://drill.apache.org/docs/value-vectors/ 1221