.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. currentmodule:: pyarrow
.. _data:

Data Types and In-Memory Data Model
===================================

Apache Arrow defines columnar array data structures by composing type metadata
with memory buffers, like the ones explained in the documentation on
:ref:`Memory and IO <io>`. These data structures are exposed in Python through
a series of interrelated classes:

* **Type Metadata**: Instances of ``pyarrow.DataType``, which describe a logical
  array type
* **Schemas**: Instances of ``pyarrow.Schema``, which describe a named
  collection of types. These can be thought of as the column types in a
  table-like object.
* **Arrays**: Instances of ``pyarrow.Array``, which are atomic, contiguous
  columnar data structures composed from Arrow Buffer objects
* **Record Batches**: Instances of ``pyarrow.RecordBatch``, which are a
  collection of Array objects with a particular Schema
* **Tables**: Instances of ``pyarrow.Table``, a logical table data structure in
  which each column consists of one or more ``pyarrow.Array`` objects of the
  same type.

We will examine these in the sections below in a series of examples.
.. _data.types:

Type Metadata
-------------

Apache Arrow defines language-agnostic column-oriented data structures for
array data. These include:

* **Fixed-length primitive types**: numbers, booleans, dates and times, fixed
  size binary, decimals, and other values that fit into a fixed number of bits
* **Variable-length primitive types**: binary, string
* **Nested types**: list, struct, and union
* **Dictionary type**: An encoded categorical type (more on this later)

Each logical data type in Arrow has a corresponding factory function for
creating an instance of that type object in Python:

.. ipython:: python

    import pyarrow as pa
    t1 = pa.int32()
    t2 = pa.string()
    t3 = pa.binary()
    t4 = pa.binary(10)
    t5 = pa.timestamp('ms')

    t1
    print(t1)
    print(t4)
    print(t5)

We use the name **logical type** because the **physical** storage may be the
same for one or more types. For example, ``int64``, ``float64``, and
``timestamp[ms]`` all occupy 64 bits per value.

These objects are `metadata`; they are used for describing the data in arrays,
schemas, and record batches. In Python, they can be used in functions where the
input data (e.g. Python objects) may be coerced to more than one Arrow type.

The :class:`~pyarrow.Field` type is a type plus a name and optional
user-defined metadata:

.. ipython:: python

    f0 = pa.field('int32_field', t1)
    f0
    f0.name
    f0.type

Arrow supports **nested value types** like list, struct, and union. When
creating these, you must pass types or fields to indicate the data types of the
types' children. For example, we can define a list of int32 values with:

.. ipython:: python

    t6 = pa.list_(t1)
    t6

A `struct` is a collection of named fields:
.. ipython:: python

    fields = [
        pa.field('s0', t1),
        pa.field('s1', t2),
        pa.field('s2', t4),
        pa.field('s3', t6),
    ]

    t7 = pa.struct(fields)
    print(t7)

For convenience, you can pass ``(name, type)`` tuples directly instead of
:class:`~pyarrow.Field` instances:

.. ipython:: python

    t8 = pa.struct([('s0', t1), ('s1', t2), ('s2', t4), ('s3', t6)])
    print(t8)
    t8 == t7

See :ref:`Data Types API <api.types>` for a full listing of data type
functions.

.. _data.schema:

Schemas
-------

The :class:`~pyarrow.Schema` type is similar to the ``struct`` array type; it
defines the column names and types in a record batch or table data
structure. The :func:`pyarrow.schema` factory function makes new Schema
objects in Python:

.. ipython:: python

    my_schema = pa.schema([('field0', t1),
                           ('field1', t2),
                           ('field2', t4),
                           ('field3', t6)])
    my_schema

In some applications, you may not create schemas directly, only using the ones
that are embedded in :ref:`IPC messages <ipc>`.

.. _data.array:

Arrays
------

For each data type, there is an accompanying array data structure for holding
memory buffers that define a single contiguous chunk of columnar array
data. When you are using PyArrow, this data may come from IPC tools, though it
can also be created from various types of Python sequences (lists, NumPy
arrays, pandas data).

A simple way to create arrays is with ``pyarrow.array``, which is similar to
the ``numpy.array`` function. By default PyArrow will infer the data type
for you:

.. ipython:: python

    arr = pa.array([1, 2, None, 3])
    arr

But you may also pass a specific data type to override type inference:

.. ipython:: python

    pa.array([1, 2], type=pa.uint16())

The array's ``type`` attribute is the corresponding piece of type metadata:
.. ipython:: python

    arr.type

Each in-memory array has a known length and null count (which will be 0 if
there are no null values):

.. ipython:: python

    len(arr)
    arr.null_count

Scalar values can be selected with normal indexing. ``pyarrow.array`` converts
``None`` values to Arrow nulls; we return the special ``pyarrow.NA`` value for
nulls:

.. ipython:: python

    arr[0]
    arr[2]

Arrow data is immutable, so values can be selected but not assigned.

Arrays can be sliced without copying:

.. ipython:: python

    arr[1:3]

None values and NaN handling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As mentioned in the above section, the Python object ``None`` is always
converted to an Arrow null element on conversion to ``pyarrow.Array``. The
float NaN value, represented by either the Python object ``float('nan')`` or
``numpy.nan``, is normally converted to a *valid* float value during the
conversion. If an integer input containing ``np.nan`` is supplied to
``pyarrow.array``, a ``ValueError`` is raised.

For better compatibility with pandas, we support interpreting NaN values as
null elements. This is enabled automatically on all ``from_pandas`` functions
and can be enabled on the other conversion functions by passing
``from_pandas=True`` as a function parameter.

List arrays
~~~~~~~~~~~

``pyarrow.array`` is able to infer the type of simple nested data structures
like lists:

.. ipython:: python

    nested_arr = pa.array([[], None, [1, 2], [None, 1]])
    print(nested_arr.type)

Struct arrays
~~~~~~~~~~~~~

For other kinds of nested arrays, such as struct arrays, you currently need
to pass the type explicitly. Struct arrays can be initialized from a
sequence of Python dicts or tuples:
.. ipython:: python

    ty = pa.struct([('x', pa.int8()),
                    ('y', pa.bool_())])
    pa.array([{'x': 1, 'y': True}, {'x': 2, 'y': False}], type=ty)
    pa.array([(3, True), (4, False)], type=ty)

When initializing a struct array, nulls are allowed both at the struct
level and at the individual field level. If initializing from a sequence
of Python dicts, a missing dict key is handled as a null value:

.. ipython:: python

    pa.array([{'x': 1}, None, {'y': None}], type=ty)

You can also construct a struct array from existing arrays for each of the
struct's components. In this case, data storage will be shared with the
individual arrays, and no copy is involved:

.. ipython:: python

    xs = pa.array([5, 6, 7], type=pa.int16())
    ys = pa.array([False, True, True])
    arr = pa.StructArray.from_arrays((xs, ys), names=('x', 'y'))
    arr.type
    arr

Union arrays
~~~~~~~~~~~~

The union type represents a nested array type where each value can be one
(and only one) of a set of possible types. There are two possible
storage types for union arrays: sparse and dense.

In a sparse union array, each of the child arrays has the same length
as the resulting union array. They are accompanied by an ``int8`` "types"
array that tells, for each value, from which child array it must be
selected:

.. ipython:: python

    xs = pa.array([5, 6, 7])
    ys = pa.array([False, False, True])
    types = pa.array([0, 1, 1], type=pa.int8())
    union_arr = pa.UnionArray.from_sparse(types, [xs, ys])
    union_arr.type
    union_arr

In a dense union array, you also pass, in addition to the ``int8`` "types"
array, an ``int32`` "offsets" array that tells, for each value, at which
offset in the selected child array it can be found:
.. ipython:: python

    xs = pa.array([5, 6, 7])
    ys = pa.array([False, True])
    types = pa.array([0, 1, 1, 0, 0], type=pa.int8())
    offsets = pa.array([0, 0, 1, 1, 2], type=pa.int32())
    union_arr = pa.UnionArray.from_dense(types, offsets, [xs, ys])
    union_arr.type
    union_arr


Dictionary Arrays
~~~~~~~~~~~~~~~~~

The **Dictionary** type in PyArrow is a special array type that is similar to a
factor in R or a ``pandas.Categorical``. It enables one or more record batches
in a file or stream to transmit integer *indices* referencing a shared
**dictionary** containing the distinct values in the logical array. This is
particularly useful with strings, to save memory and improve performance.

The way that dictionaries are handled in the Apache Arrow format and the way
they appear in C++ and Python is slightly different. We define a special
:class:`~.DictionaryArray` type with a corresponding dictionary type. Let's
consider an example:

.. ipython:: python

    indices = pa.array([0, 1, 0, 1, 2, 0, None, 2])
    dictionary = pa.array(['foo', 'bar', 'baz'])

    dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
    dict_array

Here we have:

.. ipython:: python

    print(dict_array.type)
    dict_array.indices
    dict_array.dictionary

When using :class:`~.DictionaryArray` with pandas, the analogue is
``pandas.Categorical`` (more on this later):

.. ipython:: python

    dict_array.to_pandas()

.. _data.record_batch:

Record Batches
--------------

A **Record Batch** in Apache Arrow is a collection of equal-length array
instances. Let's consider a collection of arrays:
.. ipython:: python

    data = [
        pa.array([1, 2, 3, 4]),
        pa.array(['foo', 'bar', 'baz', None]),
        pa.array([True, None, False, True])
    ]

A record batch can be created from this list of arrays using
``RecordBatch.from_arrays``:

.. ipython:: python

    batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
    batch.num_columns
    batch.num_rows
    batch.schema

    batch[1]

Like an array, a record batch can be sliced without copying memory:

.. ipython:: python

    batch2 = batch.slice(1, 3)
    batch2[1]

.. _data.table:

Tables
------

The PyArrow :class:`~.Table` type is not part of the Apache Arrow
specification, but is rather a tool to help with wrangling multiple record
batches and array pieces as a single logical dataset. As a relevant example, we
may receive multiple small record batches in a socket stream, then need to
concatenate them into contiguous memory for use in NumPy or pandas. The Table
object makes this efficient without requiring additional memory copying.

Considering the record batch we created above, we can create a Table containing
one or more copies of the batch using ``Table.from_batches``:

.. ipython:: python

    batches = [batch] * 5
    table = pa.Table.from_batches(batches)
    table
    table.num_rows

The table's columns are instances of :class:`~.ChunkedArray`, which is a
container for one or more arrays of the same type.

.. ipython:: python

    c = table[0]
    c
    c.num_chunks
    c.chunk(0)

As you'll see in the :ref:`pandas section <pandas_interop>`, we can convert
these objects to contiguous NumPy arrays for use in pandas:

.. ipython:: python

    c.to_pandas()

Multiple tables can also be concatenated together to form a single table using
``pyarrow.concat_tables``, if the schemas are equal:
.. ipython:: python

    tables = [table] * 2
    table_all = pa.concat_tables(tables)
    table_all.num_rows
    c = table_all[0]
    c.num_chunks

This is similar to ``Table.from_batches``, but uses tables as input instead of
record batches. Record batches can be made into tables, but not the other way
around, so if your data is already in table form, then use
``pyarrow.concat_tables``.

Custom Schema and Field Metadata
--------------------------------

TODO