.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. currentmodule:: pyarrow
.. _data:

Data Types and In-Memory Data Model
===================================

Apache Arrow defines columnar array data structures by composing type metadata
with memory buffers, like the ones explained in the documentation on
:ref:`Memory and IO <io>`. These data structures are exposed in Python through
a series of interrelated classes:

* **Type Metadata**: Instances of ``pyarrow.DataType``, which describe a
  logical array type.
* **Schemas**: Instances of ``pyarrow.Schema``, which describe a named
  collection of types. These can be thought of as the column types in a
  table-like object.
* **Arrays**: Instances of ``pyarrow.Array``, which are atomic, contiguous
  columnar data structures composed from Arrow Buffer objects.
* **Record Batches**: Instances of ``pyarrow.RecordBatch``, which are a
  collection of Array objects with a particular Schema.
* **Tables**: Instances of ``pyarrow.Table``, a logical table data structure
  in which each column consists of one or more ``pyarrow.Array`` objects of
  the same type.

We will examine these in the sections below in a series of examples.

.. _data.types:

Type Metadata
-------------

Apache Arrow defines language-agnostic column-oriented data structures for
array data. These include:

* **Fixed-length primitive types**: numbers, booleans, dates and times, fixed
  size binary, decimals, and other values that fit into a fixed number of bits
* **Variable-length primitive types**: binary, string
* **Nested types**: list, struct, and union
* **Dictionary type**: An encoded categorical type (more on this later)

Each logical data type in Arrow has a corresponding factory function for
creating an instance of that type object in Python:

.. ipython:: python

   import pyarrow as pa
   t1 = pa.int32()
   t2 = pa.string()
   t3 = pa.binary()
   t4 = pa.binary(10)
   t5 = pa.timestamp('ms')

   t1
   print(t1)
   print(t4)
   print(t5)

We use the name **logical type** because the **physical** storage may be the
same for one or more types. For example, ``int64``, ``float64``, and
``timestamp[ms]`` all occupy 64 bits per value.
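
The shared physical storage is easy to check: each fixed-width type reports
its ``bit_width``. A quick illustration:

.. ipython:: python

   pa.int64().bit_width
   pa.float64().bit_width
   pa.timestamp('ms').bit_width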

These objects are `metadata`; they are used for describing the data in arrays,
schemas, and record batches. In Python, they can be used in functions where the
input data (e.g. Python objects) may be coerced to more than one Arrow type.

The :class:`~pyarrow.Field` type is a type plus a name and optional
user-defined metadata:

.. ipython:: python

   f0 = pa.field('int32_field', t1)
   f0
   f0.name
   f0.type
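
A field can also carry a nullability flag and the optional metadata mentioned
above (metadata keys and values are stored as bytes). A quick illustration:

.. ipython:: python

   f1 = pa.field('int32_field', t1, nullable=False,
                 metadata={'origin': 'example'})
   f1.nullable
   f1.metadata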

Arrow supports **nested value types** like list, struct, and union. When
creating these, you must pass types or fields to indicate the data types of
the type's children. For example, we can define a list of int32 values with:

.. ipython:: python

   t6 = pa.list_(t1)
   t6

A `struct` is a collection of named fields:

.. ipython:: python

   fields = [
       pa.field('s0', t1),
       pa.field('s1', t2),
       pa.field('s2', t4),
       pa.field('s3', t6),
   ]

   t7 = pa.struct(fields)
   print(t7)

For convenience, you can pass ``(name, type)`` tuples directly instead of
:class:`~pyarrow.Field` instances:

.. ipython:: python

   t8 = pa.struct([('s0', t1), ('s1', t2), ('s2', t4), ('s3', t6)])
   print(t8)
   t8 == t7


See :ref:`Data Types API <api.types>` for a full listing of data type
functions.

.. _data.schema:

Schemas
-------

The :class:`~pyarrow.Schema` type is similar to the ``struct`` array type; it
defines the column names and types in a record batch or table data
structure. The :func:`pyarrow.schema` factory function makes new Schema
objects in Python:

.. ipython:: python

   my_schema = pa.schema([('field0', t1),
                          ('field1', t2),
                          ('field2', t4),
                          ('field3', t6)])
   my_schema
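
A schema behaves much like an ordered collection of fields; for example, you
can list the column names and types, or look fields up by index. A small
illustration:

.. ipython:: python

   my_schema.names
   my_schema.types
   my_schema[0]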
146
147In some applications, you may not create schemas directly, only using the ones
148that are embedded in :ref:`IPC messages <ipc>`.
149
150.. _data.array:
151
152Arrays
153------
154
155For each data type, there is an accompanying array data structure for holding
156memory buffers that define a single contiguous chunk of columnar array
157data. When you are using PyArrow, this data may come from IPC tools, though it
158can also be created from various types of Python sequences (lists, NumPy
159arrays, pandas data).
160
161A simple way to create arrays is with ``pyarrow.array``, which is similar to
162the ``numpy.array`` function.  By default PyArrow will infer the data type
163for you:

.. ipython:: python

   arr = pa.array([1, 2, None, 3])
   arr

But you may also pass a specific data type to override type inference:

.. ipython:: python

   pa.array([1, 2], type=pa.uint16())

The array's ``type`` attribute is the corresponding piece of type metadata:

.. ipython:: python

   arr.type

Each in-memory array has a known length and null count (which will be 0 if
there are no null values):

.. ipython:: python

   len(arr)
   arr.null_count

Scalar values can be selected with normal indexing.  ``pyarrow.array`` converts
``None`` values to Arrow nulls; we return the special ``pyarrow.NA`` value for
nulls:

.. ipython:: python

   arr[0]
   arr[2]

Arrow data is immutable, so values can be selected but not assigned.

Arrays can be sliced without copying:

.. ipython:: python

   arr[1:3]
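
To get the values back as plain Python objects, ``to_pylist`` converts an
array (or a slice of it) to a list, mapping nulls to ``None``. A quick
illustration:

.. ipython:: python

   arr.to_pylist()
   arr[1:3].to_pylist()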

None values and NaN handling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As mentioned in the above section, the Python object ``None`` is always
converted to an Arrow null element on conversion to ``pyarrow.Array``. The
float NaN value, represented by either the Python object ``float('nan')`` or
``numpy.nan``, is normally converted to a *valid* float value during the
conversion. If an integer input containing ``np.nan`` is supplied to
``pyarrow.array``, ``ValueError`` is raised.

For better compatibility with pandas, we support interpreting NaN values as
null elements. This is enabled automatically on all ``from_pandas`` functions
and can be enabled on the other conversion functions by passing
``from_pandas=True`` as a function parameter.
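
For example, with ``pyarrow.array`` (a brief illustration of the
``from_pandas`` option):

.. ipython:: python

   import numpy as np
   pa.array([1.0, np.nan])                     # NaN kept as a (valid) float
   pa.array([1.0, np.nan], from_pandas=True)   # NaN interpreted as null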

List arrays
~~~~~~~~~~~

``pyarrow.array`` is able to infer the type of simple nested data structures
like lists:

.. ipython:: python

   nested_arr = pa.array([[], None, [1, 2], [None, 1]])
   print(nested_arr.type)
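
If inference does not produce the value type you want, you can also pass the
list type explicitly; a small sketch:

.. ipython:: python

   pa.array([[1, 2], None, []], type=pa.list_(pa.int8()))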

Struct arrays
~~~~~~~~~~~~~

For other kinds of nested arrays, such as struct arrays, you currently need
to pass the type explicitly.  Struct arrays can be initialized from a
sequence of Python dicts or tuples:

.. ipython:: python

   ty = pa.struct([('x', pa.int8()),
                   ('y', pa.bool_())])
   pa.array([{'x': 1, 'y': True}, {'x': 2, 'y': False}], type=ty)
   pa.array([(3, True), (4, False)], type=ty)

When initializing a struct array, nulls are allowed both at the struct
level and at the individual field level.  If initializing from a sequence
of Python dicts, a missing dict key is handled as a null value:

.. ipython:: python

   pa.array([{'x': 1}, None, {'y': None}], type=ty)

You can also construct a struct array from existing arrays for each of the
struct's components.  In this case, data storage will be shared with the
individual arrays, and no copy is involved:

.. ipython:: python

   xs = pa.array([5, 6, 7], type=pa.int16())
   ys = pa.array([False, True, True])
   arr = pa.StructArray.from_arrays((xs, ys), names=('x', 'y'))
   arr.type
   arr
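
Going the other way, the child arrays can be retrieved from the struct array;
assuming a reasonably recent PyArrow, ``StructArray.field`` returns a child by
name (or index) without copying:

.. ipython:: python

   arr.field('x')
   arr.field('y')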

Union arrays
~~~~~~~~~~~~

The union type represents a nested array type where each value can be one
(and only one) of a set of possible types.  There are two possible
storage types for union arrays: sparse and dense.

In a sparse union array, each of the child arrays has the same length
as the resulting union array.  They are accompanied by an ``int8`` "types"
array that tells, for each value, from which child array it must be
selected:

.. ipython:: python

   xs = pa.array([5, 6, 7])
   ys = pa.array([False, False, True])
   types = pa.array([0, 1, 1], type=pa.int8())
   union_arr = pa.UnionArray.from_sparse(types, [xs, ys])
   union_arr.type
   union_arr

In a dense union array, you also pass, in addition to the ``int8`` "types"
array, an ``int32`` "offsets" array that tells, for each value, at
which offset in the selected child array it can be found:

.. ipython:: python

   xs = pa.array([5, 6, 7])
   ys = pa.array([False, True])
   types = pa.array([0, 1, 1, 0, 0], type=pa.int8())
   offsets = pa.array([0, 0, 1, 1, 2], type=pa.int32())
   union_arr = pa.UnionArray.from_dense(types, offsets, [xs, ys])
   union_arr.type
   union_arr

.. _data.dictionary:

Dictionary Arrays
~~~~~~~~~~~~~~~~~

The **Dictionary** type in PyArrow is a special array type that is similar to a
factor in R or a ``pandas.Categorical``. It enables one or more record batches
in a file or stream to transmit integer *indices* referencing a shared
**dictionary** containing the distinct values in the logical array. This is
particularly useful with strings, to save memory and improve performance.

The way that dictionaries are handled in the Apache Arrow format and the way
they appear in C++ and Python is slightly different. We define a special
:class:`~.DictionaryArray` type with a corresponding dictionary type. Let's
consider an example:

.. ipython:: python

   indices = pa.array([0, 1, 0, 1, 2, 0, None, 2])
   dictionary = pa.array(['foo', 'bar', 'baz'])

   dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
   dict_array

Here we have:

.. ipython:: python

   print(dict_array.type)
   dict_array.indices
   dict_array.dictionary
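
Going the other direction, an existing array can be dictionary-encoded with
``Array.dictionary_encode``, which computes the dictionary of distinct values
for you:

.. ipython:: python

   strings = pa.array(['foo', 'bar', 'foo', None, 'baz'])
   encoded = strings.dictionary_encode()
   print(encoded.type)
   encoded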

When using :class:`~.DictionaryArray` with pandas, the analogue is
``pandas.Categorical`` (more on this later):

.. ipython:: python

   dict_array.to_pandas()

.. _data.record_batch:

Record Batches
--------------

A **Record Batch** in Apache Arrow is a collection of equal-length array
instances. Let's consider a collection of arrays:

.. ipython:: python

   data = [
       pa.array([1, 2, 3, 4]),
       pa.array(['foo', 'bar', 'baz', None]),
       pa.array([True, None, False, True])
   ]

A record batch can be created from this list of arrays using
``RecordBatch.from_arrays``:

.. ipython:: python

   batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
   batch.num_columns
   batch.num_rows
   batch.schema

   batch[1]
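
If you already have a :class:`~pyarrow.Schema`, recent PyArrow versions also
accept it in place of the list of names (a sketch, assuming the ``schema``
keyword of ``from_arrays`` is available):

.. ipython:: python

   batch_schema = pa.schema([('f0', pa.int64()),
                             ('f1', pa.string()),
                             ('f2', pa.bool_())])
   pa.RecordBatch.from_arrays(data, schema=batch_schema)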

Like an array, a record batch can be sliced without copying memory:

.. ipython:: python

   batch2 = batch.slice(1, 3)
   batch2[1]

.. _data.table:

Tables
------

The PyArrow :class:`~.Table` type is not part of the Apache Arrow
specification, but is rather a tool to help with wrangling multiple record
batches and array pieces as a single logical dataset. As a relevant example, we
may receive multiple small record batches in a socket stream, then need to
concatenate them into contiguous memory for use in NumPy or pandas. The Table
object makes this efficient without requiring additional memory copying.

Considering the record batch we created above, we can create a Table containing
one or more copies of the batch using ``Table.from_batches``:

.. ipython:: python

   batches = [batch] * 5
   table = pa.Table.from_batches(batches)
   table
   table.num_rows

The table's columns are instances of :class:`~.ChunkedArray`, which is a
container for one or more arrays of the same type.

.. ipython:: python

   c = table[0]
   c
   c.num_chunks
   c.chunk(0)
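
A :class:`~.ChunkedArray` still behaves like a single logical array; for
instance, ``len`` gives the total length across all chunks, and ``chunks``
exposes the underlying list of Array objects. A quick illustration:

.. ipython:: python

   len(c)
   c.chunks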

As you'll see in the :ref:`pandas section <pandas_interop>`, we can convert
these objects to contiguous NumPy arrays for use in pandas:

.. ipython:: python

   c.to_pandas()

Multiple tables can also be concatenated together to form a single table using
``pyarrow.concat_tables``, if the schemas are equal:

.. ipython:: python

   tables = [table] * 2
   table_all = pa.concat_tables(tables)
   table_all.num_rows
   c = table_all[0]
   c.num_chunks

This is similar to ``Table.from_batches``, but uses tables as input instead of
record batches. Record batches can be made into tables, but not the other way
around, so if your data is already in table form, then use
``pyarrow.concat_tables``.

Custom Schema and Field Metadata
--------------------------------

TODO