.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. currentmodule:: pyarrow
.. _parquet:

Reading and Writing the Apache Parquet Format
=============================================

The `Apache Parquet <http://parquet.apache.org/>`_ project provides a
standardized open-source columnar storage format for use in data analysis
systems. It was created originally for use in `Apache Hadoop
<http://hadoop.apache.org/>`_ with systems like `Apache Drill
<http://drill.apache.org>`_, `Apache Hive <http://hive.apache.org>`_, `Apache
Impala <http://impala.apache.org>`_, and `Apache Spark
<http://spark.apache.org>`_ adopting it as a shared standard for high
performance data IO.

Apache Arrow is an ideal in-memory transport layer for data that is being read
or written with Parquet files. We have been concurrently developing the `C++
implementation of Apache Parquet <http://github.com/apache/parquet-cpp>`_,
which includes a native, multithreaded C++ adapter to and from in-memory Arrow
data. PyArrow includes Python bindings to this code, which thus enables reading
and writing Parquet files with pandas as well.

Obtaining pyarrow with Parquet Support
--------------------------------------

If you installed ``pyarrow`` with pip or conda, it should be built with Parquet
support bundled:

.. ipython:: python

   import pyarrow.parquet as pq

If you are building ``pyarrow`` from source, you must use
``-DARROW_PARQUET=ON`` when compiling the C++ libraries and enable the Parquet
extensions when building ``pyarrow``. See the :ref:`Python Development
<python-development>` page for more details.

Reading and Writing Single Files
--------------------------------

The functions :func:`~.parquet.read_table` and :func:`~.parquet.write_table`
read and write the :ref:`pyarrow.Table <data.table>` object, respectively.

Let's look at a simple table:

.. ipython:: python

   import numpy as np
   import pandas as pd
   import pyarrow as pa

   df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                      'two': ['foo', 'bar', 'baz'],
                      'three': [True, False, True]},
                      index=list('abc'))
   table = pa.Table.from_pandas(df)

We write this to Parquet format with ``write_table``:

.. ipython:: python

   import pyarrow.parquet as pq
   pq.write_table(table, 'example.parquet')

This creates a single Parquet file. In practice, a Parquet dataset may consist
of many files in many directories. We can read a single file back with
``read_table``:

.. ipython:: python

   table2 = pq.read_table('example.parquet')
   table2.to_pandas()

You can pass a subset of columns to read, which can be much faster than reading
the whole file (due to the columnar layout):

.. ipython:: python

   pq.read_table('example.parquet', columns=['one', 'three'])

When reading a subset of columns from a file that used a pandas DataFrame as the
source, we use ``read_pandas`` to maintain any additional index column data:

.. ipython:: python

   pq.read_pandas('example.parquet', columns=['two']).to_pandas()

We need not use a string to specify the origin of the file. It can be any of:

* A file path as a string
* A :ref:`NativeFile <io.native_file>` from PyArrow
* A Python file object

In general, a Python file object will have the worst read performance, while a
string file path or an instance of :class:`~.NativeFile` (especially memory
maps) will perform the best.
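
For example, a memory-mapped :class:`~.NativeFile` or a plain Python file
object can be passed directly to ``read_table``. A minimal sketch, reusing the
``example.parquet`` file written above:

.. code-block:: python

   import pyarrow as pa
   import pyarrow.parquet as pq

   # A memory-mapped NativeFile usually gives the best read performance
   with pa.memory_map('example.parquet', 'r') as source:
       table_from_map = pq.read_table(source)

   # A plain Python file object also works, but is generally the slowest option
   with open('example.parquet', 'rb') as f:
       table_from_file_object = pq.read_table(f)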

.. _parquet_mmap:

Reading Parquet and Memory Mapping
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because Parquet data needs to be decoded from the Parquet format
and compression, it can't be directly mapped from disk.
Thus the ``memory_map`` option might perform better on some systems
but won't help much with resident memory consumption.

.. code-block:: python

      >>> pq_array = pa.parquet.read_table("area1.parquet", memory_map=True)
      >>> print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
      RSS: 4299MB

      >>> pq_array = pa.parquet.read_table("area1.parquet", memory_map=False)
      >>> print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
      RSS: 4299MB

If you need to deal with Parquet data bigger than memory,
the :ref:`dataset` API and partitioning are probably what you are looking for.

Parquet file writing options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:func:`~pyarrow.parquet.write_table()` has a number of options to
control various settings when writing a Parquet file.

* ``version``, the Parquet format version to use.  ``'1.0'`` ensures
  compatibility with older readers, while ``'2.4'`` and greater values
  enable more Parquet types and encodings.
* ``data_page_size``, to control the approximate size of encoded data
  pages within a column chunk. This currently defaults to 1MB.
* ``flavor``, to set compatibility options particular to a Parquet
  consumer like ``'spark'`` for Apache Spark.

See the :func:`~pyarrow.parquet.write_table()` docstring for more details.
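
For example, a few of these options might be combined in one call. A minimal
sketch, assuming the ``table`` created earlier and a hypothetical output file
name:

.. code-block:: python

   # Write with an explicit format version, smaller data pages, and
   # Spark-compatible settings (option names as listed above)
   pq.write_table(table, 'example_options.parquet',
                  version='2.4',
                  data_page_size=512 * 1024,
                  flavor='spark')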

There are some additional data type handling-specific options
described below.

Omitting the DataFrame index
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using ``pa.Table.from_pandas`` to convert to an Arrow table, by default
one or more special columns are added to keep track of the index (row
labels). Storing the index takes extra space, so if your index is not valuable,
you may choose to omit it by passing ``preserve_index=False``:

.. ipython:: python

   df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                      'two': ['foo', 'bar', 'baz'],
                      'three': [True, False, True]},
                      index=list('abc'))
   df
   table = pa.Table.from_pandas(df, preserve_index=False)

Then we have:

.. ipython:: python

   pq.write_table(table, 'example_noindex.parquet')
   t = pq.read_table('example_noindex.parquet')
   t.to_pandas()

Here you see the index did not survive the round trip.

Finer-grained Reading and Writing
---------------------------------

``read_table`` uses the :class:`~.ParquetFile` class, which has other features:

.. ipython:: python

   parquet_file = pq.ParquetFile('example.parquet')
   parquet_file.metadata
   parquet_file.schema

As you can learn from the `Apache Parquet format
<https://github.com/apache/parquet-format>`_ specification, a Parquet file
consists of multiple row groups. ``read_table`` will read all of the row groups
and concatenate them into a single table. You can read individual row groups
with ``read_row_group``:

.. ipython:: python

   parquet_file.num_row_groups
   parquet_file.read_row_group(0)

We can similarly write a Parquet file with multiple row groups by using
``ParquetWriter``:

.. ipython:: python

   with pq.ParquetWriter('example2.parquet', table.schema) as writer:
      for i in range(3):
         writer.write_table(table)

   pf2 = pq.ParquetFile('example2.parquet')
   pf2.num_row_groups

Inspecting the Parquet File Metadata
------------------------------------

The ``FileMetaData`` of a Parquet file can be accessed through
:class:`~.ParquetFile` as shown above:

.. ipython:: python

   parquet_file = pq.ParquetFile('example.parquet')
   metadata = parquet_file.metadata

or can also be read directly using :func:`~parquet.read_metadata`:

.. ipython:: python

   metadata = pq.read_metadata('example.parquet')
   metadata

The returned ``FileMetaData`` object lets you inspect the
`Parquet file metadata <https://github.com/apache/parquet-format#metadata>`__,
such as the row group and column chunk metadata and statistics:

.. ipython:: python

   metadata.row_group(0)
   metadata.row_group(0).column(0)

.. ipython:: python
   :suppress:

   !rm example.parquet
   !rm example_noindex.parquet
   !rm example2.parquet

Data Type Handling
------------------

Reading types as DictionaryArray
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``read_dictionary`` option in ``read_table`` and ``ParquetDataset`` will
cause columns to be read as ``DictionaryArray``, which will become
``pandas.Categorical`` when converted to pandas. This option is only valid for
string and binary column types, and it can yield significantly lower memory use
and improved performance for columns with many repeated string values.

.. code-block:: python

   pq.read_table(where, read_dictionary=['binary_c0', 'stringb_c2'])

Storing timestamps
~~~~~~~~~~~~~~~~~~

Some Parquet readers may only support timestamps stored in millisecond
(``'ms'``) or microsecond (``'us'``) resolution. Since pandas uses nanoseconds
to represent timestamps, this can occasionally be a nuisance. By default
(when writing version 1.0 Parquet files), the nanoseconds will be cast to
microseconds (``'us'``).

In addition, we provide the ``coerce_timestamps`` option to allow you to select
the desired resolution:

.. code-block:: python

   pq.write_table(table, where, coerce_timestamps='ms')

If a cast to a lower resolution would result in a loss of data, an exception
is raised by default. This can be suppressed by passing
``allow_truncated_timestamps=True``:

.. code-block:: python

   pq.write_table(table, where, coerce_timestamps='ms',
                  allow_truncated_timestamps=True)

Timestamps with nanosecond resolution can be stored without casting when using
the more recent Parquet format version 2.6:

.. code-block:: python

   pq.write_table(table, where, version='2.6')

However, many Parquet readers do not yet support this newer format version, and
therefore the default is to write version 1.0 files. When compatibility across
different processing frameworks is required, it is recommended to use the
default version 1.0.

Older Parquet implementations use ``INT96`` based storage of
timestamps, but this is now deprecated. This includes some older
versions of Apache Impala and Apache Spark. To write timestamps in
this format, set the ``use_deprecated_int96_timestamps`` option to
``True`` in ``write_table``:

.. code-block:: python

   pq.write_table(table, where, use_deprecated_int96_timestamps=True)

Compression, Encoding, and File Compatibility
---------------------------------------------

The most commonly used Parquet implementations use dictionary encoding when
writing files; if the dictionaries grow too large, then they "fall back" to
plain encoding. Whether dictionary encoding is used can be toggled using the
``use_dictionary`` option:

.. code-block:: python

   pq.write_table(table, where, use_dictionary=False)

The data pages within a column in a row group can be compressed after the
encoding passes (dictionary, RLE encoding). In PyArrow we use Snappy
compression by default, but Brotli, Gzip, and uncompressed are also supported:

.. code-block:: python

   pq.write_table(table, where, compression='snappy')
   pq.write_table(table, where, compression='gzip')
   pq.write_table(table, where, compression='brotli')
   pq.write_table(table, where, compression='none')

Snappy generally results in better performance, while Gzip may yield smaller
files.

These settings can also be set on a per-column basis:

.. code-block:: python

   pq.write_table(table, where, compression={'foo': 'snappy', 'bar': 'gzip'},
                  use_dictionary=['foo', 'bar'])

Partitioned Datasets (Multiple Files)
------------------------------------------------

Multiple Parquet files constitute a Parquet *dataset*. These may be presented
in a number of ways:

* A list of Parquet absolute file paths
* A directory name containing nested directories defining a partitioned dataset

A dataset partitioned by year and month may look like this on disk:

.. code-block:: text

   dataset_name/
     year=2007/
       month=01/
          0.parq
          1.parq
          ...
       month=02/
          0.parq
          1.parq
          ...
       month=03/
       ...
     year=2008/
       month=01/
       ...
     ...
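
Either form can be passed to :class:`~.ParquetDataset`, which is described in
more detail below. A minimal sketch, using hypothetical paths matching the
layout above:

.. code-block:: python

   # Point at the dataset through an explicit list of file paths ...
   dataset = pq.ParquetDataset(['dataset_name/year=2007/month=01/0.parq',
                                'dataset_name/year=2007/month=01/1.parq'])

   # ... or through the root directory of the partitioned dataset
   dataset = pq.ParquetDataset('dataset_name/')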

Writing to Partitioned Datasets
-------------------------------

You can write a partitioned dataset for any ``pyarrow`` file system that is a
file-store (e.g. local, HDFS, S3). The default behaviour when no filesystem is
provided is to use the local filesystem.

.. code-block:: python

   # Local dataset write
   pq.write_to_dataset(table, root_path='dataset_name',
                       partition_cols=['one', 'two'])

The root path in this case specifies the parent directory to which data will be
saved. The partition columns are the column names by which to partition the
dataset. Columns are partitioned in the order they are given. The partition
splits are determined by the unique values in the partition columns.

To use another filesystem you only need to add the filesystem parameter; the
individual table writes are wrapped using ``with`` statements, so the
``pq.write_to_dataset`` function does not need to be.

.. code-block:: python

   # Remote file-system example
   from pyarrow.fs import HadoopFileSystem
   fs = HadoopFileSystem(host, port, user=user, kerb_ticket=ticket_cache_path)
   pq.write_to_dataset(table, root_path='dataset_name',
                       partition_cols=['one', 'two'], filesystem=fs)

Compatibility Note: if using ``pq.write_to_dataset`` to create a table that
will then be used by Hive, then partition column values must be compatible with
the allowed character set of the Hive version you are running.

Writing ``_metadata`` and ``_common_metadata`` files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some processing frameworks such as Spark or Dask (optionally) use ``_metadata``
and ``_common_metadata`` files with partitioned datasets.

Those files include information about the schema of the full dataset (for
``_common_metadata``) and potentially all row group metadata of all files in the
partitioned dataset as well (for ``_metadata``). The actual files are
metadata-only Parquet files. Note this is not a Parquet standard, but a
convention set in practice by those frameworks.

Using those files can give a more efficient creation of a Parquet dataset,
since it can use the stored schema and file paths of all row groups,
instead of inferring the schema and crawling the directories for all Parquet
files (this is especially the case for filesystems where accessing files
is expensive).

The :func:`~pyarrow.parquet.write_to_dataset` function does not automatically
write such metadata files, but you can use it to gather the metadata and
combine and write them manually:

.. code-block:: python

   # Write a dataset and collect metadata information of all written files
   metadata_collector = []
   pq.write_to_dataset(table, root_path, metadata_collector=metadata_collector)

   # Write the ``_common_metadata`` parquet file without row group statistics
   pq.write_metadata(table.schema, root_path / '_common_metadata')

   # Write the ``_metadata`` parquet file with row group statistics of all files
   pq.write_metadata(
       table.schema, root_path / '_metadata',
       metadata_collector=metadata_collector
   )

When not using the :func:`~pyarrow.parquet.write_to_dataset` function, but
writing the individual files of the partitioned dataset using
:func:`~pyarrow.parquet.write_table` or :class:`~pyarrow.parquet.ParquetWriter`,
the ``metadata_collector`` keyword can also be used to collect the FileMetaData
of the written files. In this case, you need to make sure to set the file path
contained in the row group metadata yourself before combining the metadata, and
the schemas of all the different files and collected FileMetaData objects
should be the same:

.. code-block:: python

   metadata_collector = []
   pq.write_table(
       table1, root_path / "year=2017/data1.parquet",
       metadata_collector=metadata_collector
   )

   # set the file path relative to the root of the partitioned dataset
   metadata_collector[-1].set_file_path("year=2017/data1.parquet")

   # combine and write the metadata
   metadata = metadata_collector[0]
   for _meta in metadata_collector[1:]:
       metadata.append_row_groups(_meta)
   metadata.write_metadata_file(root_path / "_metadata")

   # or use pq.write_metadata to combine and write in a single step
   pq.write_metadata(
       table1.schema, root_path / "_metadata",
       metadata_collector=metadata_collector
   )

Reading from Partitioned Datasets
------------------------------------------------

The :class:`~.ParquetDataset` class accepts either a directory name or a list
of file paths, and can discover and infer some common partition structures,
such as those produced by Hive:

.. code-block:: python

   dataset = pq.ParquetDataset('dataset_name/')
   table = dataset.read()

You can also use the convenience function ``read_table`` exposed by
``pyarrow.parquet`` that avoids the need for an additional Dataset object
creation step:

.. code-block:: python

   table = pq.read_table('dataset_name')

Note: the partition columns in the original table will have their types
converted to Arrow dictionary types (pandas categorical) on load. Ordering of
partition columns is not preserved through the save/load process. If reading
from a remote filesystem into a pandas DataFrame you may need to run
``sort_index`` to maintain row ordering (as long as the ``preserve_index``
option was enabled on write).
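
Filtering at read time is also possible with the ``filters`` keyword of
``read_table`` and :class:`~.ParquetDataset`. A minimal sketch against the
year/month layout shown earlier; with a Hive-style layout such a filter can
prune entire partition directories:

.. code-block:: python

   # Only read the rows belonging to the year=2007 partitions
   table = pq.read_table('dataset_name', filters=[('year', '=', 2007)])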

.. note::

   The ParquetDataset is being reimplemented based on the new generic Dataset
   API (see the :ref:`dataset` docs for an overview). This is not yet the
   default, but can already be enabled by passing the ``use_legacy_dataset=False``
   keyword to :class:`ParquetDataset` or :func:`read_table`::

      pq.ParquetDataset('dataset_name/', use_legacy_dataset=False)

   Enabling this gives the following new features:

   - Filtering on all columns (using row group statistics) instead of only on
     the partition keys.
   - More fine-grained partitioning: support for a directory partitioning scheme
     in addition to the Hive-like partitioning (e.g. "/2019/11/15/" instead of
     "/year=2019/month=11/day=15/"), and the ability to specify a schema for
     the partition keys.
   - General performance improvements and bug fixes.

   It also has the following changes in behaviour:

   - The partition keys need to be explicitly included in the ``columns``
     keyword when you want to include them in the result while reading a
     subset of the columns.

   This new implementation is already enabled in ``read_table``, and in the
   future, this will be turned on by default for ``ParquetDataset``. The new
   implementation does not yet cover all existing ParquetDataset features (e.g.
   specifying the ``metadata``, or the ``pieces`` property API). Feedback is
   very welcome.


Using with Spark
----------------

Spark places some constraints on the types of Parquet files it will read. The
option ``flavor='spark'`` will set these options automatically and also
sanitize field characters unsupported by Spark SQL.
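
For example, a minimal sketch assuming the ``table`` from earlier and a
hypothetical output file name:

.. code-block:: python

   # Apply the Spark compatibility settings while writing
   pq.write_table(table, 'example_spark.parquet', flavor='spark')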

Multithreaded Reads
-------------------

Each of the reading functions uses multiple threads by default to read
columns in parallel. Depending on the speed of IO
and how expensive it is to decode the columns in a particular file
(particularly with GZIP compression), this can yield significantly higher data
throughput.

This can be disabled by specifying ``use_threads=False``.
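
For example, a minimal sketch reusing the ``example.parquet`` file from above:

.. code-block:: python

   # Read with a single thread instead of the default parallel column reads
   table = pq.read_table('example.parquet', use_threads=False)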

.. note::
   The number of threads to use concurrently is automatically inferred by Arrow
   and can be inspected using the :func:`~pyarrow.cpu_count()` function.

Reading from cloud storage
--------------------------

In addition to local files, pyarrow supports other filesystems, such as cloud
filesystems, through the ``filesystem`` keyword:

.. code-block:: python

    from pyarrow import fs

    s3 = fs.S3FileSystem(region="us-east-2")
    table = pq.read_table("bucket/object/key/prefix", filesystem=s3)

Currently, :class:`HDFS <pyarrow.fs.HadoopFileSystem>` and
:class:`Amazon S3-compatible storage <pyarrow.fs.S3FileSystem>` are
supported. See the :ref:`filesystem` docs for more details. For those
built-in filesystems, the filesystem can also be inferred from the file path,
if specified as a URI:

.. code-block:: python

    table = pq.read_table("s3://bucket/object/key/prefix")

Other filesystems can still be supported if there is an
`fsspec <https://filesystem-spec.readthedocs.io/en/latest/>`__-compatible
implementation available. See :ref:`filesystem-fsspec` for more details.
One example is Azure Blob storage, which can be interfaced through the
`adlfs <https://github.com/dask/adlfs>`__ package:

.. code-block:: python

    from adlfs import AzureBlobFileSystem

    abfs = AzureBlobFileSystem(account_name="XXXX", account_key="XXXX", container_name="XXXX")
    table = pq.read_table("file.parquet", filesystem=abfs)
598