.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. currentmodule:: pyarrow
.. _parquet:

Reading and Writing the Apache Parquet Format
=============================================

The `Apache Parquet <http://parquet.apache.org/>`_ project provides a
standardized open-source columnar storage format for use in data analysis
systems. It was created originally for use in `Apache Hadoop
<http://hadoop.apache.org/>`_ with systems like `Apache Drill
<http://drill.apache.org>`_, `Apache Hive <http://hive.apache.org>`_, `Apache
Impala <http://impala.apache.org>`_, and `Apache Spark
<http://spark.apache.org>`_ adopting it as a shared standard for high
performance data IO.

Apache Arrow is an ideal in-memory transport layer for data that is being read
or written with Parquet files. We have been concurrently developing the `C++
implementation of Apache Parquet <http://github.com/apache/parquet-cpp>`_,
which includes a native, multithreaded C++ adapter to and from in-memory Arrow
data. PyArrow includes Python bindings to this code, which thus enables reading
and writing Parquet files with pandas as well.

Obtaining pyarrow with Parquet Support
--------------------------------------

If you installed ``pyarrow`` with pip or conda, it should be built with Parquet
support bundled:

.. ipython:: python

   import pyarrow.parquet as pq

If you are building ``pyarrow`` from source, you must use
``-DARROW_PARQUET=ON`` when compiling the C++ libraries and enable the Parquet
extensions when building ``pyarrow``. See the :ref:`Python Development
<python-development>` page for more details.

Reading and Writing Single Files
--------------------------------

The functions :func:`~.parquet.read_table` and :func:`~.parquet.write_table`
read and write the :ref:`pyarrow.Table <data.table>` object, respectively.

Let's look at a simple table:

.. ipython:: python

   import numpy as np
   import pandas as pd
   import pyarrow as pa

   df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                      'two': ['foo', 'bar', 'baz'],
                      'three': [True, False, True]},
                      index=list('abc'))
   table = pa.Table.from_pandas(df)

We write this to Parquet format with ``write_table``:

.. ipython:: python

   import pyarrow.parquet as pq
   pq.write_table(table, 'example.parquet')

This creates a single Parquet file. In practice, a Parquet dataset may consist
of many files in many directories. We can read a single file back with
``read_table``:

.. ipython:: python

   table2 = pq.read_table('example.parquet')
   table2.to_pandas()

You can pass a subset of columns to read, which can be much faster than reading
the whole file (due to the columnar layout):

.. ipython:: python

   pq.read_table('example.parquet', columns=['one', 'three'])

When reading a subset of columns from a file that used a pandas DataFrame as the
source, we use ``read_pandas`` to maintain any additional index column data:

.. ipython:: python

   pq.read_pandas('example.parquet', columns=['two']).to_pandas()

We need not use a string to specify the origin of the file. It can be any of:

* A file path as a string
* A :ref:`NativeFile <io.native_file>` from PyArrow
* A Python file object

In general, a Python file object will have the worst read performance, while a
string file path or an instance of :class:`~.NativeFile` (especially memory
maps) will perform the best.
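
For example, the ``example.parquet`` file written above can be opened either as
a memory-mapped :class:`~.NativeFile` or as a plain Python file object. A
minimal sketch:

.. code-block:: python

   # Memory-mapped NativeFile: usually the fastest way to read
   with pa.memory_map('example.parquet') as source:
       table_mmap = pq.read_table(source)

   # Plain Python file object: supported, but usually the slowest option
   with open('example.parquet', 'rb') as source:
       table_pyfile = pq.read_table(source)
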

.. _parquet_mmap:

Reading Parquet and Memory Mapping
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because Parquet data needs to be decoded from the Parquet format
and compression, it can't be directly mapped from disk.
Thus the ``memory_map`` option might perform better on some systems
but won't help much with resident memory consumption.

.. code-block:: python

   >>> pq_array = pa.parquet.read_table("area1.parquet", memory_map=True)
   >>> print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
   RSS: 4299MB

   >>> pq_array = pa.parquet.read_table("area1.parquet", memory_map=False)
   >>> print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
   RSS: 4299MB

If you need to deal with Parquet data bigger than memory,
the :ref:`dataset` API and partitioning are probably what you are looking for.

Parquet file writing options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:func:`~pyarrow.parquet.write_table()` has a number of options to
control various settings when writing a Parquet file.

* ``version``, the Parquet format version to use. ``'1.0'`` ensures
  compatibility with older readers, while ``'2.4'`` and greater values
  enable more Parquet types and encodings.
* ``data_page_size``, to control the approximate size of encoded data
  pages within a column chunk. This currently defaults to 1MB.
* ``flavor``, to set compatibility options particular to a Parquet
  consumer like ``'spark'`` for Apache Spark.

See the :func:`~pyarrow.parquet.write_table()` docstring for more details.
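
As a rough sketch, several of these options can be combined in a single call
(here ``where`` stands for a destination path or file object, as in the
snippets further below):

.. code-block:: python

   # Write with a newer format version and larger data pages
   pq.write_table(table, where, version='2.4',
                  data_page_size=2 * 1024 * 1024)
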

There are some additional data type handling-specific options
described below.

Omitting the DataFrame index
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using ``pa.Table.from_pandas`` to convert to an Arrow table, by default
one or more special columns are added to keep track of the index (row
labels). Storing the index takes extra space, so if your index is not valuable,
you may choose to omit it by passing ``preserve_index=False``:

.. ipython:: python

   df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                      'two': ['foo', 'bar', 'baz'],
                      'three': [True, False, True]},
                      index=list('abc'))
   df
   table = pa.Table.from_pandas(df, preserve_index=False)

Then we have:

.. ipython:: python

   pq.write_table(table, 'example_noindex.parquet')
   t = pq.read_table('example_noindex.parquet')
   t.to_pandas()

Here you see the index did not survive the round trip.

Finer-grained Reading and Writing
---------------------------------

``read_table`` uses the :class:`~.ParquetFile` class, which has other features:

.. ipython:: python

   parquet_file = pq.ParquetFile('example.parquet')
   parquet_file.metadata
   parquet_file.schema

As described in the `Apache Parquet format
<https://github.com/apache/parquet-format>`_ specification, a Parquet file
consists of multiple row groups. ``read_table`` will read all of the row groups
and concatenate them into a single table. You can read individual row groups
with ``read_row_group``:

.. ipython:: python

   parquet_file.num_row_groups
   parquet_file.read_row_group(0)

We can similarly write a Parquet file with multiple row groups by using
``ParquetWriter``:

.. ipython:: python

   with pq.ParquetWriter('example2.parquet', table.schema) as writer:
       for i in range(3):
           writer.write_table(table)

   pf2 = pq.ParquetFile('example2.parquet')
   pf2.num_row_groups

Inspecting the Parquet File Metadata
------------------------------------

The ``FileMetaData`` of a Parquet file can be accessed through
:class:`~.ParquetFile` as shown above:

.. ipython:: python

   parquet_file = pq.ParquetFile('example.parquet')
   metadata = parquet_file.metadata

or can also be read directly using :func:`~parquet.read_metadata`:

.. ipython:: python

   metadata = pq.read_metadata('example.parquet')
   metadata

The returned ``FileMetaData`` object allows you to inspect the
`Parquet file metadata <https://github.com/apache/parquet-format#metadata>`__,
such as the row groups and column chunk metadata and statistics:

.. ipython:: python

   metadata.row_group(0)
   metadata.row_group(0).column(0)

.. ipython:: python
   :suppress:

   !rm example.parquet
   !rm example_noindex.parquet
   !rm example2.parquet

Data Type Handling
------------------

Reading types as DictionaryArray
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``read_dictionary`` option in ``read_table`` and ``ParquetDataset`` will
cause columns to be read as ``DictionaryArray``, which will become
``pandas.Categorical`` when converted to pandas. This option is only valid for
string and binary column types, and it can yield significantly lower memory use
and improved performance for columns with many repeated string values.

.. code-block:: python

   pq.read_table(where, read_dictionary=['binary_c0', 'stringb_c2'])
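
For instance, reusing the ``example.parquet`` file from above (a minimal
sketch, with the column name taken from that example table), the string column
``'two'`` can be read as dictionary-encoded data:

.. code-block:: python

   # Read the string column 'two' as dictionary-encoded data
   t = pq.read_table('example.parquet', read_dictionary=['two'])
   t.schema.field('two').type   # dictionary<values=string, ...>
   t.to_pandas()['two'].dtype   # pandas CategoricalDtype
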

Storing timestamps
~~~~~~~~~~~~~~~~~~

Some Parquet readers may only support timestamps stored in millisecond
(``'ms'``) or microsecond (``'us'``) resolution. Since pandas uses nanoseconds
to represent timestamps, this can occasionally be a nuisance. By default
(when writing version 1.0 Parquet files), the nanoseconds will be cast to
microseconds ('us').

In addition, we provide the ``coerce_timestamps`` option to allow you to select
the desired resolution:

.. code-block:: python

   pq.write_table(table, where, coerce_timestamps='ms')

If a cast to a lower resolution value would result in a loss of data, an
exception is raised by default. This can be suppressed by passing
``allow_truncated_timestamps=True``:

.. code-block:: python

   pq.write_table(table, where, coerce_timestamps='ms',
                  allow_truncated_timestamps=True)

Timestamps with nanoseconds can be stored without casting when using the
more recent Parquet format version 2.6:

.. code-block:: python

   pq.write_table(table, where, version='2.6')

However, many Parquet readers do not yet support this newer format version, and
therefore the default is to write version 1.0 files. When compatibility across
different processing frameworks is required, it is recommended to use the
default version 1.0.

Older Parquet implementations, including some older versions of Apache Impala
and Apache Spark, use ``INT96``-based storage of timestamps, but this is now
deprecated. To write timestamps in this format, set the
``use_deprecated_int96_timestamps`` option to ``True`` in ``write_table``.

.. code-block:: python

   pq.write_table(table, where, use_deprecated_int96_timestamps=True)

Compression, Encoding, and File Compatibility
---------------------------------------------

The most commonly used Parquet implementations use dictionary encoding when
writing files; if the dictionaries grow too large, then they "fall back" to
plain encoding. Whether dictionary encoding is used can be toggled using the
``use_dictionary`` option:

.. code-block:: python

   pq.write_table(table, where, use_dictionary=False)

The data pages within a column in a row group can be compressed after the
encoding passes (dictionary, RLE encoding). In PyArrow we use Snappy
compression by default, but Brotli, Gzip, and uncompressed are also supported:

.. code-block:: python

   pq.write_table(table, where, compression='snappy')
   pq.write_table(table, where, compression='gzip')
   pq.write_table(table, where, compression='brotli')
   pq.write_table(table, where, compression='none')

Snappy generally results in better performance, while Gzip may yield smaller
files.

These settings can also be set on a per-column basis:

.. code-block:: python

   pq.write_table(table, where, compression={'foo': 'snappy', 'bar': 'gzip'},
                  use_dictionary=['foo', 'bar'])

Partitioned Datasets (Multiple Files)
-------------------------------------

Multiple Parquet files constitute a Parquet *dataset*. These may be presented
in a number of ways:

* A list of Parquet absolute file paths
* A directory name containing nested directories defining a partitioned dataset

A dataset partitioned by year and month may look like this on disk:

.. code-block:: text

   dataset_name/
     year=2007/
       month=01/
          0.parq
          1.parq
          ...
       month=02/
          0.parq
          1.parq
          ...
       month=03/
       ...
     year=2008/
       month=01/
       ...
     ...

Writing to Partitioned Datasets
-------------------------------

You can write a partitioned dataset for any ``pyarrow`` file system that is a
file-store (e.g. local, HDFS, S3). The default behaviour when no filesystem is
provided is to use the local filesystem.

.. code-block:: python

   # Local dataset write
   pq.write_to_dataset(table, root_path='dataset_name',
                       partition_cols=['one', 'two'])

The root path in this case specifies the parent directory to which data will be
saved. The partition columns are the column names by which to partition the
dataset. Columns are partitioned in the order they are given. The partition
splits are determined by the unique values in the partition columns.

To use another filesystem you only need to add the ``filesystem`` parameter;
the individual table writes are wrapped using ``with`` statements internally,
so the ``pq.write_to_dataset`` call does not need to be.

.. code-block:: python

   # Remote file-system example
   from pyarrow.fs import HadoopFileSystem
   fs = HadoopFileSystem(host, port, user=user, kerb_ticket=ticket_cache_path)
   pq.write_to_dataset(table, root_path='dataset_name',
                       partition_cols=['one', 'two'], filesystem=fs)

Compatibility Note: if you use ``pq.write_to_dataset`` to create a table that
will then be used by Hive, partition column values must be compatible with
the allowed character set of the Hive version you are running.

Writing ``_metadata`` and ``_common_metadata`` files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some processing frameworks such as Spark or Dask (optionally) use ``_metadata``
and ``_common_metadata`` files with partitioned datasets.

Those files include information about the schema of the full dataset (for
``_common_metadata``) and potentially all row group metadata of all files in the
partitioned dataset as well (for ``_metadata``). The actual files are
metadata-only Parquet files. Note this is not a Parquet standard, but a
convention set in practice by those frameworks.

Using those files can give a more efficient creation of a Parquet dataset,
since it can use the stored schema and file paths of all row groups,
instead of inferring the schema and crawling the directories for all Parquet
files (this is especially the case for filesystems where accessing files
is expensive).

The :func:`~pyarrow.parquet.write_to_dataset` function does not automatically
write such metadata files, but you can use it to gather the metadata and
combine and write them manually:

.. code-block:: python

   # Write a dataset and collect metadata information of all written files
   metadata_collector = []
   pq.write_to_dataset(table, root_path, metadata_collector=metadata_collector)

   # Write the ``_common_metadata`` parquet file without row groups statistics
   pq.write_metadata(table.schema, root_path / '_common_metadata')

   # Write the ``_metadata`` parquet file with row groups statistics of all files
   pq.write_metadata(
       table.schema, root_path / '_metadata',
       metadata_collector=metadata_collector
   )

When not using the :func:`~pyarrow.parquet.write_to_dataset` function, but
writing the individual files of the partitioned dataset using
:func:`~pyarrow.parquet.write_table` or :class:`~pyarrow.parquet.ParquetWriter`,
the ``metadata_collector`` keyword can also be used to collect the FileMetaData
of the written files. In this case, you need to make sure to set the file path
contained in the row group metadata yourself before combining the metadata, and
the schemas of all the different files and collected FileMetaData objects should
be the same:

.. code-block:: python

   metadata_collector = []
   pq.write_table(
       table1, root_path / "year=2017/data1.parquet",
       metadata_collector=metadata_collector
   )

   # set the file path relative to the root of the partitioned dataset
   metadata_collector[-1].set_file_path("year=2017/data1.parquet")

   # combine and write the metadata
   metadata = metadata_collector[0]
   for _meta in metadata_collector[1:]:
       metadata.append_row_groups(_meta)
   metadata.write_metadata_file(root_path / "_metadata")

   # or use pq.write_metadata to combine and write in a single step
   pq.write_metadata(
       table1.schema, root_path / "_metadata",
       metadata_collector=metadata_collector
   )

Reading from Partitioned Datasets
---------------------------------

The :class:`~.ParquetDataset` class accepts either a directory name or a list
of file paths, and can discover and infer some common partition structures,
such as those produced by Hive:

.. code-block:: python

   dataset = pq.ParquetDataset('dataset_name/')
   table = dataset.read()

You can also use the convenience function ``read_table`` exposed by
``pyarrow.parquet`` that avoids the need for an additional Dataset object
creation step.

.. code-block:: python

   table = pq.read_table('dataset_name')

Note: the partition columns in the original table will have their types
converted to Arrow dictionary types (pandas categorical) on load. Ordering of
partition columns is not preserved through the save/load process. If reading
from a remote filesystem into a pandas dataframe you may need to run
``sort_index`` to maintain row ordering (as long as the ``preserve_index``
option was enabled on write).

.. note::

   The ParquetDataset is being reimplemented based on the new generic Dataset
   API (see the :ref:`dataset` docs for an overview). This is not yet the
   default, but can already be enabled by passing the ``use_legacy_dataset=False``
   keyword to :class:`ParquetDataset` or :func:`read_table`::

      pq.ParquetDataset('dataset_name/', use_legacy_dataset=False)

   Enabling this gives the following new features:

   - Filtering on all columns (using row group statistics) instead of only on
     the partition keys.
   - More fine-grained partitioning: support for a directory partitioning scheme
     in addition to the Hive-like partitioning (e.g. "/2019/11/15/" instead of
     "/year=2019/month=11/day=15/"), and the ability to specify a schema for
     the partition keys.
   - General performance improvements and bug fixes.

   It also has the following changes in behaviour:

   - The partition keys need to be explicitly included in the ``columns``
     keyword when you want to include them in the result while reading a
     subset of the columns.

   This new implementation is already enabled in ``read_table``, and in the
   future, this will be turned on by default for ``ParquetDataset``. The new
   implementation does not yet cover all existing ParquetDataset features (e.g.
   specifying the ``metadata``, or the ``pieces`` property API). Feedback is
   very welcome.


Using with Spark
----------------

Spark places some constraints on the types of Parquet files it will read. The
option ``flavor='spark'`` will set these options automatically and also
sanitize field characters unsupported by Spark SQL.
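
For example (a sketch, with ``table`` and ``where`` standing for the table and
destination as in the earlier snippets):

.. code-block:: python

   # Write a file that Spark can read without compatibility issues
   pq.write_table(table, where, flavor='spark')
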

Multithreaded Reads
-------------------

Each of the reading functions by default uses multiple threads to read
columns in parallel. Depending on the speed of IO
and how expensive it is to decode the columns in a particular file
(particularly with GZIP compression), this can yield significantly higher data
throughput.

This can be disabled by specifying ``use_threads=False``.
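
For example (a sketch reusing the file name from the earlier examples):

.. code-block:: python

   # Read on a single thread
   pq.read_table('example.parquet', use_threads=False)
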

.. note::
   The number of threads to use concurrently is automatically inferred by Arrow
   and can be inspected using the :func:`~pyarrow.cpu_count()` function.

Reading from cloud storage
--------------------------

In addition to local files, pyarrow supports other filesystems, such as cloud
filesystems, through the ``filesystem`` keyword:

.. code-block:: python

   from pyarrow import fs

   s3 = fs.S3FileSystem(region="us-east-2")
   table = pq.read_table("bucket/object/key/prefix", filesystem=s3)

Currently, :class:`HDFS <pyarrow.fs.HadoopFileSystem>` and
:class:`Amazon S3-compatible storage <pyarrow.fs.S3FileSystem>` are
supported. See the :ref:`filesystem` docs for more details. For those
built-in filesystems, the filesystem can also be inferred from the file path,
if specified as a URI:

.. code-block:: python

   table = pq.read_table("s3://bucket/object/key/prefix")

Other filesystems can still be supported if there is an
`fsspec <https://filesystem-spec.readthedocs.io/en/latest/>`__-compatible
implementation available. See :ref:`filesystem-fsspec` for more details.
One example is Azure Blob storage, which can be interfaced through the
`adlfs <https://github.com/dask/adlfs>`__ package.

.. code-block:: python

   from adlfs import AzureBlobFileSystem

   abfs = AzureBlobFileSystem(account_name="XXXX", account_key="XXXX", container_name="XXXX")
   table = pq.read_table("file.parquet", filesystem=abfs)