.. currentmodule:: h5py
.. _dataset:


Datasets
========

Datasets are very similar to NumPy arrays. They are homogeneous collections of
data elements, with an immutable datatype and (hyper)rectangular shape.
Unlike NumPy arrays, they support a variety of transparent storage features
such as compression, error-detection, and chunked I/O.

They are represented in h5py by a thin proxy class which supports familiar
NumPy operations like slicing, along with a variety of descriptive attributes:

 - **shape** attribute
 - **size** attribute
 - **ndim** attribute
 - **dtype** attribute
 - **nbytes** attribute

h5py supports most NumPy dtypes, and uses the same character codes (e.g.
``'f'``, ``'i8'``) and dtype machinery as
`NumPy <https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html>`_.
See :ref:`faq` for the list of dtypes h5py supports.


.. _dataset_create:

Creating datasets
-----------------

New datasets are created using either :meth:`Group.create_dataset` or
:meth:`Group.require_dataset`. Existing datasets should be retrieved using
the group indexing syntax (``dset = group["name"]``).

To initialise a dataset, all you have to do is specify a name, shape, and
optionally the data type (defaults to ``'f'``)::

    >>> dset = f.create_dataset("default", (100,))
    >>> dset = f.create_dataset("ints", (100,), dtype='i8')

.. note:: This is not the same as creating an :ref:`Empty dataset <dataset_empty>`.

You may also initialize the dataset to an existing NumPy array by providing
the ``data`` parameter::

    >>> arr = np.arange(100)
    >>> dset = f.create_dataset("init", data=arr)

Keywords ``shape`` and ``dtype`` may be specified along with ``data``; if so,
they will override ``data.shape`` and ``data.dtype``. It's required that
(1) the total number of points in ``shape`` match the total number of points
in ``data.shape``, and that (2) it's possible to cast ``data.dtype`` to
the requested ``dtype``.

.. _dataset_slicing:

Reading & writing data
----------------------

HDF5 datasets re-use the NumPy slicing syntax to read and write to the file.
Slice specifications are translated directly to HDF5 "hyperslab"
selections, and are a fast and efficient way to access data in the file. The
following slicing arguments are recognized:

 * Indices: anything that can be converted to a Python integer
 * Slices (i.e. ``[:]`` or ``[0:10]``)
 * Field names, in the case of compound data
 * At most one ``Ellipsis`` (``...``) object
 * An empty tuple (``()``) to retrieve all data or `scalar` data

Here are a few examples (output omitted).

    >>> dset = f.create_dataset("MyDataset", (10,10,10), 'f')
    >>> dset[0,0,0]
    >>> dset[0,2:10,1:9:3]
    >>> dset[:,::2,5]
    >>> dset[0]
    >>> dset[1,5]
    >>> dset[0,...]
    >>> dset[...,6]
    >>> dset[()]

There's more documentation on what parts of NumPy's
:ref:`fancy indexing <dataset_fancy>` are available in h5py.

For compound data, it is advised to separate field names from the
numeric slices::

    >>> dset.fields("FieldA")[:10]    # Read a single field
    >>> dset[:10]["FieldA"]           # Read all fields, select in NumPy

It is also possible to mix indexing and field names (``dset[:10, "FieldA"]``),
but this might be removed in a future version of h5py.

To retrieve the contents of a `scalar` dataset, you can use the same
syntax as in NumPy: ``result = dset[()]``. In other words, index into
the dataset using an empty tuple.
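
For instance, here is a short sketch that puts the compound-field and scalar
patterns together. The dataset names, field names and values are invented for
the example, and ``f`` is assumed to be an open, writable file::

    >>> comp_type = np.dtype([("FieldA", "f8"), ("FieldB", "i8")])
    >>> records = np.zeros(10, dtype=comp_type)
    >>> cdset = f.create_dataset("compound_example", data=records)
    >>> field_a = cdset.fields("FieldA")[:5]    # read one field only
    >>> sdset = f.create_dataset("scalar_example", data=42.0)
    >>> sdset.shape
    ()
    >>> value = sdset[()]                       # read the scalar value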

For simple slicing, broadcasting is supported:

    >>> dset[0,:,:] = np.arange(10)  # Broadcasts to (10,10)

Broadcasting is implemented using repeated hyperslab selections, and is
safe to use with very large target selections. It is supported for the above
"simple" (integer, slice and ellipsis) slicing only.

.. warning::
   Currently h5py does not support nested compound types, see :issue:`1197` for
   more information.

Multiple indexing
~~~~~~~~~~~~~~~~~

Indexing a dataset once loads a NumPy array into memory.
If you try to index it twice to write data, you may be surprised that nothing
seems to have happened:

    >>> f = h5py.File('my_hdf5_file.h5', 'w')
    >>> dset = f.create_dataset("test", (2, 2))
    >>> dset[0][1] = 3.0  # No effect!
    >>> print(dset[0][1])
    0.0

The assignment above only modifies the loaded array. It's equivalent to this:

    >>> new_array = dset[0]
    >>> new_array[1] = 3.0
    >>> print(new_array[1])
    3.0
    >>> print(dset[0][1])
    0.0

To write to the dataset, combine the indexes in a single step:

    >>> dset[0, 1] = 3.0
    >>> print(dset[0, 1])
    3.0

.. _dataset_iter:

Length and iteration
~~~~~~~~~~~~~~~~~~~~

As with NumPy arrays, the ``len()`` of a dataset is the length of the first
axis, and iterating over a dataset iterates over the first axis. However,
modifications to the yielded data are not recorded in the file. Resizing a
dataset while iterating has undefined results.

On 32-bit platforms, ``len(dataset)`` will fail if the first axis is bigger
than 2**32. It's recommended to use :meth:`Dataset.len` for large datasets.

.. _dataset_chunks:

Chunked storage
---------------

An HDF5 dataset created with the default settings will be `contiguous`; in
other words, laid out on disk in traditional C order. Datasets may also be
created using HDF5's `chunked` storage layout. This means the dataset is
divided up into regularly-sized pieces which are stored haphazardly on disk,
and indexed using a B-tree.

Chunked storage makes it possible to resize datasets, and because the data
is stored in fixed-size chunks, to use compression filters.

To enable chunked storage, set the keyword ``chunks`` to a tuple indicating
the chunk shape::

    >>> dset = f.create_dataset("chunked", (1000, 1000), chunks=(100, 100))

Data will be read and written in blocks with shape (100,100); for example,
the data in ``dset[0:100,0:100]`` will be stored together in the file, as will
the data points in range ``dset[400:500, 100:200]``.

Chunking has performance implications. It's recommended to keep the total
size of your chunks between 10 KiB and 1 MiB, larger for larger datasets.
Also keep in mind that when any element in a chunk is accessed, the entire
chunk is read from disk.

Since picking a chunk shape can be confusing, you can have h5py guess a chunk
shape for you::

    >>> dset = f.create_dataset("autochunk", (1000, 1000), chunks=True)

Auto-chunking is also enabled when using compression or ``maxshape``, etc.,
if a chunk shape is not manually specified.

The ``iter_chunks`` method returns an iterator that can be used to perform
chunk-by-chunk reads or writes::

    >>> for s in dset.iter_chunks():
    ...     arr = dset[s]  # get numpy array for chunk
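
As an illustration of chunk-by-chunk access, the following sketch sums a large
chunked dataset without ever loading the whole thing into memory. The dataset
name and sizes are invented for the example, and ``f`` is assumed to be an
open, writable file::

    >>> big = f.create_dataset("big_chunked", (4000, 4000), dtype='f8',
    ...                        chunks=(500, 500))
    >>> total = 0.0
    >>> for chunk_slice in big.iter_chunks():
    ...     total += big[chunk_slice].sum()  # reads one chunk at a time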

.. _dataset_resize:

Resizable datasets
------------------

In HDF5, datasets can be resized once created, up to a maximum size, by
calling :meth:`Dataset.resize`. You specify this maximum size when creating
the dataset, via the keyword ``maxshape``::

    >>> dset = f.create_dataset("resizable", (10, 10), maxshape=(500, 20))

Any (or all) axes may also be marked as "unlimited", in which case they may
be increased up to the HDF5 per-axis limit of 2**64 elements. Indicate these
axes using ``None``::

    >>> dset = f.create_dataset("unlimited", (10, 10), maxshape=(None, 10))

.. note:: Resizing an array with existing data works differently than in NumPy; if
   any axis shrinks, the data in the missing region is discarded. Data does
   not "rearrange" itself as it does when resizing a NumPy array.


.. _dataset_compression:

Filter pipeline
---------------

Chunked data may be transformed by the HDF5 `filter pipeline`. The most
common use is applying transparent compression. Data is compressed on the
way to disk, and automatically decompressed when read. Once the dataset
is created with a particular compression filter applied, data may be read
and written as normal with no special steps required.

Enable compression with the ``compression`` keyword to
:meth:`Group.create_dataset`::

    >>> dset = f.create_dataset("zipped", (100, 100), compression="gzip")

Options for each filter may be specified with ``compression_opts``::

    >>> dset = f.create_dataset("zipped_max", (100, 100), compression="gzip", compression_opts=9)

Lossless compression filters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

GZIP filter (``"gzip"``)
    Available with every installation of HDF5, so it's best where portability is
    required. Good compression, moderate speed. ``compression_opts`` sets the
    compression level and may be an integer from 0 to 9, default is 4.


LZF filter (``"lzf"``)
    Available with every installation of h5py (C source code also available).
    Low to moderate compression, very fast. No options.


SZIP filter (``"szip"``)
    Patent-encumbered filter used in the NASA community. Not available with all
    installations of HDF5 due to legal reasons. Consult the HDF5 docs for filter
    options.

Custom compression filters
~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to the compression filters listed above, compression filters can be
dynamically loaded by the underlying HDF5 library. This is done by passing a
filter number to :meth:`Group.create_dataset` as the ``compression`` parameter.
The ``compression_opts`` parameter will then be passed to this filter.

.. seealso::

    `hdf5plugin <https://pypi.org/project/hdf5plugin/>`_
        A Python package of several popular filters, including Blosc, LZ4 and ZFP,
        for convenient use with h5py

    `HDF5 Filter Plugins <https://portal.hdfgroup.org/display/support/HDF5+Filter+Plugins>`_
        A collection of filters as a single download from The HDF Group

    `Registered filter plugins <https://portal.hdfgroup.org/display/support/Filters>`_
        The index of publicly announced filter plugins

.. note:: The underlying implementation of the compression filter will have the
   ``H5Z_FLAG_OPTIONAL`` flag set. This indicates that if the compression
   filter doesn't compress a block while writing, no error will be thrown. The
   filter will then be skipped when subsequently reading the block.
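
As a concrete sketch of the dynamically loaded filter mechanism, the example
below requests a filter by its registered number. It assumes the LZ4 plugin
(registered HDF5 filter id 32004) is installed where the HDF5 library can find
it; neither the plugin nor the id is shipped with h5py itself::

    >>> LZ4_FILTER = 32004  # registered filter id for LZ4 (assumed installed)
    >>> dset = f.create_dataset("lz4_compressed", (1000, 1000),
    ...                         chunks=(100, 100), compression=LZ4_FILTER)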

.. _dataset_scaleoffset:

Scale-Offset filter
~~~~~~~~~~~~~~~~~~~

Filters enabled with the ``compression`` keyword are *lossless*; what comes
out of the dataset is exactly what you put in. HDF5 also includes a lossy
filter which trades precision for storage space.

It works with integer and floating-point data only. Enable the scale-offset
filter by setting the :meth:`Group.create_dataset` keyword ``scaleoffset`` to
an integer.

For integer data, this specifies the number of bits to retain. Set to 0 to have
HDF5 automatically compute the number of bits required for lossless compression
of the chunk. For floating-point data, it indicates the number of digits after
the decimal point to retain.

.. warning::
   Currently the scale-offset filter does not preserve special float values
   (i.e. NaN, inf), see
   https://forum.hdfgroup.org/t/scale-offset-filter-and-special-float-values-nan-infinity/3379
   for more information and follow-up.


.. _dataset_shuffle:

Shuffle filter
~~~~~~~~~~~~~~

Block-oriented compressors like GZIP or LZF work better when presented with
runs of similar values. Enabling the shuffle filter rearranges the bytes in
the chunk and may improve compression ratio. No significant speed penalty,
lossless.

Enable by setting the :meth:`Group.create_dataset` keyword ``shuffle`` to True.


.. _dataset_fletcher32:

Fletcher32 filter
~~~~~~~~~~~~~~~~~

Adds a checksum to each chunk to detect data corruption. Attempts to read
corrupted chunks will fail with an error. No significant speed penalty.
Obviously shouldn't be used with lossy compression filters.

Enable by setting the :meth:`Group.create_dataset` keyword ``fletcher32`` to True.

.. _dataset_multi_block:

Multi-Block Selection
---------------------

The full H5Sselect_hyperslab API is exposed via the ``MultiBlockSlice`` object.
This takes four elements to define the selection (start, count, stride and
block), in contrast to the built-in slice object, which takes three elements.
A ``MultiBlockSlice`` can be used in place of a slice to select a number of
(count) blocks of multiple elements separated by a stride, rather than a set
of single elements separated by a step.

For an explanation of how this slicing works, see the
`HDF5 documentation <https://support.hdfgroup.org/HDF5/Tutor/selectsimple.html>`_.

For example::

    >>> dset[...]
    array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
    >>> dset[MultiBlockSlice(start=1, count=3, stride=4, block=2)]
    array([ 1,  2,  5,  6,  9, 10])

They can be used in multi-dimensional slices alongside any slicing object,
including other ``MultiBlockSlice`` objects; a small sketch follows below. For
a more complete example of this, see the ``multiblockslice_interleave.py``
example script.
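
For instance, a ``MultiBlockSlice`` can be combined with an ordinary slice on
another axis. This is a minimal sketch; the dataset name and sizes are invented
for the example, and ``MultiBlockSlice`` is assumed to have been imported from
``h5py``::

    >>> grid = f.create_dataset("grid", (8, 10), dtype='i8')
    >>> rows = MultiBlockSlice(start=0, count=2, stride=4, block=2)  # rows 0, 1, 4, 5
    >>> grid[rows, ::2].shape  # every other column of the selected rows
    (4, 5)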

.. _dataset_fancy:

Fancy indexing
--------------

A subset of the NumPy fancy-indexing syntax is supported. Use this with
caution, as the underlying HDF5 mechanisms may have different performance
than you expect.

For any axis, you can provide an explicit list of points you want; for a
dataset with shape (10, 10)::

    >>> dset.shape
    (10, 10)
    >>> result = dset[0, [1,3,8]]
    >>> result.shape
    (3,)
    >>> result = dset[1:6, [5,8,9]]
    >>> result.shape
    (5, 3)

The following restrictions exist:

* Selection coordinates must be given in increasing order
* Duplicate selections are ignored
* Very long lists (> 1000 elements) may produce poor performance

NumPy boolean "mask" arrays can also be used to specify a selection. The
result of this operation is a 1-D array with elements arranged in the
standard NumPy (C-style) order. Behind the scenes, this generates a laundry
list of points to select, so be careful when using it with large masks::

    >>> arr = numpy.arange(100).reshape((10,10))
    >>> dset = f.create_dataset("MyDataset", data=arr)
    >>> result = dset[arr > 50]
    >>> result.shape
    (49,)

.. versionchanged:: 2.10
   Selecting using an empty list is now allowed.
   This returns an array with length 0 in the relevant dimension.

.. _dataset_empty:

Creating and Reading Empty (or Null) datasets and attributes
-------------------------------------------------------------

HDF5 has the concept of Empty or Null datasets and attributes. These are not
the same as an array with a shape of (), or a scalar dataspace in HDF5 terms.
Instead, it is a dataset with an associated type, no data, and no shape. In
h5py, we represent this as either a dataset with shape ``None``, or an
instance of ``h5py.Empty``. Empty datasets and attributes cannot be sliced.

To create an empty attribute, use ``h5py.Empty`` as per :ref:`attributes`::

    >>> obj.attrs["EmptyAttr"] = h5py.Empty("f")

Similarly, reading an empty attribute returns ``h5py.Empty``::

    >>> obj.attrs["EmptyAttr"]
    h5py.Empty(dtype="f")

Empty datasets can be created either by defining a ``dtype`` but no
``shape`` in ``create_dataset``::

    >>> grp.create_dataset("EmptyDataset", dtype="f")

or by setting ``data`` to an instance of ``h5py.Empty``::

    >>> grp.create_dataset("EmptyDataset", data=h5py.Empty("f"))

An empty dataset has its shape defined as ``None``, which is the best way of
determining whether a dataset is empty or not. An empty dataset can be "read" in
a similar way to scalar datasets, i.e. if ``empty_dataset`` is an empty
dataset::

    >>> empty_dataset[()]
    h5py.Empty(dtype="f")

The dtype of the dataset can be accessed via ``<dset>.dtype`` as per normal.
As empty datasets cannot be sliced, some methods of datasets such as
``read_direct`` will raise a ``TypeError`` exception if used on an empty dataset.
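
For instance, a quick way to tell an empty dataset apart from an ordinary one
is to test its shape. A short sketch, assuming an open group ``grp`` as above
(the dataset names are invented for the example)::

    >>> empty = grp.create_dataset("AnotherEmpty", dtype="f")
    >>> empty.shape is None
    True
    >>> regular = grp.create_dataset("Regular", (3,), dtype="f")
    >>> regular.shape is None
    False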

Reference
---------

.. class:: Dataset(identifier)

   Dataset objects are typically created via :meth:`Group.create_dataset`,
   or by retrieving existing datasets from a file. Call this constructor to
   create a new Dataset bound to an existing
   :class:`DatasetID <low:h5py.h5d.DatasetID>` identifier.

   .. method:: __getitem__(args)

      NumPy-style slicing to retrieve data. See :ref:`dataset_slicing`.

   .. method:: __setitem__(args)

      NumPy-style slicing to write data. See :ref:`dataset_slicing`.

   .. method:: __bool__()

      Check that the dataset is accessible.
      A dataset could be inaccessible for several reasons. For instance, the
      dataset, or the file it belongs to, may have been closed elsewhere.

      >>> f = h5py.File(filename)
      >>> dset = f["MyDS"]
      >>> f.close()
      >>> if dset:
      ...     print("dataset accessible")
      ... else:
      ...     print("dataset inaccessible")
      dataset inaccessible

   .. method:: read_direct(array, source_sel=None, dest_sel=None)

      Read from an HDF5 dataset directly into a NumPy array, which can
      avoid making an intermediate copy as happens with slicing. The
      destination array must be C-contiguous and writable, and must have
      a datatype to which the source data may be cast. Data type conversion
      will be carried out on the fly by HDF5.

      `source_sel` and `dest_sel` indicate the range of points in the
      dataset and destination array respectively. Use the output of
      ``numpy.s_[args]``::

          >>> dset = f.create_dataset("dset", (100,), dtype='int64')
          >>> arr = np.zeros((100,), dtype='int32')
          >>> dset.read_direct(arr, np.s_[0:10], np.s_[50:60])

   .. method:: write_direct(source, source_sel=None, dest_sel=None)

      Write data directly to HDF5 from a NumPy array.
      The source array must be C-contiguous. Selections must be
      the output of ``numpy.s_[<args>]``.
      Broadcasting is supported for simple indexing.
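
      For example, mirroring the ``read_direct`` example above, the following
      sketch writes the source array into a slice of the dataset (the dataset
      name and values are illustrative only)::

          >>> dset2 = f.create_dataset("dset2", (100,), dtype='int64')
          >>> arr = np.arange(10, dtype='int64')
          >>> dset2.write_direct(arr, np.s_[0:10], np.s_[50:60])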

   .. method:: astype(dtype)

      Return a wrapper allowing you to read data as a particular
      type. Conversion is handled by HDF5 directly, on the fly::

          >>> dset = f.create_dataset("bigint", (1000,), dtype='int64')
          >>> out = dset.astype('int16')[:]
          >>> out.dtype
          dtype('int16')

      .. versionchanged:: 3.0
         Allowed reading through the wrapper object. In earlier versions,
         :meth:`astype` had to be used as a context manager:

            >>> with dset.astype('int16'):
            ...     out = dset[:]

   .. method:: asstr(encoding=None, errors='strict')

      Only for string datasets. Returns a wrapper to read data as Python
      string objects::

          >>> s = dataset.asstr()[0]

      ``encoding`` and ``errors`` work like ``bytes.decode()``, but the default
      encoding is defined by the datatype - ASCII or UTF-8.
      This is not guaranteed to be correct.

      .. versionadded:: 3.0

   .. method:: fields(names)

      Get a wrapper to read a subset of fields from a compound data type::

          >>> coords_2d = dataset.fields(['x', 'y'])[:]

      If ``names`` is a string, a single field is extracted, and the resulting
      arrays will have that dtype. Otherwise, it should be an iterable,
      and the read data will have a compound dtype.

      .. versionadded:: 3.0

   .. method:: iter_chunks

      Iterate over chunks in a chunked dataset. The optional ``sel`` argument
      is a slice or tuple of slices that defines the region to be used.
      If not set, the entire dataspace will be used for the iterator.

      For each chunk within the given region, the iterator yields a tuple of
      slices that gives the intersection of the given chunk with the
      selection area. This can be used to :ref:`read or write data in that
      chunk <dataset_slicing>`.

      A TypeError will be raised if the dataset is not chunked.

      A ValueError will be raised if the selection region is invalid.

      .. versionadded:: 3.0

   .. method:: resize(size, axis=None)

      Change the shape of a dataset. `size` may be a tuple giving the new
      dataset shape, or an integer giving the new length of the specified
      `axis`.

      Datasets may be resized only up to :attr:`Dataset.maxshape`.

   .. method:: len()

      Return the size of the first axis.

   .. method:: make_scale(name='')

      Make this dataset an HDF5 :ref:`dimension scale <dimension_scales>`.

      You can then attach it to dimensions of other datasets like this::

          other_ds.dims[0].attach_scale(ds)

      You can optionally pass a name to associate with this scale.

   .. method:: virtual_sources

      If this dataset is a :doc:`virtual dataset </vds>`, return a list of
      named tuples: ``(vspace, file_name, dset_name, src_space)``,
      describing which parts of the dataset map to which source datasets.
      The two 'space' members are low-level
      :class:`SpaceID <low:h5py.h5s.SpaceID>` objects.

   .. attribute:: shape

      NumPy-style shape tuple giving dataset dimensions.

   .. attribute:: dtype

      NumPy dtype object giving the dataset's type.

   .. attribute:: size

      Integer giving the total number of elements in the dataset.

   .. attribute:: nbytes

      Integer giving the total number of bytes required to load the full
      dataset into RAM (i.e. ``dset[()]``).
      This may not be the amount of disk space occupied by the dataset,
      as datasets may be compressed when written or only partly filled with
      data. This value also does not include the array overhead, as it only
      describes the size of the data itself. Thus the real amount of RAM
      occupied by this dataset may be slightly greater.

      .. versionadded:: 3.0

   .. attribute:: ndim

      Integer giving the total number of dimensions in the dataset.

   .. attribute:: maxshape

      NumPy-style shape tuple indicating the maximum dimensions up to which
      the dataset may be resized. Axes with ``None`` are unlimited.

   .. attribute:: chunks

      Tuple giving the chunk shape, or None if chunked storage is not used.
      See :ref:`dataset_chunks`.

   .. attribute:: compression

      String with the currently applied compression filter, or None if
      compression is not enabled for this dataset. See :ref:`dataset_compression`.

   .. attribute:: compression_opts

      Options for the compression filter. See :ref:`dataset_compression`.

   .. attribute:: scaleoffset

      Setting for the HDF5 scale-offset filter (integer), or None if
      scale-offset compression is not used for this dataset.
      See :ref:`dataset_scaleoffset`.

   .. attribute:: shuffle

      Whether the shuffle filter is applied (T/F). See :ref:`dataset_shuffle`.

   .. attribute:: fletcher32

      Whether Fletcher32 checksumming is enabled (T/F). See :ref:`dataset_fletcher32`.

   .. attribute:: fillvalue

      Value used when reading uninitialized portions of the dataset, or None
      if no fill value has been defined, in which case HDF5 will use a
      type-appropriate default value. Can't be changed after the dataset is
      created.

   .. attribute:: external

      If this dataset is stored in one or more external files, this is a list
      of 3-tuples, like the ``external=`` parameter to
      :meth:`Group.create_dataset`. Otherwise, it is ``None``.

   .. attribute:: is_virtual

      True if this dataset is a :doc:`virtual dataset </vds>`, otherwise False.

   .. attribute:: dims

      Access to :ref:`dimension_scales`.

   .. attribute:: attrs

      :ref:`attributes` for this dataset.

   .. attribute:: id

      The dataset's low-level identifier; an instance of
      :class:`DatasetID <low:h5py.h5d.DatasetID>`.

   .. attribute:: ref

      An HDF5 object reference pointing to this dataset. See
      :ref:`refs_object`.

   .. attribute:: regionref

      Proxy object for creating HDF5 region references. See
      :ref:`refs_region`.

   .. attribute:: name

      String giving the full path to this dataset.

   .. attribute:: file

      :class:`File` instance in which this dataset resides.

   .. attribute:: parent

      :class:`Group` instance containing this dataset.