.. currentmodule:: h5py
.. _dataset:


Datasets
========

Datasets are very similar to NumPy arrays.  They are homogeneous collections of
data elements, with an immutable datatype and (hyper)rectangular shape.
Unlike NumPy arrays, they support a variety of transparent storage features
such as compression, error-detection, and chunked I/O.

They are represented in h5py by a thin proxy class which supports familiar
NumPy operations like slicing, along with a variety of descriptive attributes:

  - **shape** attribute
  - **size** attribute
  - **ndim** attribute
  - **dtype** attribute
  - **nbytes** attribute

h5py supports most NumPy dtypes, and uses the same character codes (e.g.
``'f'``, ``'i8'``) and dtype machinery as
`NumPy <https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html>`_.
See :ref:`faq` for the list of dtypes h5py supports.


.. _dataset_create:

Creating datasets
-----------------

New datasets are created using either :meth:`Group.create_dataset` or
:meth:`Group.require_dataset`.  Existing datasets should be retrieved using
the group indexing syntax (``dset = group["name"]``).

To initialise a dataset, all you have to do is specify a name, shape, and
optionally the data type (defaults to ``'f'``)::

    >>> dset = f.create_dataset("default", (100,))
    >>> dset = f.create_dataset("ints", (100,), dtype='i8')

.. note:: This is not the same as creating an :ref:`Empty dataset <dataset_empty>`.

You may also initialize the dataset to an existing NumPy array by providing the
``data`` parameter::

    >>> arr = np.arange(100)
    >>> dset = f.create_dataset("init", data=arr)

Keywords ``shape`` and ``dtype`` may be specified along with ``data``; if so,
they will override ``data.shape`` and ``data.dtype``.  It's required that
(1) the total number of points in ``shape`` match the total number of points
in ``data.shape``, and that (2) it's possible to cast ``data.dtype`` to
the requested ``dtype``.
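
For example, a flat array can be stored reshaped and cast in one step (a short
sketch; the dataset name is illustrative, and ``f`` is assumed to be a writable
:class:`File` as in the examples above)::

    >>> arr = np.arange(100, dtype='int64')
    >>> dset = f.create_dataset("reshaped", data=arr, shape=(10, 10), dtype='f4')
    >>> dset.shape, dset.dtype        # 100 points total, cast from int64 to float32
    ((10, 10), dtype('float32'))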

.. _dataset_slicing:

Reading & writing data
----------------------

HDF5 datasets re-use the NumPy slicing syntax to read and write to the file.
Slice specifications are translated directly to HDF5 "hyperslab"
selections, and are a fast and efficient way to access data in the file. The
following slicing arguments are recognized:

    * Indices: anything that can be converted to a Python integer
    * Slices (i.e. ``[:]`` or ``[0:10]``)
    * Field names, in the case of compound data
    * At most one ``Ellipsis`` (``...``) object
    * An empty tuple (``()``) to retrieve all data or `scalar` data

Here are a few examples (output omitted).

    >>> dset = f.create_dataset("MyDataset", (10,10,10), 'f')
    >>> dset[0,0,0]
    >>> dset[0,2:10,1:9:3]
    >>> dset[:,::2,5]
    >>> dset[0]
    >>> dset[1,5]
    >>> dset[0,...]
    >>> dset[...,6]
    >>> dset[()]

See :ref:`fancy indexing <dataset_fancy>` for details of which parts of NumPy's
fancy indexing syntax are available in h5py.

For compound data, it is advised to separate field names from the
numeric slices::

    >>> dset.fields("FieldA")[:10]   # Read a single field
    >>> dset[:10]["FieldA"]          # Read all fields, select in NumPy

It is also possible to mix indexing and field names (``dset[:10, "FieldA"]``),
but this might be removed in a future version of h5py.

To retrieve the contents of a `scalar` dataset, you can use the same
syntax as in NumPy:  ``result = dset[()]``.  In other words, index into
the dataset using an empty tuple.
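
For example, a `scalar` dataset created from a single value has shape ``()``
(a minimal sketch; the dataset name is illustrative)::

    >>> scalar = f.create_dataset("scalar", data=42.0)
    >>> scalar.shape
    ()
    >>> value = scalar[()]    # a zero-dimensional read returns a NumPy scalar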

For simple slicing, broadcasting is supported:

    >>> dset[0,:,:] = np.arange(10)  # Broadcasts to (10,10)

Broadcasting is implemented using repeated hyperslab selections, and is
safe to use with very large target selections.  It is supported for the above
"simple" (integer, slice and ellipsis) slicing only.

.. warning::
   Currently h5py does not support nested compound types, see :issue:`1197` for
   more information.

Multiple indexing
~~~~~~~~~~~~~~~~~

Indexing a dataset once loads a NumPy array into memory.
If you try to index it twice to write data, you may be surprised that nothing
seems to have happened:

   >>> f = h5py.File('my_hdf5_file.h5', 'w')
   >>> dset = f.create_dataset("test", (2, 2))
   >>> dset[0][1] = 3.0  # No effect!
   >>> print(dset[0][1])
   0.0

The assignment above only modifies the loaded array. It's equivalent to this:

   >>> new_array = dset[0]
   >>> new_array[1] = 3.0
   >>> print(new_array[1])
   3.0
   >>> print(dset[0][1])
   0.0

To write to the dataset, combine the indexes in a single step:

   >>> dset[0, 1] = 3.0
   >>> print(dset[0, 1])
   3.0

.. _dataset_iter:

Length and iteration
~~~~~~~~~~~~~~~~~~~~

As with NumPy arrays, the ``len()`` of a dataset is the length of the first
axis, and iterating over a dataset iterates over the first axis.  However,
modifications to the yielded data are not recorded in the file.  Resizing a
dataset while iterating has undefined results.

On 32-bit platforms, ``len(dataset)`` will fail if the first axis is bigger
than 2**32. It's recommended to use :meth:`Dataset.len` for large datasets.
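
For example (a brief sketch; the dataset name is illustrative)::

    >>> dset = f.create_dataset("rows", data=np.arange(12).reshape(4, 3))
    >>> len(dset)                 # length of the first axis
    4
    >>> for row in dset:          # iterates over the first axis
    ...     print(row.shape)
    (3,)
    (3,)
    (3,)
    (3,)
    >>> dset.len()                # works even for very long first axes on 32-bit builds
    4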

.. _dataset_chunks:

Chunked storage
---------------

An HDF5 dataset created with the default settings will be `contiguous`; in
other words, laid out on disk in traditional C order.  Datasets may also be
created using HDF5's `chunked` storage layout.  This means the dataset is
divided up into regularly-sized pieces which are stored haphazardly on disk,
and indexed using a B-tree.

Chunked storage makes it possible to resize datasets, and because the data
is stored in fixed-size chunks, to use compression filters.

To enable chunked storage, set the keyword ``chunks`` to a tuple indicating
the chunk shape::

    >>> dset = f.create_dataset("chunked", (1000, 1000), chunks=(100, 100))

Data will be read and written in blocks with shape (100,100); for example,
the data in ``dset[0:100,0:100]`` will be stored together in the file, as will
the data points in range ``dset[400:500, 100:200]``.

Chunking has performance implications.  It's recommended to keep the total
size of your chunks between 10 KiB and 1 MiB, larger for larger datasets.
Also keep in mind that when any element in a chunk is accessed, the entire
chunk is read from disk.

Since picking a chunk shape can be confusing, you can have h5py guess a chunk
shape for you::

    >>> dset = f.create_dataset("autochunk", (1000, 1000), chunks=True)

Auto-chunking is also enabled when using compression or ``maxshape``, etc.,
if a chunk shape is not manually specified.
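
For example (a short sketch)::

    >>> dset = f.create_dataset("gzipped", (1000, 1000), compression="gzip")
    >>> dset.chunks is not None    # a chunk shape was chosen automatically
    True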

The :meth:`Dataset.iter_chunks` method returns an iterator that can be used to
perform chunk by chunk reads or writes::

    >>> for s in dset.iter_chunks():
    ...     arr = dset[s]  # get numpy array for chunk


.. _dataset_resize:

Resizable datasets
------------------

In HDF5, datasets can be resized after creation, up to a maximum size,
by calling :meth:`Dataset.resize`.  You specify this maximum size when creating
the dataset, via the keyword ``maxshape``::

    >>> dset = f.create_dataset("resizable", (10,10), maxshape=(500, 20))

Any (or all) axes may also be marked as "unlimited", in which case they may
be increased up to the HDF5 per-axis limit of 2**64 elements.  Indicate these
axes using ``None``::

    >>> dset = f.create_dataset("unlimited", (10, 10), maxshape=(None, 10))

.. note:: Resizing an array with existing data works differently than in NumPy; if
    any axis shrinks, the data in the missing region is discarded.  Data does
    not "rearrange" itself as it does when resizing a NumPy array.
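
For example (a brief sketch; note how shrinking discards data)::

    >>> dset = f.create_dataset("growable", (10, 10), maxshape=(None, 10))
    >>> dset[:] = np.arange(100).reshape(10, 10)
    >>> dset.resize((20, 10))      # grow the first axis; new rows take the fill value
    >>> dset.shape
    (20, 10)
    >>> dset.resize(5, axis=0)     # shrink it again; data in rows 5..19 is discarded
    >>> dset.shape
    (5, 10)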


.. _dataset_compression:

Filter pipeline
---------------

Chunked data may be transformed by the HDF5 `filter pipeline`.  The most
common use is applying transparent compression.  Data is compressed on the
way to disk, and automatically decompressed when read.  Once the dataset
is created with a particular compression filter applied, data may be read
and written as normal with no special steps required.

Enable compression with the ``compression`` keyword to
:meth:`Group.create_dataset`::

    >>> dset = f.create_dataset("zipped", (100, 100), compression="gzip")

Options for each filter may be specified with ``compression_opts``::

    >>> dset = f.create_dataset("zipped_max", (100, 100), compression="gzip", compression_opts=9)

Lossless compression filters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

GZIP filter (``"gzip"``)
    Available with every installation of HDF5, so it's best where portability is
    required.  Good compression, moderate speed.  ``compression_opts`` sets the
    compression level and may be an integer from 0 to 9, default is 4.


LZF filter (``"lzf"``)
    Available with every installation of h5py (C source code also available).
    Low to moderate compression, very fast.  No options; see the example after
    this list.


SZIP filter (``"szip"``)
    Patent-encumbered filter used in the NASA community.  Not available with all
    installations of HDF5 due to legal reasons.  Consult the HDF5 docs for filter
    options.
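
For instance, the LZF filter needs no options at all (a minimal sketch; the
dataset name is illustrative)::

    >>> data = np.random.random((1000, 1000))
    >>> dset = f.create_dataset("lzf_compressed", data=data, compression="lzf")
    >>> dset.compression
    'lzf'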

Custom compression filters
~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to the compression filters listed above, compression filters can be
dynamically loaded by the underlying HDF5 library. This is done by passing a
filter number to :meth:`Group.create_dataset` as the ``compression`` parameter.
The ``compression_opts`` parameter will then be passed to this filter.
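
A hypothetical sketch, assuming a dynamically loadable plugin registered under
filter number 32015 (commonly the Zstandard plugin) is installed; any installed
plugin's number works the same way::

    >>> # compression_opts is handed to the plugin unchanged
    >>> dset = f.create_dataset("zstd_data", (100, 100),
    ...                         compression=32015, compression_opts=(3,))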

.. seealso::

   `hdf5plugin <https://pypi.org/project/hdf5plugin/>`_
     A Python package of several popular filters, including Blosc, LZ4 and ZFP,
     for convenient use with h5py

   `HDF5 Filter Plugins <https://portal.hdfgroup.org/display/support/HDF5+Filter+Plugins>`_
     A collection of filters as a single download from The HDF Group

   `Registered filter plugins <https://portal.hdfgroup.org/display/support/Filters>`_
     The index of publicly announced filter plugins

.. note:: The underlying implementation of the compression filter will have the
    ``H5Z_FLAG_OPTIONAL`` flag set. This indicates that if the compression
    filter doesn't compress a block while writing, no error will be thrown. The
    filter will then be skipped when subsequently reading the block.


.. _dataset_scaleoffset:

Scale-Offset filter
~~~~~~~~~~~~~~~~~~~

Filters enabled with the ``compression`` keyword are *lossless*; what comes
out of the dataset is exactly what you put in.  HDF5 also includes a lossy
filter which trades precision for storage space.

The scale-offset filter works with integer and floating-point data only.
Enable it by setting the :meth:`Group.create_dataset` keyword ``scaleoffset``
to an integer.

For integer data, this specifies the number of bits to retain.  Set to 0 to have
HDF5 automatically compute the number of bits required for lossless compression
of the chunk.  For floating-point data, it indicates the number of digits after
the decimal point to retain.
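
For example (a brief sketch; dataset names are illustrative)::

    >>> # integer data: keep 4 bits per value
    >>> ints = f.create_dataset("scaled_ints", (100,), dtype='int32', scaleoffset=4)
    >>> # floating-point data: keep 3 digits after the decimal point
    >>> floats = f.create_dataset("scaled_floats", (100,), dtype='float64', scaleoffset=3)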

.. warning::
    Currently the scale-offset filter does not preserve special float values
    (i.e. NaN, inf), see
    https://forum.hdfgroup.org/t/scale-offset-filter-and-special-float-values-nan-infinity/3379
    for more information and follow-up.


.. _dataset_shuffle:

Shuffle filter
~~~~~~~~~~~~~~

Block-oriented compressors like GZIP or LZF work better when presented with
runs of similar values.  Enabling the shuffle filter rearranges the bytes in
the chunk and may improve compression ratio.  No significant speed penalty,
lossless.

Enable by setting :meth:`Group.create_dataset` keyword ``shuffle`` to True.
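
For example (a minimal sketch)::

    >>> dset = f.create_dataset("shuffled", (1000,), dtype='int32',
    ...                         compression="gzip", shuffle=True)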


.. _dataset_fletcher32:

Fletcher32 filter
~~~~~~~~~~~~~~~~~

Adds a checksum to each chunk to detect data corruption.  Attempts to read
corrupted chunks will fail with an error.  No significant speed penalty.
Obviously shouldn't be used with lossy compression filters.

Enable by setting :meth:`Group.create_dataset` keyword ``fletcher32`` to True.
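
For example (a minimal sketch)::

    >>> dset = f.create_dataset("checksummed", (1000,), dtype='int64',
    ...                         fletcher32=True)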

.. _dataset_multi_block:

Multi-Block Selection
---------------------

The full ``H5Sselect_hyperslab`` API is exposed via the ``MultiBlockSlice``
object.  This takes four elements to define the selection (start, count, stride
and block) in contrast to the built-in slice object, which takes three elements.
A ``MultiBlockSlice`` can be used in place of a slice to select a number of
(count) blocks of multiple elements separated by a stride, rather than a set of
single elements separated by a step.

For an explanation of how this slicing works, see the `HDF5 documentation <https://support.hdfgroup.org/HDF5/Tutor/selectsimple.html>`_.

For example::

    >>> dset[...]
    array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
    >>> dset[MultiBlockSlice(start=1, count=3, stride=4, block=2)]
    array([ 1,  2,  5,  6,  9, 10])

They can be used in multi-dimensional slices alongside any slicing object,
including other MultiBlockSlices. For a more complete example of this,
see the ``multiblockslice_interleave.py`` example script.
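
A short sketch of mixing a ``MultiBlockSlice`` with an ordinary slice on a
two-dimensional dataset (the dataset name is illustrative)::

    >>> grid = f.create_dataset("grid", data=np.arange(64).reshape(8, 8))
    >>> # rows 0-1 and 4-5 (two blocks of two), columns 0-2
    >>> grid[MultiBlockSlice(start=0, count=2, stride=4, block=2), 0:3].shape
    (4, 3)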

.. _dataset_fancy:

Fancy indexing
--------------

A subset of the NumPy fancy-indexing syntax is supported.  Use this with
caution, as the underlying HDF5 mechanisms may have different performance
than you expect.

For any axis, you can provide an explicit list of points you want; for a
dataset with shape (10, 10)::

    >>> dset.shape
    (10, 10)
    >>> result = dset[0, [1,3,8]]
    >>> result.shape
    (3,)
    >>> result = dset[1:6, [5,8,9]]
    >>> result.shape
    (5, 3)

The following restrictions exist:

* Selection coordinates must be given in increasing order
* Duplicate selections are ignored
* Very long lists (> 1000 elements) may produce poor performance

NumPy boolean "mask" arrays can also be used to specify a selection.  The
result of this operation is a 1-D array with elements arranged in the
standard NumPy (C-style) order.  Behind the scenes, this generates a laundry
list of points to select, so be careful when using it with large masks::

    >>> arr = numpy.arange(100).reshape((10,10))
    >>> dset = f.create_dataset("MyDataset", data=arr)
    >>> result = dset[arr > 50]
    >>> result.shape
    (49,)

.. versionchanged:: 2.10
   Selecting using an empty list is now allowed.
   This returns an array with length 0 in the relevant dimension.

.. _dataset_empty:

Creating and Reading Empty (or Null) datasets and attributes
------------------------------------------------------------

HDF5 has the concept of Empty or Null datasets and attributes. These are not
the same as an array with a shape of (), or a scalar dataspace in HDF5 terms.
Instead, it is a dataset with an associated type, no data, and no shape. In
h5py, we represent this as either a dataset with shape ``None``, or an
instance of ``h5py.Empty``. Empty datasets and attributes cannot be sliced.

To create an empty attribute, use ``h5py.Empty`` as per :ref:`attributes`::

    >>> obj.attrs["EmptyAttr"] = h5py.Empty("f")

Similarly, reading an empty attribute returns ``h5py.Empty``::

    >>> obj.attrs["EmptyAttr"]
    h5py.Empty(dtype="f")

Empty datasets can be created either by defining a ``dtype`` but no
``shape`` in ``create_dataset``::

    >>> grp.create_dataset("EmptyDataset", dtype="f")

or by setting ``data`` to an instance of ``h5py.Empty``::

    >>> grp.create_dataset("EmptyDataset", data=h5py.Empty("f"))

An empty dataset has shape defined as ``None``, which is the best way of
determining whether a dataset is empty or not. An empty dataset can be "read" in
a similar way to scalar datasets, i.e. if ``empty_dataset`` is an empty
dataset::

    >>> empty_dataset[()]
    h5py.Empty(dtype="f")

The dtype of the dataset can be accessed via ``<dset>.dtype`` as per normal.
As empty datasets cannot be sliced, some methods of datasets such as
``read_direct`` will raise a ``TypeError`` exception if used on an empty dataset.
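
For example, checking ``shape`` distinguishes an empty dataset from a scalar
one (a brief sketch)::

    >>> empty_dataset.shape is None
    True
    >>> empty_dataset.dtype
    dtype('float32')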

Reference
---------

.. class:: Dataset(identifier)

    Dataset objects are typically created via :meth:`Group.create_dataset`,
    or by retrieving existing datasets from a file.  Call this constructor to
    create a new Dataset bound to an existing
    :class:`DatasetID <low:h5py.h5d.DatasetID>` identifier.

    .. method:: __getitem__(args)

        NumPy-style slicing to retrieve data.  See :ref:`dataset_slicing`.

    .. method:: __setitem__(args)

        NumPy-style slicing to write data.  See :ref:`dataset_slicing`.

    .. method:: __bool__()

        Check that the dataset is accessible.
        A dataset could be inaccessible for several reasons. For instance, the
        dataset, or the file it belongs to, may have been closed elsewhere.

        >>> f = h5py.File(filename)
        >>> dset = f["MyDS"]
        >>> f.close()
        >>> if dset:
        ...     print("dataset accessible")
        ... else:
        ...     print("dataset inaccessible")
        dataset inaccessible

    .. method:: read_direct(array, source_sel=None, dest_sel=None)

        Read from an HDF5 dataset directly into a NumPy array, which can
        avoid making an intermediate copy as happens with slicing. The
        destination array must be C-contiguous and writable, and must have
        a datatype to which the source data may be cast.  Data type conversion
        will be carried out on the fly by HDF5.

        `source_sel` and `dest_sel` indicate the range of points in the
        dataset and destination array respectively.  Use the output of
        ``numpy.s_[args]``::

            >>> dset = f.create_dataset("dset", (100,), dtype='int64')
            >>> arr = np.zeros((100,), dtype='int32')
            >>> dset.read_direct(arr, np.s_[0:10], np.s_[50:60])

    .. method:: write_direct(source, source_sel=None, dest_sel=None)

        Write data directly to HDF5 from a NumPy array.
        The source array must be C-contiguous.  Selections must be
        the output of ``numpy.s_[args]``.
        Broadcasting is supported for simple indexing.
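
        A short sketch mirroring the ``read_direct`` example above (names are
        illustrative)::

            >>> dset = f.create_dataset("dset_w", (100,), dtype='int64')
            >>> arr = np.arange(10, dtype='int64')
            >>> # copy arr[0:10] into dset_w[50:60]
            >>> dset_w.write_direct(arr, np.s_[0:10], np.s_[50:60])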


    .. method:: astype(dtype)

        Return a wrapper allowing you to read data as a particular
        type.  Conversion is handled by HDF5 directly, on the fly::

            >>> dset = f.create_dataset("bigint", (1000,), dtype='int64')
            >>> out = dset.astype('int16')[:]
            >>> out.dtype
            dtype('int16')

        .. versionchanged:: 3.0
           Allowed reading through the wrapper object. In earlier versions,
           :meth:`astype` had to be used as a context manager:

               >>> with dset.astype('int16'):
               ...     out = dset[:]

    .. method:: asstr(encoding=None, errors='strict')

       Only for string datasets. Returns a wrapper to read data as Python
       string objects::

           >>> s = dataset.asstr()[0]

       ``encoding`` and ``errors`` work like ``bytes.decode()``, but the default
       encoding is defined by the datatype - ASCII or UTF-8.
       This is not guaranteed to be correct.

       .. versionadded:: 3.0

    .. method:: fields(names)

        Get a wrapper to read a subset of fields from a compound data type::

            >>> coords_2d = dataset.fields(['x', 'y'])[:]

        If names is a string, a single field is extracted, and the resulting
        arrays will have that dtype. Otherwise, it should be an iterable,
        and the read data will have a compound dtype.

        .. versionadded:: 3.0

    .. method:: iter_chunks(sel=None)

       Iterate over chunks in a chunked dataset. The optional ``sel`` argument
       is a slice or tuple of slices that defines the region to be used.
       If not set, the entire dataspace will be used for the iterator.

       For each chunk within the given region, the iterator yields a tuple of
       slices that gives the intersection of the given chunk with the
       selection area. This can be used to :ref:`read or write data in that
       chunk <dataset_slicing>`.

       A TypeError will be raised if the dataset is not chunked.

       A ValueError will be raised if the selection region is invalid.

       .. versionadded:: 3.0

    .. method:: resize(size, axis=None)

        Change the shape of a dataset.  `size` may be a tuple giving the new
        dataset shape, or an integer giving the new length of the specified
        `axis`.

        Datasets may be resized only up to :attr:`Dataset.maxshape`.

    .. method:: len()

        Return the size of the first axis.

    .. method:: make_scale(name='')

       Make this dataset an HDF5 :ref:`dimension scale <dimension_scales>`.

       You can then attach it to dimensions of other datasets like this::

           other_ds.dims[0].attach_scale(ds)

       You can optionally pass a name to associate with this scale.

    .. method:: virtual_sources

       If this dataset is a :doc:`virtual dataset </vds>`, return a list of
       named tuples: ``(vspace, file_name, dset_name, src_space)``,
       describing which parts of the dataset map to which source datasets.
       The two 'space' members are low-level
       :class:`SpaceID <low:h5py.h5s.SpaceID>` objects.

    .. attribute:: shape

        NumPy-style shape tuple giving dataset dimensions.

    .. attribute:: dtype

        NumPy dtype object giving the dataset's type.

    .. attribute:: size

        Integer giving the total number of elements in the dataset.

    .. attribute:: nbytes

        Integer giving the total number of bytes required to load the full
        dataset into RAM (i.e. ``dset[()]``).
        This may not be the amount of disk space occupied by the dataset,
        as datasets may be compressed when written or only partly filled with data.
        This value also does not include the array overhead, as it only describes
        the size of the data itself.
        Thus the real amount of RAM occupied by this dataset may be slightly greater.

        .. versionadded:: 3.0

    .. attribute:: ndim

        Integer giving the total number of dimensions in the dataset.

    .. attribute:: maxshape

        NumPy-style shape tuple indicating the maximum dimensions up to which
        the dataset may be resized.  Axes with ``None`` are unlimited.

    .. attribute:: chunks

        Tuple giving the chunk shape, or None if chunked storage is not used.
        See :ref:`dataset_chunks`.

    .. attribute:: compression

        String with the currently applied compression filter, or None if
        compression is not enabled for this dataset.  See :ref:`dataset_compression`.

    .. attribute:: compression_opts

        Options for the compression filter.  See :ref:`dataset_compression`.

    .. attribute:: scaleoffset

        Setting for the HDF5 scale-offset filter (integer), or None if
        scale-offset compression is not used for this dataset.
        See :ref:`dataset_scaleoffset`.

    .. attribute:: shuffle

        Whether the shuffle filter is applied (T/F).  See :ref:`dataset_shuffle`.

    .. attribute:: fletcher32

        Whether Fletcher32 checksumming is enabled (T/F).  See :ref:`dataset_fletcher32`.

    .. attribute:: fillvalue

        Value used when reading uninitialized portions of the dataset, or None
        if no fill value has been defined, in which case HDF5 will use a
        type-appropriate default value.  Can't be changed after the dataset is
        created.

    .. attribute:: external

       If this dataset is stored in one or more external files, this is a list
       of 3-tuples, like the ``external=`` parameter to
       :meth:`Group.create_dataset`. Otherwise, it is ``None``.

    .. attribute:: is_virtual

       True if this dataset is a :doc:`virtual dataset </vds>`, otherwise False.

    .. attribute:: dims

        Access to :ref:`dimension_scales`.

    .. attribute:: attrs

        :ref:`attributes` for this dataset.

    .. attribute:: id

        The dataset's low-level identifier; an instance of
        :class:`DatasetID <low:h5py.h5d.DatasetID>`.

    .. attribute:: ref

        An HDF5 object reference pointing to this dataset.  See
        :ref:`refs_object`.

    .. attribute:: regionref

        Proxy object for creating HDF5 region references.  See
        :ref:`refs_region`.

    .. attribute:: name

        String giving the full path to this dataset.

    .. attribute:: file

        :class:`File` instance in which this dataset resides.

    .. attribute:: parent

        :class:`Group` instance containing this dataset.