.. _astropy-visualization-hist:

***********************
Choosing Histogram Bins
***********************

The :mod:`astropy.visualization` module provides the
:func:`~astropy.visualization.hist` function, a generalization of
matplotlib's histogram function that allows more flexible specification
of histogram bins. To compute bins without the accompanying plot, see
:func:`astropy.stats.histogram`.

As motivation, consider the following two histograms, both constructed from
the same underlying sample of 5000 points: the first uses matplotlib's
default of 10 bins, the second an arbitrarily chosen 200 bins:

.. plot::
   :align: center
   :include-source:

    import numpy as np
    import matplotlib.pyplot as plt

    # generate some complicated data
    rng = np.random.default_rng(0)
    t = np.concatenate([-5 + 1.8 * rng.standard_cauchy(500),
                        -4 + 0.8 * rng.standard_cauchy(2000),
                        -1 + 0.3 * rng.standard_cauchy(500),
                        2 + 0.8 * rng.standard_cauchy(1000),
                        4 + 1.5 * rng.standard_cauchy(1000)])

    # truncate to a reasonable range
    t = t[(t > -15) & (t < 15)]

    # draw histograms with two different bin widths
    fig, ax = plt.subplots(1, 2, figsize=(10, 4))

    fig.subplots_adjust(left=0.1, right=0.95, bottom=0.15)
    for i, bins in enumerate([10, 200]):
        ax[i].hist(t, bins=bins, histtype='stepfilled', alpha=0.2, density=True)
        ax[i].set_xlabel('t')
        ax[i].set_ylabel('P(t)')
        ax[i].set_title(f'plt.hist(t, bins={bins})',
                        fontdict=dict(family='monospace'))

Visual inspection shows that each of these choices is suboptimal: with 10
bins, the fine structure of the data distribution is lost, while with 200
bins, the heights of individual bins are dominated by sampling error.
In practice, the bin count is usually chosen by trial and error, adjusting it
until the histogram falls somewhere between these two extremes.

Astropy's :func:`~astropy.visualization.hist` function addresses this by
providing several methods of automatically tuning the histogram bin size.
Its syntax is identical to that of matplotlib's ``plt.hist`` function, except
that the ``bins`` parameter also accepts the name of one of four automatic
bin-selection methods: ``'scott'``, ``'freedman'``, ``'knuth'``, or
``'blocks'``. These methods are implemented in
:func:`astropy.stats.histogram`, which has a syntax similar to that of the
``np.histogram`` function.
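
To obtain the bin counts and edges without drawing anything, the same
``bins`` strings can be passed to :func:`astropy.stats.histogram`. The
following is a minimal sketch; the synthetic sample and the variable names are
illustrative assumptions and are unrelated to the dataset plotted above:

.. code-block:: python

    import numpy as np
    from astropy.stats import histogram

    # illustrative sample: two overlapping normal components
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-2.0, 0.5, 600),
                        rng.normal(1.0, 1.5, 400)])

    # like np.histogram, this returns the counts and the bin edges
    counts, edges = histogram(x, bins='scott')

    print(len(counts), 'bins of width', edges[1] - edges[0])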

Normal Reference Rules
======================

The simplest methods of tuning the number of bins are the normal reference
rules due to Scott (implemented in :func:`~astropy.stats.scott_bin_width`) and
Freedman & Diaconis (implemented in :func:`~astropy.stats.freedman_bin_width`).
These rules assume the data are close to normally distributed and apply a
rule of thumb intended to minimize the difference between the histogram and
the underlying distribution of the data.
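
For reference, both rules reduce to short closed-form expressions for the bin
width: Scott's rule uses :math:`3.5\,\sigma / n^{1/3}` and the
Freedman-Diaconis rule uses :math:`2\,\mathrm{IQR} / n^{1/3}`, where
:math:`n` is the number of points. The sketch below evaluates these textbook
formulas directly with NumPy. The helper names ``scott_width`` and
``freedman_width`` are hypothetical; they are not the Astropy functions, whose
implementations may differ in detail:

.. code-block:: python

    import numpy as np

    def scott_width(data):
        """Scott's rule: 3.5 * sigma / n**(1/3)."""
        data = np.asarray(data, dtype=float)
        return 3.5 * np.std(data) / data.size ** (1 / 3)

    def freedman_width(data):
        """Freedman-Diaconis rule: 2 * IQR / n**(1/3)."""
        data = np.asarray(data, dtype=float)
        q25, q75 = np.percentile(data, [25, 75])
        return 2 * (q75 - q25) / data.size ** (1 / 3)

    # illustrative sample: 1000 draws from a unit normal
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, 1000)
    print(scott_width(x), freedman_width(x))

For normally distributed data the two widths are similar, with the
Freedman-Diaconis width somewhat narrower. Because it relies on the
interquartile range rather than the standard deviation, it is less sensitive
to heavy tails.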

The following figure shows the results of these two rules on the above dataset:

.. plot::
   :align: center
   :include-source:

    import numpy as np
    import matplotlib.pyplot as plt
    from astropy.visualization import hist

    # generate some complicated data
    rng = np.random.default_rng(0)
    t = np.concatenate([-5 + 1.8 * rng.standard_cauchy(500),
                        -4 + 0.8 * rng.standard_cauchy(2000),
                        -1 + 0.3 * rng.standard_cauchy(500),
                        2 + 0.8 * rng.standard_cauchy(1000),
                        4 + 1.5 * rng.standard_cauchy(1000)])

    # truncate to a reasonable range
    t = t[(t > -15) & (t < 15)]

    # draw histograms using the two normal reference rules
    fig, ax = plt.subplots(1, 2, figsize=(10, 4))

    fig.subplots_adjust(left=0.1, right=0.95, bottom=0.15)
    for i, bins in enumerate(['scott', 'freedman']):
        hist(t, bins=bins, ax=ax[i], histtype='stepfilled',
             alpha=0.2, density=True)
        ax[i].set_xlabel('t')
        ax[i].set_ylabel('P(t)')
        ax[i].set_title(f'hist(t, bins="{bins}")',
                        fontdict=dict(family='monospace'))


As we can see, both rules of thumb choose an intermediate number of bins that
provides a good trade-off between data representation and noise suppression.

Bayesian Models
===============

Though rules of thumb like Scott's rule and the Freedman-Diaconis rule are
fast and convenient, their strong assumptions about the data make them
suboptimal for more complicated distributions. Other methods of bin selection
use fitness functions computed on the actual data to choose an optimal binning.
Astropy implements two such methods: Knuth's rule (implemented in
:func:`~astropy.stats.knuth_bin_width`) and Bayesian Blocks (implemented in
:func:`~astropy.stats.bayesian_blocks`).

Knuth's rule chooses a constant bin width that minimizes the error of the
histogram's approximation to the data, while Bayesian Blocks is more flexible
and allows the bin widths to vary. Because both require optimizing a fitness
function over the dataset, they are more computationally intensive than the
rules of thumb mentioned above. Here are the results of these procedures for
the above dataset:

.. plot::
   :align: center
   :include-source:

    import warnings
    import numpy as np
    import matplotlib.pyplot as plt
    from astropy.visualization import hist

    # generate some complicated data
    rng = np.random.default_rng(0)
    t = np.concatenate([-5 + 1.8 * rng.standard_cauchy(500),
                        -4 + 0.8 * rng.standard_cauchy(2000),
                        -1 + 0.3 * rng.standard_cauchy(500),
                        2 + 0.8 * rng.standard_cauchy(1000),
                        4 + 1.5 * rng.standard_cauchy(1000)])

    # truncate to a reasonable range
    t = t[(t > -15) & (t < 15)]

    # draw histograms using Knuth's rule and Bayesian Blocks
    fig, ax = plt.subplots(1, 2, figsize=(10, 4))

    fig.subplots_adjust(left=0.1, right=0.95, bottom=0.15)
    for i, bins in enumerate(['knuth', 'blocks']):
        hist(t, bins=bins, ax=ax[i], histtype='stepfilled',
             alpha=0.2, density=True)
        ax[i].set_xlabel('t')
        ax[i].set_ylabel('P(t)')
        ax[i].set_title(f'hist(t, bins="{bins}")',
                        fontdict=dict(family='monospace'))


Notice that both of these capture the shape of the distribution well, and
that the ``bins='blocks'`` panel uses bin widths that vary with the local
structure of the data. Compared to standard defaults, these Bayesian methods
provide a much more principled means of choosing histogram binning.
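
The bin edges behind such panels can also be computed directly. The following
sketch calls :func:`~astropy.stats.knuth_bin_width` and
:func:`~astropy.stats.bayesian_blocks` on a small synthetic sample, which is an
illustrative assumption chosen to keep the run time short:

.. code-block:: python

    import numpy as np
    from astropy.stats import bayesian_blocks, knuth_bin_width

    # illustrative sample: a broad component plus a narrow peak
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0.0, 2.0, 300),
                        rng.normal(5.0, 0.2, 200)])

    # Knuth's rule: a single bin width, plus the matching uniform edges
    width, knuth_edges = knuth_bin_width(x, return_bins=True)

    # Bayesian Blocks: edges of blocks that may differ in width
    block_edges = bayesian_blocks(x, fitness='events')

    print('Knuth bin width:', width)
    print('Bayesian Blocks widths:', np.diff(block_edges))

Either array of edges can then be passed as the ``bins`` argument of
``plt.hist`` or ``np.histogram``.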