1.. _astropy-visualization-hist:
2
3***********************
4Choosing Histogram Bins
5***********************
6
7The :mod:`astropy.visualization` module provides the
8:func:`~astropy.visualization.hist` function, which is a generalization of
9matplotlib's histogram function which allows for more flexible specification
10of histogram bins. For computing bins without the accompanying plot, see
11:func:`astropy.stats.histogram`.
12
13As a motivation for this, consider the following two histograms, which are
14constructed from the same underlying set of 5000 points, the first with
15matplotlib's default of 10 bins, the second with an arbitrarily chosen
16200 bins:
17
18.. plot::
19   :align: center
20   :include-source:
21
22    import numpy as np
23    import matplotlib.pyplot as plt
24
25    # generate some complicated data
26    rng = np.random.default_rng(0)
27    t = np.concatenate([-5 + 1.8 * rng.standard_cauchy(500),
28                        -4 + 0.8 * rng.standard_cauchy(2000),
29                        -1 + 0.3 * rng.standard_cauchy(500),
30                        2 + 0.8 * rng.standard_cauchy(1000),
31                        4 + 1.5 * rng.standard_cauchy(1000)])
32
33    # truncate to a reasonable range
34    t = t[(t > -15) & (t < 15)]
35
36    # draw histograms with two different bin widths
37    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
38
39    fig.subplots_adjust(left=0.1, right=0.95, bottom=0.15)
40    for i, bins in enumerate([10, 200]):
41        ax[i].hist(t, bins=bins, histtype='stepfilled', alpha=0.2, density=True)
42        ax[i].set_xlabel('t')
43        ax[i].set_ylabel('P(t)')
44        ax[i].set_title(f'plt.hist(t, bins={bins})',
45                        fontdict=dict(family='monospace'))
46
47Upon visual inspection, it is clear that each of these choices is suboptimal:
48with 10 bins, the fine structure of the data distribution is lost, while with
49200 bins, heights of individual bins are affected by sampling error.
50The tried-and-true method employed by most scientists is a trial and error
51approach that attempts to find a suitable midpoint between these.
52
53Astropy's :func:`~astropy.visualization.hist` function addresses this by
54providing several methods of automatically tuning the histogram bin size.
55It has a syntax identical to matplotlib's ``plt.hist`` function, with the
56exception of the ``bins`` parameter, which allows specification of one of
57four different methods for automatic bin selection. These methods are
58implemented in :func:`astropy.stats.histogram`, which has a similar syntax
59to the ``np.histogram`` function.
60
61Normal Reference Rules
62======================
63The simplest methods of tuning the number of bins are the normal reference
64rules due to Scott (implemented in :func:`~astropy.stats.scott_bin_width`) and
65Freedman & Diaconis (implemented in :func:`~astropy.stats.freedman_bin_width`).
66These rules proceed by assuming the data is close to normally-distributed, and
67applying a rule-of-thumb intended to minimize the difference between the
68histogram and the underlying distribution of data.
69
70The following figure shows the results of these two rules on the above dataset:
71
72.. plot::
73   :align: center
74   :include-source:
75
76    import numpy as np
77    import matplotlib.pyplot as plt
78    from astropy.visualization import hist
79
80    # generate some complicated data
81    rng = np.random.default_rng(0)
82    t = np.concatenate([-5 + 1.8 * rng.standard_cauchy(500),
83                        -4 + 0.8 * rng.standard_cauchy(2000),
84                        -1 + 0.3 * rng.standard_cauchy(500),
85                        2 + 0.8 * rng.standard_cauchy(1000),
86                        4 + 1.5 * rng.standard_cauchy(1000)])
87
88    # truncate to a reasonable range
89    t = t[(t > -15) & (t < 15)]
90
91    # draw histograms with two different bin widths
92    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
93
94    fig.subplots_adjust(left=0.1, right=0.95, bottom=0.15)
95    for i, bins in enumerate(['scott', 'freedman']):
96        hist(t, bins=bins, ax=ax[i], histtype='stepfilled',
97             alpha=0.2, density=True)
98        ax[i].set_xlabel('t')
99        ax[i].set_ylabel('P(t)')
100        ax[i].set_title(f'hist(t, bins="{bins}")',
101                        fontdict=dict(family='monospace'))
102
103
104As we can see, both of these rules of thumb choose an intermediate number of
105bins which provide a good trade-off between data representation and noise
106suppression.
107
108Bayesian Models
109===============
110
111Though rules-of-thumb like Scott's rule and the Freedman-Diaconis rule are
112fast and convenient, their strong assumptions about the data make them
113suboptimal for more complicated distributions. Other methods of bin selection
114use fitness functions computed on the actual data to choose an optimal binning.
115Astropy implements two of these examples: Knuth's rule (implemented in
116:func:`~astropy.stats.knuth_bin_width`) and Bayesian Blocks (implemented in
117:func:`~astropy.stats.bayesian_blocks`).
118
119Knuth's rule chooses a constant bin size which minimizes the error of the
120histogram's approximation to the data, while the Bayesian Blocks uses a more
121flexible method which allows varying bin widths. Because both of these require
122the minimization of a cost function across the dataset, they are more
123computationally intensive than the rules-of-thumb mentioned above. Here are
124the results of these procedures for the above dataset:
125
126.. plot::
127   :align: center
128   :include-source:
129
130    import warnings
131    import numpy as np
132    import matplotlib.pyplot as plt
133    from astropy.visualization import hist
134
135    # generate some complicated data
136    rng = np.random.default_rng(0)
137    t = np.concatenate([-5 + 1.8 * rng.standard_cauchy(500),
138                        -4 + 0.8 * rng.standard_cauchy(2000),
139                        -1 + 0.3 * rng.standard_cauchy(500),
140                        2 + 0.8 * rng.standard_cauchy(1000),
141                        4 + 1.5 * rng.standard_cauchy(1000)])
142
143    # truncate to a reasonable range
144    t = t[(t > -15) & (t < 15)]
145
146    # draw histograms with two different bin widths
147    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
148
149    fig.subplots_adjust(left=0.1, right=0.95, bottom=0.15)
150    for i, bins in enumerate(['knuth', 'blocks']):
151        hist(t, bins=bins, ax=ax[i], histtype='stepfilled',
152                alpha=0.2, density=True)
153        ax[i].set_xlabel('t')
154        ax[i].set_ylabel('P(t)')
155        ax[i].set_title(f'hist(t, bins="{bins}")',
156                        fontdict=dict(family='monospace'))
157
158
159Notice that both of these capture the shape of the distribution very
160accurately, and that the ``bins='blocks'`` panel selects bin widths which vary
161in width depending on the local structure in the data. Compared to standard
162defaults, these Bayesian optimization methods provide a much more principled
163means of choosing histogram binning.
164