1.. _astropy-visualization-hist: 2 3*********************** 4Choosing Histogram Bins 5*********************** 6 7The :mod:`astropy.visualization` module provides the 8:func:`~astropy.visualization.hist` function, which is a generalization of 9matplotlib's histogram function which allows for more flexible specification 10of histogram bins. For computing bins without the accompanying plot, see 11:func:`astropy.stats.histogram`. 12 13As a motivation for this, consider the following two histograms, which are 14constructed from the same underlying set of 5000 points, the first with 15matplotlib's default of 10 bins, the second with an arbitrarily chosen 16200 bins: 17 18.. plot:: 19 :align: center 20 :include-source: 21 22 import numpy as np 23 import matplotlib.pyplot as plt 24 25 # generate some complicated data 26 rng = np.random.default_rng(0) 27 t = np.concatenate([-5 + 1.8 * rng.standard_cauchy(500), 28 -4 + 0.8 * rng.standard_cauchy(2000), 29 -1 + 0.3 * rng.standard_cauchy(500), 30 2 + 0.8 * rng.standard_cauchy(1000), 31 4 + 1.5 * rng.standard_cauchy(1000)]) 32 33 # truncate to a reasonable range 34 t = t[(t > -15) & (t < 15)] 35 36 # draw histograms with two different bin widths 37 fig, ax = plt.subplots(1, 2, figsize=(10, 4)) 38 39 fig.subplots_adjust(left=0.1, right=0.95, bottom=0.15) 40 for i, bins in enumerate([10, 200]): 41 ax[i].hist(t, bins=bins, histtype='stepfilled', alpha=0.2, density=True) 42 ax[i].set_xlabel('t') 43 ax[i].set_ylabel('P(t)') 44 ax[i].set_title(f'plt.hist(t, bins={bins})', 45 fontdict=dict(family='monospace')) 46 47Upon visual inspection, it is clear that each of these choices is suboptimal: 48with 10 bins, the fine structure of the data distribution is lost, while with 49200 bins, heights of individual bins are affected by sampling error. 50The tried-and-true method employed by most scientists is a trial and error 51approach that attempts to find a suitable midpoint between these. 52 53Astropy's :func:`~astropy.visualization.hist` function addresses this by 54providing several methods of automatically tuning the histogram bin size. 55It has a syntax identical to matplotlib's ``plt.hist`` function, with the 56exception of the ``bins`` parameter, which allows specification of one of 57four different methods for automatic bin selection. These methods are 58implemented in :func:`astropy.stats.histogram`, which has a similar syntax 59to the ``np.histogram`` function. 60 61Normal Reference Rules 62====================== 63The simplest methods of tuning the number of bins are the normal reference 64rules due to Scott (implemented in :func:`~astropy.stats.scott_bin_width`) and 65Freedman & Diaconis (implemented in :func:`~astropy.stats.freedman_bin_width`). 66These rules proceed by assuming the data is close to normally-distributed, and 67applying a rule-of-thumb intended to minimize the difference between the 68histogram and the underlying distribution of data. 69 70The following figure shows the results of these two rules on the above dataset: 71 72.. plot:: 73 :align: center 74 :include-source: 75 76 import numpy as np 77 import matplotlib.pyplot as plt 78 from astropy.visualization import hist 79 80 # generate some complicated data 81 rng = np.random.default_rng(0) 82 t = np.concatenate([-5 + 1.8 * rng.standard_cauchy(500), 83 -4 + 0.8 * rng.standard_cauchy(2000), 84 -1 + 0.3 * rng.standard_cauchy(500), 85 2 + 0.8 * rng.standard_cauchy(1000), 86 4 + 1.5 * rng.standard_cauchy(1000)]) 87 88 # truncate to a reasonable range 89 t = t[(t > -15) & (t < 15)] 90 91 # draw histograms with two different bin widths 92 fig, ax = plt.subplots(1, 2, figsize=(10, 4)) 93 94 fig.subplots_adjust(left=0.1, right=0.95, bottom=0.15) 95 for i, bins in enumerate(['scott', 'freedman']): 96 hist(t, bins=bins, ax=ax[i], histtype='stepfilled', 97 alpha=0.2, density=True) 98 ax[i].set_xlabel('t') 99 ax[i].set_ylabel('P(t)') 100 ax[i].set_title(f'hist(t, bins="{bins}")', 101 fontdict=dict(family='monospace')) 102 103 104As we can see, both of these rules of thumb choose an intermediate number of 105bins which provide a good trade-off between data representation and noise 106suppression. 107 108Bayesian Models 109=============== 110 111Though rules-of-thumb like Scott's rule and the Freedman-Diaconis rule are 112fast and convenient, their strong assumptions about the data make them 113suboptimal for more complicated distributions. Other methods of bin selection 114use fitness functions computed on the actual data to choose an optimal binning. 115Astropy implements two of these examples: Knuth's rule (implemented in 116:func:`~astropy.stats.knuth_bin_width`) and Bayesian Blocks (implemented in 117:func:`~astropy.stats.bayesian_blocks`). 118 119Knuth's rule chooses a constant bin size which minimizes the error of the 120histogram's approximation to the data, while the Bayesian Blocks uses a more 121flexible method which allows varying bin widths. Because both of these require 122the minimization of a cost function across the dataset, they are more 123computationally intensive than the rules-of-thumb mentioned above. Here are 124the results of these procedures for the above dataset: 125 126.. plot:: 127 :align: center 128 :include-source: 129 130 import warnings 131 import numpy as np 132 import matplotlib.pyplot as plt 133 from astropy.visualization import hist 134 135 # generate some complicated data 136 rng = np.random.default_rng(0) 137 t = np.concatenate([-5 + 1.8 * rng.standard_cauchy(500), 138 -4 + 0.8 * rng.standard_cauchy(2000), 139 -1 + 0.3 * rng.standard_cauchy(500), 140 2 + 0.8 * rng.standard_cauchy(1000), 141 4 + 1.5 * rng.standard_cauchy(1000)]) 142 143 # truncate to a reasonable range 144 t = t[(t > -15) & (t < 15)] 145 146 # draw histograms with two different bin widths 147 fig, ax = plt.subplots(1, 2, figsize=(10, 4)) 148 149 fig.subplots_adjust(left=0.1, right=0.95, bottom=0.15) 150 for i, bins in enumerate(['knuth', 'blocks']): 151 hist(t, bins=bins, ax=ax[i], histtype='stepfilled', 152 alpha=0.2, density=True) 153 ax[i].set_xlabel('t') 154 ax[i].set_ylabel('P(t)') 155 ax[i].set_title(f'hist(t, bins="{bins}")', 156 fontdict=dict(family='monospace')) 157 158 159Notice that both of these capture the shape of the distribution very 160accurately, and that the ``bins='blocks'`` panel selects bin widths which vary 161in width depending on the local structure in the data. Compared to standard 162defaults, these Bayesian optimization methods provide a much more principled 163means of choosing histogram binning. 164