.. _groupby:

GroupBy: split-apply-combine
----------------------------

Xarray supports `"group by"`__ operations with the same API as pandas to
implement the `split-apply-combine`__ strategy:

__ http://pandas.pydata.org/pandas-docs/stable/groupby.html
__ http://www.jstatsoft.org/v40/i01/paper

- Split your data into multiple independent groups.
- Apply some function to each group.
- Combine your groups back into a single data object.
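
The three steps above can be sketched by hand, which may help clarify what a
groupby does for you. This is only an illustration on a small made-up array
(the data and variable names are assumptions, not part of xarray), not a
description of how xarray implements grouping internally:

.. code-block:: python

    import numpy as np
    import pandas as pd
    import xarray as xr

    # Hypothetical data: four values along "x", each tagged with a letter.
    da = xr.DataArray(
        [1.0, 2.0, 4.0, 8.0],
        dims="x",
        coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    )

    # Split: pick out the positions that belong to each unique label.
    labels = np.unique(da["letters"].values)
    groups = [
        da.isel(x=np.flatnonzero(da["letters"].values == label)) for label in labels
    ]

    # Apply: reduce every group to its mean along "x".
    means = [group.mean("x") for group in groups]

    # Combine: concatenate the per-group results along a new "letters" dimension.
    manual = xr.concat(means, dim=pd.Index(labels, name="letters"))

    # The same pipeline expressed with groupby.
    grouped = da.groupby("letters").mean()

Both ``manual`` and ``grouped`` should hold one mean per letter.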

Group by operations work on both :py:class:`~xarray.Dataset` and
:py:class:`~xarray.DataArray` objects. Most of the examples focus on grouping by
a single one-dimensional variable, although grouping over a multi-dimensional
variable is also supported (see :ref:`groupby.multidim`). Note that for
one-dimensional data, it is usually faster to rely on pandas' implementation of
the same pipeline.

Split
~~~~~

Let's create a simple example dataset:

.. ipython:: python
    :suppress:

    import numpy as np
    import pandas as pd
    import xarray as xr

    np.random.seed(123456)

.. ipython:: python

    ds = xr.Dataset(
        {"foo": (("x", "y"), np.random.rand(4, 3))},
        coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    )
    arr = ds["foo"]
    ds

If we group by the name of a variable or coordinate in a dataset (we can also
use a DataArray directly), we get back a ``GroupBy`` object:

.. ipython:: python

    ds.groupby("letters")

This object works very similarly to a pandas GroupBy object. You can view
the group indices with the ``groups`` attribute:

.. ipython:: python

    ds.groupby("letters").groups

You can also iterate over groups in ``(label, group)`` pairs:

.. ipython:: python

    list(ds.groupby("letters"))

You can index out a particular group:

.. ipython:: python

    ds.groupby("letters")["b"]

Just like in pandas, creating a GroupBy object is cheap: it does not actually
split the data until you access particular values.
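
As a small self-contained sketch of the inspection tools described above, the
following uses a made-up one-dimensional array (an assumption for
illustration) to collect the group positions and the size of each group:

.. code-block:: python

    import xarray as xr

    # Hypothetical data: the labels "a" and "b" each occur twice along "x".
    da = xr.DataArray(
        [1.0, 2.0, 4.0, 8.0],
        dims="x",
        coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    )
    grouped = da.groupby("letters")

    # ``groups`` maps each label to the integer positions of its members.
    positions = {
        str(label): [int(i) for i in idx] for label, idx in grouped.groups.items()
    }

    # Iterating yields (label, group) pairs; each group is itself a DataArray.
    sizes = {str(label): group.sizes["x"] for label, group in grouped}

Here ``positions`` records where each label occurs along ``x``, and ``sizes``
records how many elements each group contains.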

Binning
~~~~~~~

Sometimes you don't want to use all the unique values to determine the groups
but instead want to "bin" the data into coarser groups. You could always create
a customized coordinate, but xarray facilitates this via the
:py:meth:`~xarray.Dataset.groupby_bins` method.

.. ipython:: python

    x_bins = [0, 25, 50]
    ds.groupby_bins("x", x_bins).groups

The binning is implemented via :func:`pandas.cut`, whose documentation details how
the bins are assigned. As seen in the example above, by default, the bins are
labeled with strings using set notation to precisely identify the bin limits. To
override this behavior, you can specify the bin labels explicitly. Here we
choose ``float`` labels which identify the bin centers:

.. ipython:: python

    x_bin_labels = [12.5, 37.5]
    ds.groupby_bins("x", x_bins, labels=x_bin_labels).groups
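
To make the bin assignment concrete, here is a minimal sketch on a made-up
array (the data are an assumption for illustration). It calls
:func:`pandas.cut` directly to show which bin each ``x`` value falls into, and
then aggregates within the same bins using explicit labels:

.. code-block:: python

    import numpy as np
    import pandas as pd
    import xarray as xr

    # Hypothetical data along a single dimension "x".
    da = xr.DataArray([1.0, 2.0, 4.0, 8.0], dims="x", coords={"x": [10, 20, 30, 40]})

    # pandas.cut uses right-closed intervals by default: (0, 25] and (25, 50].
    # ``codes`` holds the integer bin number assigned to each "x" value.
    codes = pd.cut(da["x"].values, bins=[0, 25, 50]).codes

    # Aggregate within each bin, labelling the bins by their centers.
    binned_mean = da.groupby_bins("x", bins=[0, 25, 50], labels=[12.5, 37.5]).mean()

The result is indexed by a new ``x_bins`` dimension rather than by ``x``.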


Apply
~~~~~

To apply a function to each group, you can use the flexible
:py:meth:`~xarray.core.groupby.DatasetGroupBy.map` method. The resulting objects are automatically
concatenated back together along the group axis:

.. ipython:: python

    def standardize(x):
        return (x - x.mean()) / x.std()


    arr.groupby("letters").map(standardize)
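
The effect of ``map`` is perhaps easiest to check with a function whose result
is simple to verify by hand. The sketch below uses a made-up array and a
hypothetical helper ``center`` (both are assumptions for illustration) that
subtracts each group's own mean:

.. code-block:: python

    import numpy as np
    import xarray as xr

    # Hypothetical data: group "a" holds 1 and 8, group "b" holds 2 and 4.
    da = xr.DataArray(
        [1.0, 2.0, 4.0, 8.0],
        dims="x",
        coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    )


    def center(group):
        # Called once per group with only that group's values.
        return group - group.mean()


    centered = da.groupby("letters").map(center)

    # After centering, every group should average to zero.
    group_means = centered.groupby("letters").mean()

Because ``center`` returns an object with the same shape as its input, the
combined result keeps the ``x`` dimension of the original array.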

GroupBy objects also have a :py:meth:`~xarray.core.groupby.DatasetGroupBy.reduce` method and
methods like :py:meth:`~xarray.core.groupby.DatasetGroupBy.mean` as shortcuts for applying an
aggregation function:

.. ipython:: python

    arr.groupby("letters").mean(dim="x")

Using a groupby is thus also a convenient shortcut for aggregating over all
dimensions *other than* the provided one:

.. ipython:: python

    ds.groupby("x").std(...)

.. note::

    We use an ellipsis (``...``) here to indicate we want to reduce over all
    other dimensions.
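
As a rough illustration of how ``reduce``, the named shortcuts, and the
ellipsis relate, the following sketch uses a made-up two-dimensional array
(an assumption for illustration):

.. code-block:: python

    import numpy as np
    import xarray as xr

    # Hypothetical 4 x 2 array; rows 0 and 3 belong to "a", rows 1 and 2 to "b".
    da = xr.DataArray(
        [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 10.0]],
        dims=("x", "y"),
        coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    )

    # ``reduce`` accepts any function that takes an ``axis`` argument.
    via_reduce = da.groupby("letters").reduce(np.mean, dim="x")

    # The named shortcut performs the same aggregation.
    via_mean = da.groupby("letters").mean(dim="x")

    # An ellipsis reduces over every dimension within each group.
    overall = da.groupby("letters").mean(...)

Here ``via_reduce`` and ``via_mean`` keep the ``y`` dimension, while
``overall`` has a single value per letter.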


First and last
~~~~~~~~~~~~~~

There are two special aggregation operations that are currently only found on
groupby objects: ``first`` and ``last``. These provide the first or last value
for each group along the grouped dimension:

.. ipython:: python

    ds.groupby("letters").first(...)

By default, they skip missing values (control this with ``skipna``).
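
The role of ``skipna`` can be sketched with a made-up array whose first
element is missing (the data are an assumption for illustration):

.. code-block:: python

    import numpy as np
    import xarray as xr

    # Hypothetical data: group "a" holds NaN (at x=10) and 8 (at x=40).
    da = xr.DataArray(
        [np.nan, 2.0, 4.0, 8.0],
        dims="x",
        coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    )

    # By default the missing value is skipped when picking the first element.
    first = da.groupby("letters").first()

    # With skipna=False the first element is returned even if it is NaN.
    first_keep_nan = da.groupby("letters").first(skipna=False)

    # ``last`` works the same way from the other end of each group.
    last = da.groupby("letters").last()
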

Grouped arithmetic
~~~~~~~~~~~~~~~~~~

GroupBy objects also support a limited set of binary arithmetic operations, as
a shortcut for mapping over all unique labels. Binary arithmetic is supported
for ``(GroupBy, Dataset)`` and ``(GroupBy, DataArray)`` pairs, as long as the
dataset or data array uses the unique grouped values as one of its index
coordinates. For example:

.. ipython:: python

    alt = arr.groupby("letters").mean(...)
    alt
    ds.groupby("letters") - alt

This last line is roughly equivalent to the following::

    results = []
    for label, group in ds.groupby('letters'):
        results.append(group - alt.sel(letters=label))
    xr.concat(results, dim='x')
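
The equivalence can be checked on a small made-up array (an assumption for
illustration). The sketch computes per-group anomalies once with grouped
arithmetic and once with an explicit loop; the loop result is sorted by ``x``
because concatenating group by group does not by itself restore the original
order:

.. code-block:: python

    import numpy as np
    import xarray as xr

    # Hypothetical data: group "a" holds 1 and 8, group "b" holds 2 and 4.
    da = xr.DataArray(
        [1.0, 2.0, 4.0, 8.0],
        dims="x",
        coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    )

    # One mean per letter, indexed by the "letters" coordinate.
    means = da.groupby("letters").mean()

    # Grouped arithmetic: subtract each group's mean from its members.
    anomalies = da.groupby("letters") - means

    # The same computation written as an explicit loop over the groups.
    pieces = [
        group - means.sel(letters=label) for label, group in da.groupby("letters")
    ]
    manual = xr.concat(pieces, dim="x").sortby("x")
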

Squeezing
~~~~~~~~~

When grouping over a dimension, you can control whether the dimension is
squeezed out or if it should remain with length one on each group by using
the ``squeeze`` parameter:

.. ipython:: python

    next(iter(arr.groupby("x")))

.. ipython:: python

    next(iter(arr.groupby("x", squeeze=False)))

Although xarray will attempt to automatically
:py:meth:`~xarray.DataArray.transpose` dimensions back into their original order
when you use ``map``, it is sometimes useful to set ``squeeze=False`` to
guarantee that all original dimensions remain unchanged.

You can always squeeze explicitly later with the Dataset or DataArray
:py:meth:`~xarray.DataArray.squeeze` methods.

.. _groupby.multidim:

Multidimensional Grouping
~~~~~~~~~~~~~~~~~~~~~~~~~

Many datasets have a multidimensional coordinate variable (e.g. longitude)
which is different from the logical grid dimensions (e.g. nx, ny). Such
variables are valid under the `CF conventions`__. Xarray supports groupby
operations over multidimensional coordinate variables:

__ http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_two_dimensional_latitude_longitude_coordinate_variables

.. ipython:: python

    da = xr.DataArray(
        [[0, 1], [2, 3]],
        coords={
            "lon": (["ny", "nx"], [[30, 40], [40, 50]]),
            "lat": (["ny", "nx"], [[10, 10], [20, 20]]),
        },
        dims=["ny", "nx"],
    )
    da
    da.groupby("lon").sum(...)
    da.groupby("lon").map(lambda x: x - x.mean(), shortcut=False)

Because multidimensional groups have the ability to generate a very large
number of bins, coarse-binning via :py:meth:`~xarray.Dataset.groupby_bins`
may be desirable:

.. ipython:: python

    da.groupby_bins("lon", [0, 45, 50]).sum()
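
One way to think about grouping by a multidimensional coordinate is that every
grid cell is labelled by its coordinate value, regardless of where it sits on
the grid. The sketch below cross-checks this reading with plain NumPy on a
made-up array (the data and the NumPy comparison are assumptions for
illustration, not a description of xarray's internals):

.. code-block:: python

    import numpy as np
    import xarray as xr

    # Hypothetical 2 x 2 grid; the value 40 appears in two different cells.
    da = xr.DataArray(
        [[0, 1], [2, 4]],
        coords={"lon": (["ny", "nx"], [[30, 40], [40, 50]])},
        dims=["ny", "nx"],
    )

    # Group by the unique values of the two-dimensional "lon" coordinate.
    by_value = da.groupby("lon").sum(...)

    # NumPy cross-check: flatten the grid, label each cell by its unique
    # "lon" value, and sum the data per label.
    unique_lon, inverse = np.unique(da["lon"].values.ravel(), return_inverse=True)
    manual = np.bincount(inverse, weights=da.values.ravel())

    # Coarser grouping: the bin (0, 45] merges the cells with lon 30 and 40.
    by_bin = da.groupby_bins("lon", [0, 45, 50]).sum()
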

These methods group by ``lon`` values. It is also possible to group by each
cell in a grid, regardless of value, by stacking multiple dimensions,
applying your function, and then unstacking the result:

.. ipython:: python

    stacked = da.stack(gridcell=["ny", "nx"])
    stacked.groupby("gridcell").sum(...).unstack("gridcell")