.. _groupby:

GroupBy: split-apply-combine
----------------------------

Xarray supports `"group by"`__ operations with the same API as pandas to
implement the `split-apply-combine`__ strategy:

__ http://pandas.pydata.org/pandas-docs/stable/groupby.html
__ http://www.jstatsoft.org/v40/i01/paper

- Split your data into multiple independent groups.
- Apply some function to each group.
- Combine your groups back into a single data object.

Group by operations work on both :py:class:`~xarray.Dataset` and
:py:class:`~xarray.DataArray` objects. Most of the examples focus on grouping by
a single one-dimensional variable, although grouping over a multi-dimensional
variable is also supported. Note that for one-dimensional data, it is usually
faster to rely on pandas' implementation of the same pipeline.

Split
~~~~~

Let's create a simple example dataset:

.. ipython:: python
    :suppress:

    import numpy as np
    import pandas as pd
    import xarray as xr

    np.random.seed(123456)

.. ipython:: python

    ds = xr.Dataset(
        {"foo": (("x", "y"), np.random.rand(4, 3))},
        coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    )
    arr = ds["foo"]
    ds

If we group by the name of a variable or coordinate in a dataset (we can also
use a DataArray directly), we get back a ``GroupBy`` object:

.. ipython:: python

    ds.groupby("letters")

This object works very similarly to a pandas GroupBy object. You can view
the group indices with the ``groups`` attribute:

.. ipython:: python

    ds.groupby("letters").groups

You can also iterate over groups in ``(label, group)`` pairs:

.. ipython:: python

    list(ds.groupby("letters"))

You can index out a particular group:

.. ipython:: python

    ds.groupby("letters")["b"]

Just like in pandas, creating a GroupBy object is cheap: it does not actually
split the data until you access particular values.

Binning
~~~~~~~

Sometimes you don't want to use all the unique values to determine the groups
but instead want to "bin" the data into coarser groups. You could always create
a customized coordinate, but xarray facilitates this via the
:py:meth:`~xarray.Dataset.groupby_bins` method.

.. ipython:: python

    x_bins = [0, 25, 50]
    ds.groupby_bins("x", x_bins).groups

The binning is implemented via :func:`pandas.cut`, whose documentation details how
the bins are assigned. As seen in the example above, by default, the bins are
labeled with strings using set notation to precisely identify the bin limits. To
override this behavior, you can specify the bin labels explicitly. Here we
choose ``float`` labels which identify the bin centers:

.. ipython:: python

    x_bin_labels = [12.5, 37.5]
    ds.groupby_bins("x", x_bins, labels=x_bin_labels).groups


Apply
~~~~~

To apply a function to each group, you can use the flexible
:py:meth:`~xarray.core.groupby.DatasetGroupBy.map` method. The resulting objects are automatically
concatenated back together along the group axis:

.. ipython:: python

    def standardize(x):
        return (x - x.mean()) / x.std()


    arr.groupby("letters").map(standardize)

GroupBy objects also have a :py:meth:`~xarray.core.groupby.DatasetGroupBy.reduce` method and
methods like :py:meth:`~xarray.core.groupby.DatasetGroupBy.mean` as shortcuts for applying an
aggregation function:

.. ipython:: python

    arr.groupby("letters").mean(dim="x")

Using a groupby is thus also a convenient shortcut for aggregating over all
dimensions *other than* the provided one:

.. ipython:: python

    ds.groupby("x").std(...)

.. note::

    We use an ellipsis (``...``) here to indicate we want to reduce over all
    other dimensions.


First and last
~~~~~~~~~~~~~~

There are two special aggregation operations that are currently only found on
groupby objects: ``first`` and ``last``. These provide the first or last value
for each group along the grouped dimension:

.. ipython:: python

    ds.groupby("letters").first(...)

By default, they skip missing values (control this with ``skipna``).

Grouped arithmetic
~~~~~~~~~~~~~~~~~~

GroupBy objects also support a limited set of binary arithmetic operations, as
a shortcut for mapping over all unique labels. Binary arithmetic is supported
for ``(GroupBy, Dataset)`` and ``(GroupBy, DataArray)`` pairs, as long as the
dataset or data array uses the unique grouped values as one of its index
coordinates. For example:

.. ipython:: python

    alt = arr.groupby("letters").mean(...)
    alt
    ds.groupby("letters") - alt

This last line is roughly equivalent to the following::

    results = []
    for label, group in ds.groupby('letters'):
        results.append(group - alt.sel(letters=label))
    xr.concat(results, dim='x')

Squeezing
~~~~~~~~~

When grouping over a dimension, you can control whether the dimension is
squeezed out or if it should remain with length one on each group by using
the ``squeeze`` parameter:

.. ipython:: python

    next(iter(arr.groupby("x")))

.. ipython:: python

    next(iter(arr.groupby("x", squeeze=False)))

Although xarray will attempt to automatically
:py:meth:`~xarray.DataArray.transpose` dimensions back into their original order
when you use ``map``, it is sometimes useful to set ``squeeze=False`` to
guarantee that all original dimensions remain unchanged.

You can always squeeze explicitly later with the Dataset or DataArray
:py:meth:`~xarray.DataArray.squeeze` methods.
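
As a minimal sketch of squeezing explicitly, the example below uses a length-one slice along ``x`` as a stand-in for a single unsqueezed group. The array ``arr`` here is a small illustrative array built with ``np.arange``, not the random data used earlier on this page, and the names ``one_group`` and ``squeezed`` are hypothetical:

```python
import numpy as np
import xarray as xr

# Illustrative array; the values are arbitrary and chosen for readability.
arr = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("x", "y"),
    coords={"x": [10, 20, 30, 40]},
)

# Indexing with a list keeps "x" as a length-one dimension,
# which mimics a group that has not been squeezed.
one_group = arr.isel(x=[0])

# Dropping the length-one dimension explicitly leaves "x" as a scalar coordinate.
squeezed = one_group.squeeze("x")

print(one_group.dims, one_group.shape)
print(squeezed.dims, squeezed.shape)
```

After the call to ``squeeze``, only the ``y`` dimension remains, while the ``x`` label of the selected slice is still available as a scalar coordinate.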

.. _groupby.multidim:

Multidimensional Grouping
~~~~~~~~~~~~~~~~~~~~~~~~~

Many datasets have a multidimensional coordinate variable (e.g. longitude)
which is different from the logical grid dimensions (e.g. ``nx``, ``ny``). Such
variables are valid under the `CF conventions`__. Xarray supports groupby
operations over multidimensional coordinate variables:

__ http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_two_dimensional_latitude_longitude_coordinate_variables

.. ipython:: python

    da = xr.DataArray(
        [[0, 1], [2, 3]],
        coords={
            "lon": (["ny", "nx"], [[30, 40], [40, 50]]),
            "lat": (["ny", "nx"], [[10, 10], [20, 20]]),
        },
        dims=["ny", "nx"],
    )
    da
    da.groupby("lon").sum(...)
    da.groupby("lon").map(lambda x: x - x.mean(), shortcut=False)

Because multidimensional groups can generate a very large number of bins,
coarse-binning via :py:meth:`~xarray.Dataset.groupby_bins` may be desirable:

.. ipython:: python

    da.groupby_bins("lon", [0, 45, 50]).sum()

These methods group by ``lon`` values. It is also possible to group by each
cell in a grid, regardless of value, by stacking multiple dimensions,
applying your function, and then unstacking the result:

.. ipython:: python

    stacked = da.stack(gridcell=["ny", "nx"])
    stacked.groupby("gridcell").sum(...).unstack("gridcell")
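
To make the grouping by ``lon`` values concrete, the sketch below rebuilds the same small array and compares the grouped sum against a manual computation with NumPy boolean masks. The names ``grouped_sum`` and ``manual`` are illustrative only, and the mask-based loop is just one way to spell out what the grouping does; it is not how xarray implements it:

```python
import numpy as np
import xarray as xr

# Same 2x2 array as above, with two-dimensional lon/lat coordinates.
da = xr.DataArray(
    [[0, 1], [2, 3]],
    coords={
        "lon": (["ny", "nx"], [[30, 40], [40, 50]]),
        "lat": (["ny", "nx"], [[10, 10], [20, 20]]),
    },
    dims=["ny", "nx"],
)

# Group every grid cell by its lon value and sum within each group.
grouped_sum = da.groupby("lon").sum(...)

# Manual equivalent: for each unique lon value, sum the cells that share it.
lon_values = da["lon"].values
manual = {
    int(value): int(da.values[lon_values == value].sum())
    for value in np.unique(lon_values)
}

print(grouped_sum.values)
print(manual)
```

The two cells with ``lon`` equal to 40 fall into the same group even though they sit in different rows and columns of the grid.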