=============================================
 MEP28: Remove Complexity from Axes.boxplot
=============================================

.. contents::
   :local:


Status
======
**Discussion**

Branches and Pull requests
==========================

The following lists any open PRs or branches related to this MEP:

#. Deprecate redundant statistical kwargs in ``Axes.boxplot``: https://github.com/phobson/matplotlib/tree/MEP28-initial-deprecations
#. Deprecate redundant style options in ``Axes.boxplot``: https://github.com/phobson/matplotlib/tree/MEP28-initial-deprecations
#. Deprecate passing 2D numpy arrays as input: None
#. Add pre- & post-processing options to ``cbook.boxplot_stats``: https://github.com/phobson/matplotlib/tree/boxplot-stat-transforms
#. Exposing ``cbook.boxplot_stats`` through ``Axes.boxplot`` kwargs: None
#. Remove redundant statistical kwargs in ``Axes.boxplot``: None
#. Remove redundant style options in ``Axes.boxplot``: None
#. Remaining items that arise through discussion: None

Abstract
========

Over the past few releases, the ``Axes.boxplot`` method has grown in
complexity to support fully customizable artist styling and statistical
computation. This led to ``Axes.boxplot`` being split into multiple
parts. The statistics needed to draw a boxplot are computed in
``cbook.boxplot_stats``, while the actual artists are drawn by ``Axes.bxp``.
The original method, ``Axes.boxplot``, remains as the most public API that
handles passing the user-supplied data to ``cbook.boxplot_stats``, feeding
the results to ``Axes.bxp``, and pre-processing style information for
each facet of the boxplots.

This MEP outlines a path forward to roll back the added complexity
and simplify the API while maintaining reasonable backwards
compatibility.

Detailed description
====================

Currently, the ``Axes.boxplot`` method accepts parameters that allow
users to specify medians and confidence intervals for each box that
will be drawn in the plot. These were provided so that advanced users
could provide statistics computed in a different fashion than the simple
method provided by matplotlib. However, handling this input requires
complex logic to make sure that the forms of the data structure match what
needs to be drawn. At the moment, that logic contains 9 separate if/else
statements nested up to 5 levels deep with a for loop, and may raise up to 2 errors.
These parameters were added prior to the creation of the ``Axes.bxp`` method,
which draws boxplots from a list of dictionaries containing the relevant
statistics. Matplotlib also provides a function that computes these
statistics via ``cbook.boxplot_stats``. Note that advanced users can now
either a) write their own function to compute the stats required by
``Axes.bxp``, or b) modify the output returned by ``cbook.boxplot_stats``
to fully customize the position of the artists of the plots. With this
flexibility, the parameters to manually specify only the medians and their
confidence intervals remain for backwards compatibility.
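
As a hedged illustration of route (b), the sketch below computes the default
statistics, overwrites the median and its confidence interval, and draws the
result with ``Axes.bxp``. The replacement values (a geometric mean with
bounds 20% either side) are placeholders chosen for this example, not a
recommended method, and the ``Agg`` backend is selected only so that the
snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cbook

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=101)

# boxplot_stats returns a list with one dictionary per dataset.
stats = cbook.boxplot_stats(data)

# Overwrite the median and its confidence interval with externally
# computed values, which is what usermedians/conf_intervals were used for.
custom_median = float(np.exp(np.mean(np.log(data))))  # geometric mean
stats[0]["med"] = custom_median
stats[0]["cilo"] = custom_median * 0.8  # illustrative bounds only
stats[0]["cihi"] = custom_median * 1.2

fig, ax = plt.subplots()
artists = ax.bxp(stats, shownotches=True)  # notches use cilo/cihi
```

Because the dictionaries are plain Python objects, any other entry (for
example ``whislo`` or ``fliers``) could be adjusted in the same way before
drawing.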

Around the same time that the two roles of ``Axes.boxplot`` were split into
``cbook.boxplot_stats`` for computation and ``Axes.bxp`` for drawing, both
``Axes.boxplot`` and ``Axes.bxp`` were written to accept parameters that
individually toggle the drawing of all components of the boxplots, and
parameters that individually configure the style of those artists. However,
to maintain backwards compatibility, the ``sym`` parameter (previously used
to specify the symbol of the fliers) was retained. This parameter itself
requires fairly complex logic to reconcile the ``sym`` parameter with the
newer ``flierprops`` parameter and the default style specified by ``matplotlibrc``.
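
To make the overlap concrete, the following sketch draws the same data
twice, once with the legacy ``sym`` format string and once with
``flierprops``. The data and the specific marker and color are arbitrary
choices for illustration, and the ``Agg`` backend is used only so the
snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Normal data plus two values far enough out to be drawn as fliers.
data = np.append(rng.normal(size=50), [8.0, -9.0])

fig, (ax_old, ax_new) = plt.subplots(ncols=2)

# Legacy spelling: one format string packs color and marker together.
old = ax_old.boxplot(data, sym="r+")

# flierprops spelling: each Line2D property is named explicitly.
new = ax_new.boxplot(
    data,
    flierprops=dict(marker="+", markeredgecolor="r", markerfacecolor="r"),
)

old_flier, new_flier = old["fliers"][0], new["fliers"][0]
print(old_flier.get_marker(), new_flier.get_marker())
```

Both calls end up configuring the same flier artist, which is why ``sym`` is
a candidate for removal.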

This MEP seeks to dramatically simplify the creation of boxplots for
novice and advanced users alike. Importantly, the changes proposed here
will also be available to downstream packages like seaborn, as seaborn
smartly allows users to pass arbitrary dictionaries of parameters through
the seaborn API to the underlying matplotlib functions.

This will be achieved in the following way:

  1. ``cbook.boxplot_stats`` will be modified to allow pre- and
     post-computation transformation functions to be passed in (e.g.,
     ``np.log`` and ``np.exp`` for lognormally distributed data)
  2. ``Axes.boxplot`` will be modified to also accept and naïvely pass them
     to ``cbook.boxplot_stats`` (Alt: pass the stat function and a dict
     of its optional parameters).
  3. Outdated parameters from ``Axes.boxplot`` will be deprecated and
     later removed.

Importance
----------

Since the limits of the whiskers are computed arithmetically, there
is an implicit assumption of normality in box and whisker plots.
This primarily affects which data points are classified as outliers.

Allowing transformations to the data and the results used to draw
boxplots will allow users to opt out of that assumption if the
data are known to not fit a normal distribution.

Below is an example of how ``Axes.boxplot`` classifies outliers of lognormal
data differently depending on these types of transforms.

.. plot::
   :include-source: true

   import numpy as np
   import matplotlib.pyplot as plt
   from matplotlib import cbook
   np.random.seed(0)

   fig, ax = plt.subplots(figsize=(4, 6))
   ax.set_yscale('log')
   data = np.random.lognormal(-1.75, 2.75, size=37)

   stats = cbook.boxplot_stats(data, labels=['arithmetic'])
   logstats = cbook.boxplot_stats(np.log(data), labels=['log-transformed'])

   for lsdict in logstats:
       for key, value in lsdict.items():
           if key != 'label':
               lsdict[key] = np.exp(value)

   stats.extend(logstats)
   ax.bxp(stats)
   fig.show()

Implementation
==============

Passing transform functions to ``cbook.boxplot_stats``
-------------------------------------------------------

This MEP proposes that two parameters (e.g., ``transform_in`` and
``transform_out``) be added to the cookbook function that computes the
statistics for the boxplot function. These will be optional keyword-only
arguments and can easily be set to ``lambda x: x`` as a no-op when omitted
by the user. The ``transform_in`` function will be applied to the data
as the ``boxplot_stats`` function loops through each subset of the data
passed to it. After the list of statistics dictionaries is computed, the
``transform_out`` function is applied to each value in the dictionaries.
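
The behavior described above can be approximated today with a thin wrapper
around the existing function. The sketch below is only an illustration:
``transformed_boxplot_stats`` is a hypothetical name, not part of
matplotlib, and it expects a list of 1D datasets rather than every input
form that ``cbook.boxplot_stats`` accepts.

```python
import numpy as np
from matplotlib import cbook


def transformed_boxplot_stats(datasets, transform_in=None,
                              transform_out=None, **kwargs):
    """Hypothetical wrapper sketching the proposed transform parameters."""
    # Default both transforms to the identity so they are no-ops.
    if transform_in is None:
        transform_in = lambda x: x
    if transform_out is None:
        transform_out = lambda x: x

    output = []
    for subset in datasets:
        # Transform one dataset, then compute its statistics.
        transformed = transform_in(np.asarray(subset, dtype=float))
        stat_dict = cbook.boxplot_stats(transformed, **kwargs)[0]
        # Map every statistic except the label back to the original scale.
        for key, value in stat_dict.items():
            if key != "label":
                stat_dict[key] = transform_out(value)
        output.append(stat_dict)
    return output


rng = np.random.default_rng(0)
data = rng.lognormal(-1.75, 2.75, size=37)

plain = cbook.boxplot_stats(data)[0]
logged = transformed_boxplot_stats([data], transform_in=np.log,
                                   transform_out=np.exp)[0]
print(len(plain["fliers"]), len(logged["fliers"]))
```

Comparing the two flier counts shows how the choice of scale changes which
points are treated as outliers.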

These transformations can then be added to the call signature of
``Axes.boxplot`` with little impact to that method's complexity. This is
because they can be directly passed to ``cbook.boxplot_stats``.
Alternatively, ``Axes.boxplot`` could be modified to accept an optional
statistical function kwarg and a dictionary of parameters to be directly
passed to it.

At this point in the implementation, users and external libraries like
seaborn would have complete control via the ``Axes.boxplot`` method. More
importantly, at the very least, seaborn would require no changes to its
API to allow users to take advantage of these new options.

Simplifications to the ``Axes.boxplot`` API and other functions
---------------------------------------------------------------

Simplifying the boxplot method consists primarily of deprecating and then
removing the redundant parameters. Optionally, a next step would include
rectifying minor terminological inconsistencies between ``Axes.boxplot``
and ``Axes.bxp``.

The parameters to be deprecated and removed include:

  1. ``usermedians`` - processed by 10 SLOC, 3 ``if`` blocks, a ``for`` loop
  2. ``conf_intervals`` - handled by 15 SLOC, 6 ``if`` blocks, a ``for`` loop
  3. ``sym`` - processed by 12 SLOC, 4 ``if`` blocks

Removing the ``sym`` option allows all code handling the remaining
styling parameters to be moved to ``Axes.bxp``. This doesn't remove
any complexity, but does reinforce the single responsibility principle
among ``Axes.bxp``, ``cbook.boxplot_stats``, and ``Axes.boxplot``.

Additionally, the ``notch`` parameter could be renamed ``shownotches``
to be consistent with ``Axes.bxp``. This kind of cleanup could be taken
a step further and the ``whis``, ``bootstrap``, and ``autorange``
parameters could be rolled into the kwargs passed to the new ``statfxn``
parameter.
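
The naming mismatch is easy to see when the one-step and two-step routes
are placed side by side. The sketch below assumes default settings for the
whiskers and bootstrap, uses arbitrary random data, and selects the ``Agg``
backend only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cbook

rng = np.random.default_rng(0)
data = rng.normal(size=100)

fig, (ax1, ax2) = plt.subplots(ncols=2)

# One step: Axes.boxplot spells the option "notch".
one_step = ax1.boxplot(data, notch=True)

# Two steps: compute the statistics, then draw them. Axes.bxp spells
# the same option "shownotches".
stats = cbook.boxplot_stats(data)
two_step = ax2.bxp(stats, shownotches=True)
```

With default settings both routes should produce the same box geometry,
which is the sense in which ``Axes.boxplot`` acts as a broker between the
other two functions.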

Backward compatibility
======================

Implementation of this MEP would eventually result in the backwards
incompatible deprecation and then removal of the keyword parameters
``usermedians``, ``conf_intervals``, and ``sym``. Cursory searches on
GitHub indicated that ``usermedians`` and ``conf_intervals`` are used by
few users, who all seem to have a very strong knowledge of matplotlib.
A robust deprecation cycle should provide sufficient time for these
users to migrate to a new API.

Deprecation of ``sym``, however, may have a much broader reach into
the matplotlib userbase.
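
One way such a deprecation cycle could surface to users is a warning emitted
whenever one of the three parameters is supplied. The helper below is purely
hypothetical: its name, the message text, and the use of the built-in
``DeprecationWarning`` are assumptions made for this sketch, and it does not
reflect how matplotlib itself issues deprecation warnings.

```python
import warnings

# Parameters nominated for deprecation in this MEP.
DEPRECATED_KWARGS = ("usermedians", "conf_intervals", "sym")


def warn_deprecated_boxplot_kwargs(kwargs):
    """Hypothetical helper: warn once per deprecated kwarg that is set.

    Returns the names that triggered a warning.
    """
    used = [name for name in DEPRECATED_KWARGS
            if kwargs.get(name) is not None]
    for name in used:
        warnings.warn(
            f"The {name!r} parameter is deprecated; compute statistics "
            "with cbook.boxplot_stats and draw them with Axes.bxp instead.",
            DeprecationWarning,
            stacklevel=3,
        )
    return used
```

A call such as ``warn_deprecated_boxplot_kwargs({"sym": "r+"})`` would warn
about ``sym`` only, leaving callers who never pass these parameters
unaffected.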

Schedule
--------
An accelerated timeline could look like the following:

#. v2.0.1 add transforms to ``cbook.boxplot_stats``, expose in ``Axes.boxplot``
#. v2.1.0 initial deprecations

    a. Using 2D numpy arrays as input. The semantics around 2D arrays are generally confusing.
    b. ``usermedians``, ``conf_intervals``, ``sym`` parameters

#. v2.2.0

    a. remove ``usermedians``, ``conf_intervals``, ``sym`` parameters
    b. deprecate ``notch`` in favor of ``shownotches`` to be consistent with
       other parameters and ``Axes.bxp``

#. v2.3.0

    a. remove ``notch`` parameter
    b. move all style and artist toggling logic to ``Axes.bxp`` such that ``Axes.boxplot``
       is little more than a broker between ``Axes.bxp`` and ``cbook.boxplot_stats``
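
The 2D-array ambiguity named in the v2.1.0 step can be seen directly with
``cbook.boxplot_stats``. In this small sketch (the array contents are
arbitrary), the same numbers yield a different number of boxes depending on
whether they are passed as a 2D array or as a list of 1D arrays.

```python
import numpy as np
from matplotlib import cbook

# Three datasets of ten observations each, stored as rows.
rows = np.arange(30, dtype=float).reshape(3, 10)

as_array = cbook.boxplot_stats(rows)        # each COLUMN is a dataset
as_list = cbook.boxplot_stats(list(rows))   # each list element is a dataset
as_transposed = cbook.boxplot_stats(rows.T)  # columns are now the 3 datasets

print(len(as_array), len(as_list), len(as_transposed))  # prints: 10 3 3
```

A user who stores one dataset per row therefore gets ten boxes from the
array form and three from the list form.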


Anticipated Impacts to Users
----------------------------

As described above, deprecating ``usermedians`` and ``conf_intervals``
will likely impact few users. Those who will be impacted are almost
certainly advanced users who will be able to adapt to the change.

Deprecating the ``sym`` option may impact more users, and effort should
be taken to collect community feedback on this.

Anticipated Impacts to Downstream Libraries
-------------------------------------------

The source code (GitHub master as of 2016-10-17) was inspected for
seaborn and python-ggplot to see if these changes would impact their
use. None of the parameters nominated for removal in this MEP are used by
seaborn. The seaborn APIs that use matplotlib's boxplot function allow
users to pass arbitrary ``**kwargs`` through to matplotlib's API. Thus
seaborn users with modern matplotlib installations will be able to take
full advantage of any new features added as a result of this MEP.

Python-ggplot has implemented its own function to draw boxplots. Therefore,
no impact can come to it as a result of implementing this MEP.

Alternatives
============

Variations on the theme
-----------------------

This MEP can be divided into a few loosely coupled components:

#. Allowing pre- and post-computation transformation functions in ``cbook.boxplot_stats``
#. Exposing that transformation in the ``Axes.boxplot`` API
#. Removing redundant statistical options in ``Axes.boxplot``
#. Shifting all styling parameter processing from ``Axes.boxplot`` to ``Axes.bxp``.

With this approach, #2 depends on #1, and #4 depends on #3.

There are two possible approaches to #2. The first and most direct would
be to mirror the new ``transform_in`` and ``transform_out`` parameters of
``cbook.boxplot_stats`` in ``Axes.boxplot`` and pass them directly.

The second approach would be to add ``statfxn`` and ``statfxn_args``
parameters to ``Axes.boxplot``. Under this implementation, the default
value of ``statfxn`` would be ``cbook.boxplot_stats``, but users could
pass their own function. ``transform_in`` and ``transform_out`` would
then be passed as elements of the ``statfxn_args`` parameter.

.. code:: python

   def boxplot_stats(data, ..., transform_in=None, transform_out=None):
       # Default both transforms to the identity (no-op)
       if transform_in is None:
           transform_in = lambda x: x

       if transform_out is None:
           transform_out = lambda x: x

       output = []
       for _d in data:
           d = transform_in(_d)
           stat_dict = do_stats(d)
           # Map every statistic except the label back through transform_out
           for key, value in stat_dict.items():
               if key != 'label':
                   stat_dict[key] = transform_out(value)
           output.append(stat_dict)
       return output


   class Axes(...):
       def boxplot_option1(self, data, ..., transform_in=None, transform_out=None):
           stats = cbook.boxplot_stats(data, ...,
                                       transform_in=transform_in,
                                       transform_out=transform_out)
           return self.bxp(stats, ...)

       def boxplot_option2(self, data, ..., statfxn=None, **statopts):
           if statfxn is None:
               statfxn = cbook.boxplot_stats
           stats = statfxn(data, **statopts)
           return self.bxp(stats, ...)

Both cases would allow users to do the following:

.. code:: python

   fig, ax1 = plt.subplots()
   artists1 = ax1.boxplot_optionX(data, transform_in=np.log,
                                  transform_out=np.exp)


But Option Two lets a user write a completely custom stat function
(e.g., ``my_box_stats``) with fancy BCA confidence intervals and the
whiskers set differently depending on some attribute of the data.

This is available under the current API:

.. code:: python

   fig, ax1 = plt.subplots()
   my_stats = my_box_stats(data, bootstrap_method='BCA',
                           whisker_method='dynamic')
   ax1.bxp(my_stats)

And would be more concise with Option Two:

.. code:: python

   fig, ax = plt.subplots()
   statopts = dict(transform_in=np.log, transform_out=np.exp)
   ax.boxplot(data, ..., **statopts)

Users could also pass their own function to compute the stats:

.. code:: python

   fig, ax1 = plt.subplots()
   ax1.boxplot(data, statfxn=my_box_stats, bootstrap_method='BCA',
               whisker_method='dynamic')
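
For concreteness, a custom stat function in the spirit of ``my_box_stats``
could be as small as the sketch below. It is deliberately simpler than the
BCA example: ``percentile_box_stats`` and its ``whisker_percentiles``
parameter are hypothetical names invented here, and the only requirement
taken from matplotlib is that each dictionary carries the keys that
``Axes.bxp`` reads when drawing with default options. The ``Agg`` backend is
used only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np


def percentile_box_stats(datasets, whisker_percentiles=(5, 95)):
    """Hypothetical stat function: whiskers at fixed percentiles."""
    lo_pct, hi_pct = whisker_percentiles
    output = []
    for values in datasets:
        values = np.asarray(values, dtype=float)
        q1, med, q3 = np.percentile(values, [25, 50, 75])
        whislo, whishi = np.percentile(values, [lo_pct, hi_pct])
        # Everything outside the whiskers is drawn as a flier.
        fliers = values[(values < whislo) | (values > whishi)]
        output.append(dict(med=med, q1=q1, q3=q3, mean=values.mean(),
                           whislo=whislo, whishi=whishi, fliers=fliers))
    return output


rng = np.random.default_rng(0)
datasets = [rng.normal(size=200), rng.lognormal(size=200)]

stats = percentile_box_stats(datasets)
fig, ax = plt.subplots()
artists = ax.bxp(stats)
```

Under Option Two, a function like this is what a user would hand to the
proposed ``statfxn`` parameter.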

From the examples above, Option Two seems to have only marginal benefit,
but in the context of downstream libraries like seaborn, its advantage
is more apparent as the following would be possible without any patches
to seaborn:

.. code:: python

   import seaborn
   tips = seaborn.load_dataset('tips')
   g = seaborn.factorplot(x="day", y="total_bill", hue="sex", data=tips,
                          kind='box', palette="PRGn", shownotches=True,
                          statfxn=my_box_stats, bootstrap_method='BCA',
                          whisker_method='dynamic')

This type of flexibility was the intention behind splitting the overall
boxplot API into the current three functions. In practice, however, downstream
libraries like seaborn support versions of matplotlib dating back well
before the split. Thus, adding just a bit more flexibility to
``Axes.boxplot`` could expose all the functionality to users of the
downstream libraries with modern matplotlib installations without intervention
from the downstream library maintainers.

Doing less
----------

Another obvious alternative would be to omit the added pre- and
post-computation transform functionality in ``cbook.boxplot_stats`` and
``Axes.boxplot``, and simply remove the redundant statistical and style
parameters as described above.

Doing nothing
-------------

As with many things in life, doing nothing is an option here. This means
we simply advocate for users and downstream libraries to take advantage
of the split between ``cbook.boxplot_stats`` and ``Axes.bxp`` and let
them decide how to provide an interface to that.