=============================================
 MEP28: Remove Complexity from Axes.boxplot
=============================================

.. contents::
   :local:


Status
======
**Discussion**

Branches and Pull requests
==========================

The following lists any open PRs or branches related to this MEP:

#. Deprecate redundant statistical kwargs in ``Axes.boxplot``: https://github.com/phobson/matplotlib/tree/MEP28-initial-deprecations
#. Deprecate redundant style options in ``Axes.boxplot``: https://github.com/phobson/matplotlib/tree/MEP28-initial-deprecations
#. Deprecate passing 2D NumPy arrays as input: None
#. Add pre- & post-processing options to ``cbook.boxplot_stats``: https://github.com/phobson/matplotlib/tree/boxplot-stat-transforms
#. Exposing ``cbook.boxplot_stats`` through ``Axes.boxplot`` kwargs: None
#. Remove redundant statistical kwargs in ``Axes.boxplot``: None
#. Remove redundant style options in ``Axes.boxplot``: None
#. Remaining items that arise through discussion: None

Abstract
========

Over the past few releases, the ``Axes.boxplot`` method has grown in
complexity to support fully customizable artist styling and statistical
computation. This led to ``Axes.boxplot`` being split into multiple
parts. The statistics needed to draw a boxplot are computed in
``cbook.boxplot_stats``, while the actual artists are drawn by ``Axes.bxp``.
The original method, ``Axes.boxplot``, remains the most public API. It
handles passing the user-supplied data to ``cbook.boxplot_stats``, feeding
the results to ``Axes.bxp``, and pre-processing style information for
each facet of the boxplots.

This MEP will outline a path forward to roll back the added complexity
and simplify the API while maintaining reasonable backwards
compatibility.

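
The split between computing and drawing can be exercised directly with the
current API. The following is a minimal sketch and not part of the proposal
itself: the sample data and labels are made up for illustration, and the
``Agg`` backend is selected only so that the snippet runs without a display.

.. code:: python

    import matplotlib
    matplotlib.use("Agg")  # assumption: headless backend, for illustration only
    import matplotlib.pyplot as plt
    import numpy as np
    from matplotlib import cbook

    # Made-up sample data: three groups of 50 normally distributed values.
    rng = np.random.default_rng(0)
    data = [rng.normal(loc, 1.0, size=50) for loc in (0.0, 1.0, 2.0)]

    # Step 1: compute the statistics, one dictionary per group.
    # Each dictionary has keys such as 'med', 'q1', 'q3', 'whislo',
    # 'whishi' and 'fliers'.
    stats = cbook.boxplot_stats(data, labels=["a", "b", "c"])

    # Step 2: draw the artists from those dictionaries.
    fig, ax = plt.subplots()
    artists = ax.bxp(stats)

Because ``Axes.bxp`` only consumes the list of dictionaries, anything that
edits or replaces that list changes what is drawn without touching
``Axes.boxplot``.
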

Detailed description
====================

Currently, the ``Axes.boxplot`` method accepts parameters that allow
users to specify medians and confidence intervals for each box that
will be drawn in the plot. These were provided so that advanced users
could provide statistics computed in a different fashion than the simple
method provided by matplotlib. However, handling this input requires
complex logic to make sure that the forms of the data structure match what
needs to be drawn. At the moment, that logic contains 9 separate if/else
statements nested up to 5 levels deep with a for loop, and may raise up to 2 errors.
These parameters were added prior to the creation of the ``Axes.bxp`` method,
which draws boxplots from a list of dictionaries containing the relevant
statistics. Matplotlib also provides a function that computes these
statistics, ``cbook.boxplot_stats``. Advanced users can now
either a) write their own function to compute the stats required by
``Axes.bxp``, or b) modify the output returned by ``cbook.boxplot_stats``
to fully customize the position of the artists of the plots. Given this
flexibility, the parameters to manually specify only the medians and their
confidence intervals remain solely for backwards compatibility.

Around the same time that the two roles of ``Axes.boxplot`` were split into
``cbook.boxplot_stats`` for computation and ``Axes.bxp`` for drawing, both
``Axes.boxplot`` and ``Axes.bxp`` were written to accept parameters that
individually toggle the drawing of all components of the boxplots, and
parameters that individually configure the style of those artists. However,
to maintain backwards compatibility, the ``sym`` parameter (previously used
to specify the symbol of the fliers) was retained.
This parameter itself
requires fairly complex logic to reconcile the ``sym`` parameter with the
newer ``flierprops`` parameter and the default style specified by ``matplotlibrc``.

This MEP seeks to dramatically simplify the creation of boxplots for
novice and advanced users alike. Importantly, the changes proposed here
will also be available to downstream packages like seaborn, as seaborn
smartly allows users to pass arbitrary dictionaries of parameters through
the seaborn API to the underlying matplotlib functions.

This will be achieved in the following way:

  1. ``cbook.boxplot_stats`` will be modified to allow pre- and
     post-computation transformation functions to be passed in (e.g.,
     ``np.log`` and ``np.exp`` for lognormally distributed data).
  2. ``Axes.boxplot`` will be modified to also accept and naïvely pass them
     to ``cbook.boxplot_stats`` (Alt: pass the stat function and a dict
     of its optional parameters).
  3. Outdated parameters from ``Axes.boxplot`` will be deprecated and
     later removed.

Importance
----------

Since the limits of the whiskers are computed arithmetically, there
is an implicit assumption of normality in box and whisker plots.
This primarily affects which data points are classified as outliers.

Allowing transformations to the data and the results used to draw
boxplots will allow users to opt out of that assumption if the
data are known to not fit a normal distribution.

Below is an example of how ``Axes.boxplot`` classifies outliers of lognormal
data differently depending on these types of transforms.

.. plot::
   :include-source: true

   import numpy as np
   import matplotlib.pyplot as plt
   from matplotlib import cbook
   np.random.seed(0)

   fig, ax = plt.subplots(figsize=(4, 6))
   ax.set_yscale('log')
   data = np.random.lognormal(-1.75, 2.75, size=37)

   # Statistics computed on the raw data and on the log-transformed data.
   stats = cbook.boxplot_stats(data, labels=['arithmetic'])
   logstats = cbook.boxplot_stats(np.log(data), labels=['log-transformed'])

   # Back-transform every statistic except the label to the original scale.
   for lsdict in logstats:
       for key, value in lsdict.items():
           if key != 'label':
               lsdict[key] = np.exp(value)

   stats.extend(logstats)
   ax.bxp(stats)
   fig.show()

Implementation
==============

Passing transform functions to ``cbook.boxplot_stats``
-------------------------------------------------------

This MEP proposes that two parameters (e.g., ``transform_in`` and
``transform_out``) be added to the cookbook function that computes the
statistics for the boxplot function. These will be optional keyword-only
arguments and can easily be set to ``lambda x: x`` as a no-op when omitted
by the user. The ``transform_in`` function will be applied to the data
as the ``boxplot_stats`` function loops through each subset of the data
passed to it. After the list of statistics dictionaries is computed, the
``transform_out`` function is applied to each value in the dictionaries.

These transformations can then be added to the call signature of
``Axes.boxplot`` with little impact on that method's complexity, because
they can be passed directly to ``cbook.boxplot_stats``.
Alternatively, ``Axes.boxplot`` could be modified to accept an optional
statistical function kwarg and a dictionary of parameters to be directly
passed to it.

At this point in the implementation, users and external libraries like
seaborn would have complete control via the ``Axes.boxplot`` method.
More importantly, seaborn would at the very least require no changes to its
API to allow users to take advantage of these new options.

Simplifications to the ``Axes.boxplot`` API and other functions
---------------------------------------------------------------

Simplifying the boxplot method consists primarily of deprecating and then
removing the redundant parameters. Optionally, a next step would include
rectifying minor terminological inconsistencies between ``Axes.boxplot``
and ``Axes.bxp``.

The parameters to be deprecated and removed include:

  1. ``usermedians`` - processed by 10 SLOC, 3 ``if`` blocks, a ``for`` loop
  2. ``conf_intervals`` - handled by 15 SLOC, 6 ``if`` blocks, a ``for`` loop
  3. ``sym`` - processed by 12 SLOC, 4 ``if`` blocks

Removing the ``sym`` option allows all code handling the remaining
styling parameters to be moved to ``Axes.bxp``. This doesn't remove
any complexity, but does reinforce the single responsibility principle
among ``Axes.bxp``, ``cbook.boxplot_stats``, and ``Axes.boxplot``.

Additionally, the ``notch`` parameter could be renamed ``shownotches``
to be consistent with ``Axes.bxp``. This kind of cleanup could be taken
a step further, and the ``whis``, ``bootstrap``, and ``autorange``
parameters could be rolled into the kwargs passed to the new ``statfxn``
parameter.

Backward compatibility
======================

Implementation of this MEP would eventually result in the
backwards-incompatible deprecation and then removal of the keyword parameters
``usermedians``, ``conf_intervals``, and ``sym``. Cursory searches on
GitHub indicated that ``usermedians`` and ``conf_intervals`` are used by
few users, who all seem to have a very strong knowledge of matplotlib.
A robust deprecation cycle should provide sufficient time for these
users to migrate to a new API.
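
As a rough illustration of what such a migration might look like with the
existing API, the sketch below overwrites the ``med``, ``cilo`` and ``cihi``
entries returned by ``cbook.boxplot_stats`` before handing them to
``Axes.bxp``. The helper ``my_median_ci`` and the sample data are
hypothetical stand-ins for whatever a user currently passes through
``usermedians`` and ``conf_intervals``. The ``Agg`` backend is chosen only
so that the snippet runs without a display.

.. code:: python

    import matplotlib
    matplotlib.use("Agg")  # assumption: headless backend, for illustration only
    import matplotlib.pyplot as plt
    import numpy as np
    from matplotlib import cbook


    def my_median_ci(x, n_boot=500, seed=0):
        """Hypothetical user routine: median with a percentile-bootstrap CI."""
        rng = np.random.default_rng(seed)
        # Resample with replacement: one row per bootstrap replicate.
        resamples = rng.choice(x, size=(n_boot, len(x)))
        lo, hi = np.percentile(np.median(resamples, axis=1), [2.5, 97.5])
        return float(np.median(x)), float(lo), float(hi)


    # Made-up sample data: two lognormal groups.
    rng = np.random.default_rng(1)
    data = [rng.lognormal(0.0, 0.5, size=40), rng.lognormal(0.3, 0.5, size=40)]

    stats = cbook.boxplot_stats(data, labels=["x", "y"])
    for stat, x in zip(stats, data):
        # Replace the entries that ``usermedians`` and ``conf_intervals``
        # would otherwise have supplied.
        stat["med"], stat["cilo"], stat["cihi"] = my_median_ci(x)

    fig, ax = plt.subplots()
    artists = ax.bxp(stats, shownotches=True)
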

Deprecation of ``sym``, however, may have a much broader reach into
the matplotlib userbase.

Schedule
--------
An accelerated timeline could look like the following:

#. v2.0.1 add transforms to ``cbook.boxplot_stats``, expose in ``Axes.boxplot``
#. v2.1.0 initial deprecations

   a. Using 2D NumPy arrays as input. The semantics around 2D arrays are generally confusing.
   b. ``usermedians``, ``conf_intervals``, ``sym`` parameters

#. v2.2.0

   a. remove ``usermedians``, ``conf_intervals``, ``sym`` parameters
   b. deprecate ``notch`` in favor of ``shownotches`` to be consistent with
      other parameters and ``Axes.bxp``

#. v2.3.0

   a. remove ``notch`` parameter
   b. move all style and artist toggling logic to ``Axes.bxp`` such that ``Axes.boxplot``
      is little more than a broker between ``Axes.bxp`` and ``cbook.boxplot_stats``


Anticipated Impacts to Users
----------------------------

As described above, deprecating ``usermedians`` and ``conf_intervals``
will likely impact few users. Those who will be impacted are almost
certainly advanced users who will be able to adapt to the change.

Deprecating the ``sym`` option may impact more users, and effort should
be taken to collect community feedback on this.

Anticipated Impacts to Downstream Libraries
-------------------------------------------

The source code (GitHub master as of 2016-10-17) was inspected for
seaborn and python-ggplot to see if these changes would impact their
use. None of the parameters nominated for removal in this MEP are used by
seaborn. The seaborn APIs that use matplotlib's boxplot function allow
users to pass arbitrary ``**kwargs`` through to matplotlib's API. Thus
seaborn users with modern matplotlib installations will be able to take
full advantage of any new features added as a result of this MEP.

Python-ggplot has implemented its own function to draw boxplots. Therefore,
implementing this MEP will have no impact on it.

Alternatives
============

Variations on the theme
-----------------------

This MEP can be divided into a few loosely coupled components:

#. Allowing pre- and post-computation transformation functions in ``cbook.boxplot_stats``
#. Exposing that transformation in the ``Axes.boxplot`` API
#. Removing redundant statistical options in ``Axes.boxplot``
#. Shifting all styling parameter processing from ``Axes.boxplot`` to ``Axes.bxp``.

With this approach, #2 depends on #1, and #4 depends on #3.

There are two possible approaches to #2. The first and most direct would
be to mirror the new ``transform_in`` and ``transform_out`` parameters of
``cbook.boxplot_stats`` in ``Axes.boxplot`` and pass them directly.

The second approach would be to add ``statfxn`` and ``statfxn_args``
parameters to ``Axes.boxplot``. Under this implementation, the default
value of ``statfxn`` would be ``cbook.boxplot_stats``, but users could
pass their own function. ``transform_in`` and ``transform_out`` would
then be passed as elements of the ``statfxn_args`` parameter.

.. code:: python

    # Pseudocode: ``...`` stands for the existing parameters and
    # ``do_stats`` for the existing per-dataset computation.
    def boxplot_stats(data, ..., transform_in=None, transform_out=None):
        if transform_in is None:
            transform_in = lambda x: x

        if transform_out is None:
            transform_out = lambda x: x

        output = []
        for _d in data:
            d = transform_in(_d)
            stat_dict = do_stats(d)
            for key, value in stat_dict.items():
                if key != 'label':
                    stat_dict[key] = transform_out(value)
            # Collect the statistics, not the transformed data.
            output.append(stat_dict)
        return output


    class Axes(...):
        def boxplot_option1(self, data, ..., transform_in=None, transform_out=None):
            stats = cbook.boxplot_stats(data, ...,
                                        transform_in=transform_in,
                                        transform_out=transform_out)
            return self.bxp(stats, ...)

        def boxplot_option2(self, data, ..., statfxn=None, **statopts):
            if statfxn is None:
                statfxn = cbook.boxplot_stats
            stats = statfxn(data, **statopts)
            return self.bxp(stats, ...)

Both cases would allow users to do the following:

.. code:: python

    fig, ax1 = plt.subplots()
    artists1 = ax1.boxplot_optionX(data, transform_in=np.log,
                                   transform_out=np.exp)


But Option Two lets a user write a completely custom stat function
(e.g., ``my_box_stats``) with fancy BCA confidence intervals and the
whiskers set differently depending on some attribute of the data.

This is available under the current API:

.. code:: python

    fig, ax1 = plt.subplots()
    my_stats = my_box_stats(data, bootstrap_method='BCA',
                            whisker_method='dynamic')
    ax1.bxp(my_stats)

And would be more concise with Option Two:

.. code:: python

    fig, ax = plt.subplots()
    statopts = dict(transform_in=np.log, transform_out=np.exp)
    ax.boxplot(data, ..., **statopts)

Users could also pass their own function to compute the stats:

.. code:: python

    fig, ax1 = plt.subplots()
    ax1.boxplot(data, statfxn=my_box_stats, bootstrap_method='BCA',
                whisker_method='dynamic')

From the examples above, Option Two seems to have only marginal benefit,
but in the context of downstream libraries like seaborn, its advantage
is more apparent, as the following would be possible without any patches
to seaborn:

.. code:: python

    import seaborn
    tips = seaborn.load_dataset('tips')
    g = seaborn.factorplot(x="day", y="total_bill", hue="sex", data=tips,
                           kind='box', palette="PRGn", shownotches=True,
                           statfxn=my_box_stats, bootstrap_method='BCA',
                           whisker_method='dynamic')

This type of flexibility was the intention behind splitting the overall
boxplot API into the current three functions.
In practice, however, downstream
libraries like seaborn support versions of matplotlib dating back well
before the split. Thus, adding just a bit more flexibility to
``Axes.boxplot`` could expose all the functionality to users of the
downstream libraries with modern matplotlib installations without intervention
from the downstream library maintainers.

Doing less
----------

Another obvious alternative would be to omit the added pre- and
post-computation transform functionality in ``cbook.boxplot_stats`` and
``Axes.boxplot``, and simply remove the redundant statistical and style
parameters as described above.

Doing nothing
-------------

As with many things in life, doing nothing is an option here. This means
we simply advocate for users and downstream libraries to take advantage
of the split between ``cbook.boxplot_stats`` and ``Axes.bxp`` and let
them decide how to provide an interface to that.
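
Under this option, the transform behaviour discussed in this MEP would have
to live in user or downstream code. A minimal sketch of what that might look
like follows. ``transformed_boxplot_stats`` is a hypothetical helper, not a
matplotlib function, and the ``Agg`` backend is selected only so that the
snippet runs without a display.

.. code:: python

    import matplotlib
    matplotlib.use("Agg")  # assumption: headless backend, for illustration only
    import matplotlib.pyplot as plt
    import numpy as np
    from matplotlib import cbook


    def transformed_boxplot_stats(data, transform_in, transform_out, **kwargs):
        """Hypothetical helper: compute stats on transformed data, then
        map every statistic back to the original scale."""
        stats = cbook.boxplot_stats(transform_in(np.asarray(data)), **kwargs)
        for stat in stats:
            for key, value in list(stat.items()):
                if key != "label":
                    # Caveat: the back-transformed 'mean' and 'iqr' are not
                    # the arithmetic mean and IQR of the raw data.
                    stat[key] = transform_out(value)
        return stats


    rng = np.random.default_rng(0)
    data = rng.lognormal(-1.75, 2.75, size=37)
    stats = transformed_boxplot_stats(data, np.log, np.exp,
                                      labels=["log-transformed"])

    fig, ax = plt.subplots()
    ax.set_yscale("log")
    artists = ax.bxp(stats)

The cost of this route is that every user or downstream library repeats the
same few lines, and a downstream wrapper only benefits once its maintainers
add their own interface for it.
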