.. _permutation_importance:

Permutation feature importance
==============================

.. currentmodule:: sklearn.inspection

Permutation feature importance is a model inspection technique that can be used
for any :term:`fitted` :term:`estimator` when the data is tabular. This is
especially useful for non-linear or opaque :term:`estimators`. The permutation
feature importance is defined to be the decrease in a model score when a single
feature value is randomly shuffled [1]_. This procedure breaks the relationship
between the feature and the target, thus the drop in the model score is
indicative of how much the model depends on the feature. This technique
benefits from being model agnostic and can be calculated many times with
different permutations of the feature.

.. warning::

  Features that are deemed of **low importance for a bad model** (low
  cross-validation score) could be **very important for a good model**.
  Therefore it is always important to evaluate the predictive power of a model
  using a held-out set (or better with cross-validation) prior to computing
  importances. Permutation importance does not reflect the intrinsic
  predictive value of a feature by itself but **how important this feature is
  for a particular model**.

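For instance, one can first check that the model has some predictive power at
all, for example with cross-validation, before interpreting its permutation
importances. The snippet below is a minimal sketch of such a check (the
estimator and dataset are illustrative choices, not prescribed here)::

  from sklearn.datasets import load_diabetes
  from sklearn.linear_model import Ridge
  from sklearn.model_selection import cross_val_score

  X, y = load_diabetes(return_X_y=True)
  # If the cross-validated score is close to chance level, the permutation
  # importances of this model are not worth interpreting.
  scores = cross_val_score(Ridge(alpha=1e-2), X, y, cv=5)
  print(scores.mean())
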
The :func:`permutation_importance` function calculates the feature importance
of :term:`estimators` for a given dataset. The ``n_repeats`` parameter sets the
number of times a feature is randomly shuffled; the function returns the
resulting sample of feature importances.

Let's consider the following trained regression model::

  >>> from sklearn.datasets import load_diabetes
  >>> from sklearn.model_selection import train_test_split
  >>> from sklearn.linear_model import Ridge
  >>> diabetes = load_diabetes()
  >>> X_train, X_val, y_train, y_val = train_test_split(
  ...     diabetes.data, diabetes.target, random_state=0)
  ...
  >>> model = Ridge(alpha=1e-2).fit(X_train, y_train)
  >>> model.score(X_val, y_val)
  0.356...

Its validation performance, measured via the :math:`R^2` score, is
significantly larger than the chance level. This makes it possible to use the
:func:`permutation_importance` function to probe which features are most
predictive::

  >>> from sklearn.inspection import permutation_importance
  >>> r = permutation_importance(model, X_val, y_val,
  ...                            n_repeats=30,
  ...                            random_state=0)
  ...
  >>> for i in r.importances_mean.argsort()[::-1]:
  ...     if r.importances_mean[i] - 2 * r.importances_std[i] > 0:
  ...         print(f"{diabetes.feature_names[i]:<8}"
  ...               f"{r.importances_mean[i]:.3f}"
  ...               f" +/- {r.importances_std[i]:.3f}")
  ...
  s5      0.204 +/- 0.050
  bmi     0.176 +/- 0.048
  bp      0.088 +/- 0.033
  sex     0.056 +/- 0.023

Note that the importance values for the top features represent a large
fraction of the reference score of 0.356.

Permutation importances can be computed either on the training set or on a
held-out testing or validation set. Using a held-out set makes it possible to
highlight which features contribute the most to the generalization power of the
inspected model. Features that are important on the training set but not on the
held-out set might cause the model to overfit.

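To see the difference, one can compute importances on both splits and compare
them side by side. The snippet below is a minimal sketch reusing ``model``,
``X_train``, ``X_val`` and the related names from the example above::

  # Importances on the training set reflect everything the model relies on,
  # including patterns that do not generalize; importances on the validation
  # set only reflect what generalizes.
  r_train = permutation_importance(model, X_train, y_train,
                                   n_repeats=30, random_state=0)
  r_val = permutation_importance(model, X_val, y_val,
                                 n_repeats=30, random_state=0)
  for i in r_val.importances_mean.argsort()[::-1]:
      print(f"{diabetes.feature_names[i]:<8}"
            f"train: {r_train.importances_mean[i]:.3f}  "
            f"val: {r_val.importances_mean[i]:.3f}")
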
The permutation feature importance is the decrease in a model score when a single
feature value is randomly shuffled. The score function to be used for the
computation of importances can be specified with the `scoring` argument,
which also accepts multiple scorers. Using multiple scorers is more computationally
efficient than sequentially calling :func:`permutation_importance` several times
with a different scorer, as it reuses model predictions.

An example of using multiple scorers is shown below, employing a list of metrics,
but more input formats are possible, as documented in :ref:`multimetric_scoring`::

  >>> scoring = ['r2', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error']
  >>> r_multi = permutation_importance(
  ...     model, X_val, y_val, n_repeats=30, random_state=0, scoring=scoring)
  ...
  >>> for metric in r_multi:
  ...     print(f"{metric}")
  ...     r = r_multi[metric]
  ...     for i in r.importances_mean.argsort()[::-1]:
  ...         if r.importances_mean[i] - 2 * r.importances_std[i] > 0:
  ...             print(f"    {diabetes.feature_names[i]:<8}"
  ...                   f"{r.importances_mean[i]:.3f}"
  ...                   f" +/- {r.importances_std[i]:.3f}")
  ...
  r2
      s5      0.204 +/- 0.050
      bmi     0.176 +/- 0.048
      bp      0.088 +/- 0.033
      sex     0.056 +/- 0.023
  neg_mean_absolute_percentage_error
      s5      0.081 +/- 0.020
      bmi     0.064 +/- 0.015
      bp      0.029 +/- 0.010
  neg_mean_squared_error
      s5      1013.903 +/- 246.460
      bmi     872.694 +/- 240.296
      bp      438.681 +/- 163.025
      sex     277.382 +/- 115.126

The ranking of the features is approximately the same for different metrics even
if the scales of the importance values are very different. However, this is not
guaranteed and different metrics might lead to significantly different feature
importances, in particular for models trained for imbalanced classification problems,
for which the choice of the classification metric can be critical.

Outline of the permutation importance algorithm
-----------------------------------------------

- Inputs: fitted predictive model :math:`m`, tabular dataset (training or
  validation) :math:`D`.
- Compute the reference score :math:`s` of the model :math:`m` on data
  :math:`D` (for instance the accuracy for a classifier or the :math:`R^2` for
  a regressor).
- For each feature :math:`j` (column of :math:`D`):

  - For each repetition :math:`k` in :math:`\{1, ..., K\}`:

    - Randomly shuffle column :math:`j` of dataset :math:`D` to generate a
      corrupted version of the data named :math:`\tilde{D}_{k,j}`.
    - Compute the score :math:`s_{k,j}` of model :math:`m` on corrupted data
      :math:`\tilde{D}_{k,j}`.

  - Compute importance :math:`i_j` for feature :math:`f_j` defined as:

    .. math:: i_j = s - \frac{1}{K} \sum_{k=1}^{K} s_{k,j}

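The procedure above translates directly into code. The snippet below is a
minimal NumPy sketch of the algorithm (not the actual implementation of
:func:`permutation_importance`, which additionally supports parallelism and
multiple scorers), assuming a fitted estimator ``model`` with a ``score``
method and arrays ``X_val`` and ``y_val`` as in the example above::

  import numpy as np

  def naive_permutation_importance(model, X, y, n_repeats=30, random_state=0):
      rng = np.random.RandomState(random_state)
      X = np.asarray(X)
      reference_score = model.score(X, y)        # s in the notation above
      importances = np.empty((X.shape[1], n_repeats))
      for j in range(X.shape[1]):                # loop over features
          X_permuted = X.copy()
          for k in range(n_repeats):             # loop over repetitions
              X_permuted[:, j] = rng.permutation(X_permuted[:, j])
              s_kj = model.score(X_permuted, y)  # score on corrupted data
              importances[j, k] = reference_score - s_kj
      return importances.mean(axis=1), importances.std(axis=1)

  importances_mean, importances_std = naive_permutation_importance(
      model, X_val, y_val)
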
Relation to impurity-based importance in trees
----------------------------------------------

Tree-based models provide an alternative measure of :ref:`feature importances
based on the mean decrease in impurity <random_forest_feature_importance>`
(MDI). Impurity is quantified by the splitting criterion of the decision trees
(Gini, Entropy or Mean Squared Error). However, this method can give high
importance to features that may not be predictive on unseen data when the model
is overfitting. Permutation-based feature importance, on the other hand, avoids
this issue, since it can be computed on unseen data.

Furthermore, impurity-based feature importances for trees are **strongly
biased** and **favor high cardinality features** (typically numerical features)
over low cardinality features such as binary features or categorical variables
with a small number of possible categories.

Permutation-based feature importances do not exhibit such a bias. Additionally,
the permutation feature importance may be computed with any performance metric
on the model predictions and can be used to analyze any model class (not just
tree-based models).

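As a rough illustration (a synthetic setup, not part of the example linked
below), one can append a purely random numerical feature to a dataset and
compare the two measures: the random feature typically receives a
non-negligible impurity-based importance, while its permutation importance on
held-out data stays close to zero::

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.inspection import permutation_importance
  from sklearn.model_selection import train_test_split

  X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
  rng = np.random.RandomState(0)
  X = np.hstack([X, rng.randn(X.shape[0], 1)])  # append an uninformative feature
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)
  mdi = forest.feature_importances_             # impurity-based, training data only
  perm = permutation_importance(forest, X_test, y_test,
                                n_repeats=10, random_state=0)
  # last column is pure noise: compare its MDI with its permutation importance
  print(mdi[-1], perm.importances_mean[-1])
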
The following example highlights the limitations of impurity-based feature
importance in contrast to permutation-based feature importance:
:ref:`sphx_glr_auto_examples_inspection_plot_permutation_importance.py`.

Misleading values on strongly correlated features
-------------------------------------------------

When two features are correlated and one of the features is permuted, the model
still has access to the permuted feature's information through its correlated
feature. This results in a lower importance value for both features, even
though they might *actually* be important.

One way to handle this is to cluster features that are correlated and only
keep one feature from each cluster. This strategy is explored in the following
example:
:ref:`sphx_glr_auto_examples_inspection_plot_permutation_importance_multicollinear.py`.

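The snippet below is a minimal sketch of this strategy, assuming a feature
matrix ``X``: it performs hierarchical clustering on Spearman rank correlations
and keeps one arbitrary feature per cluster (the distance threshold ``t=1`` is
an illustrative choice)::

  import numpy as np
  from scipy.cluster import hierarchy
  from scipy.spatial.distance import squareform
  from scipy.stats import spearmanr

  corr = spearmanr(X).correlation             # rank correlation between features
  corr = (corr + corr.T) / 2                  # enforce symmetry
  np.fill_diagonal(corr, 1)
  distance_matrix = 1 - np.abs(corr)          # correlated features -> small distance
  linkage = hierarchy.ward(squareform(distance_matrix))
  cluster_ids = hierarchy.fcluster(linkage, t=1, criterion="distance")
  # keep the first feature of each cluster; permutation importances can then be
  # computed on a model refit on X[:, selected]
  selected = [np.where(cluster_ids == c)[0][0] for c in np.unique(cluster_ids)]
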
.. topic:: Examples:

  * :ref:`sphx_glr_auto_examples_inspection_plot_permutation_importance.py`
  * :ref:`sphx_glr_auto_examples_inspection_plot_permutation_importance_multicollinear.py`

.. topic:: References:

   .. [1] L. Breiman, :doi:`"Random Forests" <10.1023/A:1010933404324>`,
      Machine Learning, 45(1), 5-32, 2001.