Getting Started
===============

The purpose of this guide is to illustrate some of the main features that
``scikit-learn`` provides. It assumes a very basic working knowledge of
machine learning practices (model fitting, predicting, cross-validation,
etc.). Please refer to our :ref:`installation instructions
<installation-instructions>` for installing ``scikit-learn``.

``Scikit-learn`` is an open source machine learning library that supports
supervised and unsupervised learning. It also provides various tools for
model fitting, data preprocessing, model selection, model evaluation,
and many other utilities.

Fitting and predicting: estimator basics
----------------------------------------

``Scikit-learn`` provides dozens of built-in machine learning algorithms and
models, called :term:`estimators`. Each estimator can be fitted to some data
using its :term:`fit` method.

Here is a simple example where we fit a
:class:`~sklearn.ensemble.RandomForestClassifier` to some very basic data::

  >>> from sklearn.ensemble import RandomForestClassifier
  >>> clf = RandomForestClassifier(random_state=0)
  >>> X = [[ 1,  2,  3],  # 2 samples, 3 features
  ...      [11, 12, 13]]
  >>> y = [0, 1]  # classes of each sample
  >>> clf.fit(X, y)
  RandomForestClassifier(random_state=0)

The :term:`fit` method generally accepts 2 inputs:

- The samples matrix (or design matrix) :term:`X`. The size of ``X``
  is typically ``(n_samples, n_features)``, which means that samples are
  represented as rows and features are represented as columns.
- The target values :term:`y`, which are real numbers for regression tasks, or
  integers for classification (or any other discrete set of values). For
  unsupervised learning tasks, ``y`` does not need to be specified. ``y`` is
  usually a 1d array where the ``i`` th entry corresponds to the target of the
  ``i`` th sample (row) of ``X``.

Both ``X`` and ``y`` are usually expected to be numpy arrays or equivalent
:term:`array-like` data types, though some estimators work with other
formats such as sparse matrices.
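
For instance, the toy data above can equivalently be passed as numpy arrays,
and some estimators (including :class:`~sklearn.ensemble.RandomForestClassifier`)
also accept a SciPy sparse matrix for ``X``. The following snippet is only a
sketch of these input formats, reusing the toy data from the previous example::

  >>> import numpy as np
  >>> from scipy.sparse import csr_matrix
  >>> from sklearn.ensemble import RandomForestClassifier
  >>> X = np.array([[ 1,  2,  3],   # shape (n_samples, n_features) = (2, 3)
  ...               [11, 12, 13]])
  >>> y = np.array([0, 1])          # one target value per sample
  >>> X_sparse = csr_matrix(X)      # sparse representation of the same data
  >>> RandomForestClassifier(random_state=0).fit(X_sparse, y)
  RandomForestClassifier(random_state=0)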

Once the estimator is fitted, it can be used for predicting target values of
new data. You don't need to re-train the estimator::

  >>> clf.predict(X)  # predict classes of the training data
  array([0, 1])
  >>> clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data
  array([0, 1])

Transformers and pre-processors
-------------------------------

Machine learning workflows are often composed of different parts. A typical
pipeline consists of a pre-processing step that transforms or imputes the
data, and a final predictor that predicts target values.

In ``scikit-learn``, pre-processors and transformers follow the same API as
the estimator objects (they actually all inherit from the same
``BaseEstimator`` class). The transformer objects don't have a
:term:`predict` method but rather a :term:`transform` method that outputs a
newly transformed sample matrix ``X``::

  >>> from sklearn.preprocessing import StandardScaler
  >>> X = [[0, 15],
  ...      [1, -10]]
  >>> # scale data according to computed scaling values
  >>> StandardScaler().fit(X).transform(X)
  array([[-1.,  1.],
         [ 1., -1.]])

Sometimes, you want to apply different transformations to different features:
the :ref:`ColumnTransformer <column_transformer>` is designed for these
use-cases.
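
As a small, purely illustrative sketch (the column names, values and choice of
transformers below are made up for this example), a
:class:`~sklearn.compose.ColumnTransformer` applies one transformer per group
of columns and concatenates the results::

  >>> import pandas as pd
  >>> from sklearn.compose import ColumnTransformer
  >>> from sklearn.preprocessing import OneHotEncoder, StandardScaler
  >>> X = pd.DataFrame({'age': [10., 20., 30.],
  ...                   'city': ['Paris', 'London', 'Paris']})
  >>> ct = ColumnTransformer(
  ...     [('scaled', StandardScaler(), ['age']),    # scale the numerical column
  ...      ('onehot', OneHotEncoder(), ['city'])])   # encode the categorical column
  >>> ct.fit_transform(X).shape  # 1 scaled column + 2 one-hot columns
  (3, 3)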

Pipelines: chaining pre-processors and estimators
--------------------------------------------------

Transformers and estimators (predictors) can be combined together into a
single unifying object: a :class:`~sklearn.pipeline.Pipeline`. The pipeline
offers the same API as a regular estimator: it can be fitted and used for
prediction with ``fit`` and ``predict``. As we will see later, using a
pipeline will also protect you from data leakage, i.e. disclosing some
testing data in your training data.

In the following example, we :ref:`load the Iris dataset <datasets>`, split it
into train and test sets, and compute the accuracy score of a pipeline on
the test data::

  >>> from sklearn.preprocessing import StandardScaler
  >>> from sklearn.linear_model import LogisticRegression
  >>> from sklearn.pipeline import make_pipeline
  >>> from sklearn.datasets import load_iris
  >>> from sklearn.model_selection import train_test_split
  >>> from sklearn.metrics import accuracy_score
  ...
  >>> # create a pipeline object
  >>> pipe = make_pipeline(
  ...     StandardScaler(),
  ...     LogisticRegression()
  ... )
  ...
  >>> # load the iris dataset and split it into train and test sets
  >>> X, y = load_iris(return_X_y=True)
  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
  ...
  >>> # fit the whole pipeline
  >>> pipe.fit(X_train, y_train)
  Pipeline(steps=[('standardscaler', StandardScaler()),
                  ('logisticregression', LogisticRegression())])
  >>> # we can now use it like any other estimator
  >>> accuracy_score(pipe.predict(X_test), y_test)
  0.97...

Model evaluation
----------------

Fitting a model to some data does not entail that it will predict well on
unseen data. This needs to be directly evaluated. We have just seen the
:func:`~sklearn.model_selection.train_test_split` helper that splits a
dataset into train and test sets, but ``scikit-learn`` provides many other
tools for model evaluation, in particular for :ref:`cross-validation
<cross_validation>`.

Here, we briefly show how to perform a 5-fold cross-validation procedure,
using the :func:`~sklearn.model_selection.cross_validate` helper. Note that
it is also possible to manually iterate over the folds, use different
data splitting strategies, and use custom scoring functions. Please refer to
our :ref:`User Guide <cross_validation>` for more details::

  >>> from sklearn.datasets import make_regression
  >>> from sklearn.linear_model import LinearRegression
  >>> from sklearn.model_selection import cross_validate
  ...
  >>> X, y = make_regression(n_samples=1000, random_state=0)
  >>> lr = LinearRegression()
  ...
  >>> result = cross_validate(lr, X, y)  # defaults to 5-fold CV
  >>> result['test_score']  # r_squared score is high because dataset is easy
  array([1., 1., 1., 1., 1.])
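
As a minimal sketch of those options (the 3-fold
:class:`~sklearn.model_selection.KFold` splitter and the
``neg_mean_absolute_error`` metric below are arbitrary choices for
illustration), the splitting strategy and the scoring function can both be
passed explicitly::

  >>> from sklearn.datasets import make_regression
  >>> from sklearn.linear_model import LinearRegression
  >>> from sklearn.model_selection import KFold, cross_validate
  ...
  >>> X, y = make_regression(n_samples=1000, random_state=0)
  >>> cv = KFold(n_splits=3, shuffle=True, random_state=0)  # explicit splitting strategy
  >>> result = cross_validate(LinearRegression(), X, y,
  ...                         cv=cv, scoring='neg_mean_absolute_error')
  >>> result['test_score'].shape  # one score per fold
  (3,)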

Automatic parameter searches
----------------------------

All estimators have parameters (often called hyper-parameters in the
literature) that can be tuned. The generalization power of an estimator
often critically depends on a few parameters. For example, a
:class:`~sklearn.ensemble.RandomForestRegressor` has an ``n_estimators``
parameter that determines the number of trees in the forest, and a
``max_depth`` parameter that determines the maximum depth of each tree.
Quite often, it is not clear what the exact values of these parameters
should be since they depend on the data at hand.
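
These parameters can always be set by hand when the estimator is created; the
values below are arbitrary and only meant to illustrate the syntax::

  >>> from sklearn.ensemble import RandomForestRegressor
  >>> # explicitly set two hyper-parameters instead of keeping the defaults
  >>> reg = RandomForestRegressor(n_estimators=50, max_depth=5)
  >>> reg.get_params()['max_depth']
  5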

``Scikit-learn`` provides tools to automatically find the best parameter
combinations (via cross-validation). In the following example, we randomly
search over the parameter space of a random forest with a
:class:`~sklearn.model_selection.RandomizedSearchCV` object. When the search
is over, the :class:`~sklearn.model_selection.RandomizedSearchCV` behaves as
a :class:`~sklearn.ensemble.RandomForestRegressor` that has been fitted with
the best set of parameters. Read more in the :ref:`User Guide
<grid_search>`::

  >>> from sklearn.datasets import fetch_california_housing
  >>> from sklearn.ensemble import RandomForestRegressor
  >>> from sklearn.model_selection import RandomizedSearchCV
  >>> from sklearn.model_selection import train_test_split
  >>> from scipy.stats import randint
  ...
  >>> X, y = fetch_california_housing(return_X_y=True)
  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
  ...
  >>> # define the parameter space that will be searched over
  >>> param_distributions = {'n_estimators': randint(1, 5),
  ...                        'max_depth': randint(5, 10)}
  ...
  >>> # now create a searchCV object and fit it to the data
  >>> search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
  ...                             n_iter=5,
  ...                             param_distributions=param_distributions,
  ...                             random_state=0)
  >>> search.fit(X_train, y_train)
  RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
                     param_distributions={'max_depth': ...,
                                          'n_estimators': ...},
                     random_state=0)
  >>> search.best_params_
  {'max_depth': 9, 'n_estimators': 4}

  >>> # the search object now acts like a normal random forest estimator
  >>> # with max_depth=9 and n_estimators=4
  >>> search.score(X_test, y_test)
  0.73...
.. note::

    In practice, you almost always want to :ref:`search over a pipeline
    <composite_grid_search>`, instead of a single estimator. One of the main
    reasons is that if you apply a pre-processing step to the whole dataset
    without using a pipeline, and then perform any kind of cross-validation,
    you would be breaking the fundamental assumption of independence between
    training and testing data. Indeed, since you pre-processed the data
    using the whole dataset, some information about the test sets is
    available to the train sets. This will lead to over-estimating the
    generalization power of the estimator (you can read more in this `Kaggle
    post <https://www.kaggle.com/alexisbcook/data-leakage>`_).

    Using a pipeline for cross-validation and searching will largely keep
    you from this common pitfall.
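
As a rough sketch of what such a search looks like (reusing the iris pipeline
from above; the grid of ``C`` values is arbitrary), the parameters of the
pipeline steps are addressed with the ``<step_name>__<parameter_name>``
syntax::

  >>> from sklearn.datasets import load_iris
  >>> from sklearn.linear_model import LogisticRegression
  >>> from sklearn.model_selection import GridSearchCV
  >>> from sklearn.pipeline import make_pipeline
  >>> from sklearn.preprocessing import StandardScaler
  ...
  >>> X, y = load_iris(return_X_y=True)
  >>> pipe = make_pipeline(StandardScaler(), LogisticRegression())
  ...
  >>> # both the scaler and the classifier are re-fitted on each training fold
  >>> param_grid = {'logisticregression__C': [0.1, 1.0, 10.0]}
  >>> search = GridSearchCV(pipe, param_grid).fit(X, y)
  >>> list(search.best_params_)
  ['logisticregression__C']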


Next steps
----------

We have briefly covered estimator fitting and predicting, pre-processing
steps, pipelines, cross-validation tools and automatic hyper-parameter
searches. This guide should give you an overview of some of the main
features of the library, but there is much more to ``scikit-learn``!

Please refer to our :ref:`user_guide` for details on all the tools that we
provide. You can also find an exhaustive list of the public API in the
:ref:`api_ref`.

You can also look at our numerous :ref:`examples <general_examples>` that
illustrate the use of ``scikit-learn`` in many different contexts.

The :ref:`tutorials <tutorial_menu>` also contain additional learning
resources.