Getting Started
===============

The purpose of this guide is to illustrate some of the main features that
``scikit-learn`` provides. It assumes a very basic working knowledge of
machine learning practices (model fitting, predicting, cross-validation,
etc.). Please refer to our :ref:`installation instructions
<installation-instructions>` for installing ``scikit-learn``.

``Scikit-learn`` is an open source machine learning library that supports
supervised and unsupervised learning. It also provides various tools for
model fitting, data preprocessing, model selection, model evaluation,
and many other utilities.

Fitting and predicting: estimator basics
----------------------------------------

``Scikit-learn`` provides dozens of built-in machine learning algorithms and
models, called :term:`estimators`. Each estimator can be fitted to some data
using its :term:`fit` method.

Here is a simple example where we fit a
:class:`~sklearn.ensemble.RandomForestClassifier` to some very basic data::

  >>> from sklearn.ensemble import RandomForestClassifier
  >>> clf = RandomForestClassifier(random_state=0)
  >>> X = [[ 1,  2,  3],  # 2 samples, 3 features
  ...      [11, 12, 13]]
  >>> y = [0, 1]  # classes of each sample
  >>> clf.fit(X, y)
  RandomForestClassifier(random_state=0)

The :term:`fit` method generally accepts two inputs:

- The samples matrix (or design matrix) :term:`X`. The size of ``X``
  is typically ``(n_samples, n_features)``, which means that samples are
  represented as rows and features are represented as columns.
- The target values :term:`y`, which are real numbers for regression tasks, or
  integers for classification (or any other discrete set of values). For
  unsupervised learning tasks, ``y`` does not need to be specified. ``y`` is
  usually a 1d array where the ``i``-th entry corresponds to the target of the
  ``i``-th sample (row) of ``X``.

Both ``X`` and ``y`` are usually expected to be numpy arrays or equivalent
:term:`array-like` data types, though some estimators work with other
formats such as sparse matrices.

Once the estimator is fitted, it can be used for predicting target values of
new data. You don't need to re-train the estimator::

  >>> clf.predict(X)  # predict classes of the training data
  array([0, 1])
  >>> clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data
  array([0, 1])

Transformers and pre-processors
-------------------------------

Machine learning workflows are often composed of different parts. A typical
pipeline consists of a pre-processing step that transforms or imputes the
data, and a final predictor that predicts target values.

In ``scikit-learn``, pre-processors and transformers follow the same API as
the estimator objects (they actually all inherit from the same
``BaseEstimator`` class). The transformer objects don't have a
:term:`predict` method but rather a :term:`transform` method that outputs a
newly transformed sample matrix ``X``::

  >>> from sklearn.preprocessing import StandardScaler
  >>> X = [[0, 15],
  ...      [1, -10]]
  >>> # scale data according to computed scaling values
  >>> StandardScaler().fit(X).transform(X)
  array([[-1.,  1.],
         [ 1., -1.]])

Sometimes, you want to apply different transformations to different features:
the :ref:`ColumnTransformer<column_transformer>` is designed for these
use cases, as the sketch below illustrates.
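For instance, here is a minimal sketch, with made-up toy data, of a
:class:`~sklearn.compose.ColumnTransformer` that standardizes a numeric
column while one-hot encoding a categorical one::

  import numpy as np
  from sklearn.compose import ColumnTransformer
  from sklearn.preprocessing import OneHotEncoder, StandardScaler

  # hypothetical data: column 0 is numeric, column 1 is categorical
  X = np.array([[10.0, "red"],
                [20.0, "blue"],
                [30.0, "red"]], dtype=object)

  ct = ColumnTransformer(
      [("num", StandardScaler(), [0]),   # standardize the numeric column
       ("cat", OneHotEncoder(), [1])])   # one-hot encode the categorical column
  ct.fit_transform(X)  # each transformer only sees its own columns

Each sub-transformer is fitted only on the columns assigned to it, and the
transformed columns are concatenated into a single output matrix.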
Pipelines: chaining pre-processors and estimators
--------------------------------------------------

Transformers and estimators (predictors) can be combined together into a
single unifying object: a :class:`~sklearn.pipeline.Pipeline`. The pipeline
offers the same API as a regular estimator: it can be fitted and used for
prediction with ``fit`` and ``predict``. As we will see later, using a
pipeline will also help you avoid data leakage, i.e. disclosing some
testing data in your training data.

In the following example, we :ref:`load the Iris dataset <datasets>`, split it
into train and test sets, and compute the accuracy score of a pipeline on
the test data::

  >>> from sklearn.preprocessing import StandardScaler
  >>> from sklearn.linear_model import LogisticRegression
  >>> from sklearn.pipeline import make_pipeline
  >>> from sklearn.datasets import load_iris
  >>> from sklearn.model_selection import train_test_split
  >>> from sklearn.metrics import accuracy_score
  ...
  >>> # create a pipeline object
  >>> pipe = make_pipeline(
  ...     StandardScaler(),
  ...     LogisticRegression()
  ... )
  ...
  >>> # load the iris dataset and split it into train and test sets
  >>> X, y = load_iris(return_X_y=True)
  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
  ...
  >>> # fit the whole pipeline
  >>> pipe.fit(X_train, y_train)
  Pipeline(steps=[('standardscaler', StandardScaler()),
                  ('logisticregression', LogisticRegression())])
  >>> # we can now use it like any other estimator
  >>> accuracy_score(y_test, pipe.predict(X_test))
  0.97...

Model evaluation
----------------

Fitting a model to some data does not guarantee that it will predict well on
unseen data. This needs to be directly evaluated. We have just seen the
:func:`~sklearn.model_selection.train_test_split` helper that splits a
dataset into train and test sets, but ``scikit-learn`` provides many other
tools for model evaluation, in particular for :ref:`cross-validation
<cross_validation>`.

Here, we briefly show how to perform a 5-fold cross-validation procedure,
using the :func:`~sklearn.model_selection.cross_validate` helper. Note that
it is also possible to manually iterate over the folds, use different
data splitting strategies, and use custom scoring functions (a brief sketch
follows the example below). Please refer to our :ref:`User Guide
<cross_validation>` for more details::

  >>> from sklearn.datasets import make_regression
  >>> from sklearn.linear_model import LinearRegression
  >>> from sklearn.model_selection import cross_validate
  ...
  >>> X, y = make_regression(n_samples=1000, random_state=0)
  >>> lr = LinearRegression()
  ...
  >>> result = cross_validate(lr, X, y)  # defaults to 5-fold CV
  >>> result['test_score']  # r_squared score is high because dataset is easy
  array([1., 1., 1., 1., 1.])
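For instance, here is a minimal sketch, reusing ``lr``, ``X`` and ``y`` from
above, that passes an explicit :class:`~sklearn.model_selection.ShuffleSplit`
splitting strategy and a custom scoring function to ``cross_validate`` (the
``max_absolute_error`` metric here is made up for illustration)::

  from sklearn.metrics import make_scorer
  from sklearn.model_selection import ShuffleSplit

  # 3 random train/test splits instead of the default 5 consecutive folds
  cv = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

  # wrap a custom metric into a scorer usable by cross_validate
  def max_absolute_error(y_true, y_pred):
      return abs(y_true - y_pred).max()

  scorer = make_scorer(max_absolute_error, greater_is_better=False)
  result = cross_validate(lr, X, y, cv=cv, scoring=scorer)
  result["test_score"]  # one (negated) score per split

``cross_validate`` accepts any scorer created this way, as well as the string
names of the built-in metrics.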
Automatic parameter searches
----------------------------

All estimators have parameters (often called hyper-parameters in the
literature) that can be tuned. The generalization power of an estimator
often critically depends on a few parameters. For example, a
:class:`~sklearn.ensemble.RandomForestRegressor` has an ``n_estimators``
parameter that determines the number of trees in the forest, and a
``max_depth`` parameter that determines the maximum depth of each tree.
Quite often, it is not clear what the exact values of these parameters
should be, since they depend on the data at hand.

``Scikit-learn`` provides tools to automatically find the best parameter
combinations (via cross-validation). In the following example, we randomly
search over the parameter space of a random forest with a
:class:`~sklearn.model_selection.RandomizedSearchCV` object. When the search
is over, the :class:`~sklearn.model_selection.RandomizedSearchCV` behaves as
a :class:`~sklearn.ensemble.RandomForestRegressor` that has been fitted with
the best set of parameters. Read more in the :ref:`User Guide
<grid_search>`::

  >>> from sklearn.datasets import fetch_california_housing
  >>> from sklearn.ensemble import RandomForestRegressor
  >>> from sklearn.model_selection import RandomizedSearchCV
  >>> from sklearn.model_selection import train_test_split
  >>> from scipy.stats import randint
  ...
  >>> X, y = fetch_california_housing(return_X_y=True)
  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
  ...
  >>> # define the parameter space that will be searched over
  >>> param_distributions = {'n_estimators': randint(1, 5),
  ...                        'max_depth': randint(5, 10)}
  ...
  >>> # now create a searchCV object and fit it to the data
  >>> search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
  ...                             n_iter=5,
  ...                             param_distributions=param_distributions,
  ...                             random_state=0)
  >>> search.fit(X_train, y_train)
  RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
                     param_distributions={'max_depth': ...,
                                          'n_estimators': ...},
                     random_state=0)
  >>> search.best_params_
  {'max_depth': 9, 'n_estimators': 4}

  >>> # the search object now acts like a normal random forest estimator
  >>> # with max_depth=9 and n_estimators=4
  >>> search.score(X_test, y_test)
  0.73...

.. note::

    In practice, you almost always want to :ref:`search over a pipeline
    <composite_grid_search>`, instead of a single estimator. One of the main
    reasons is that if you apply a pre-processing step to the whole dataset
    without using a pipeline, and then perform any kind of cross-validation,
    you would be breaking the fundamental assumption of independence between
    training and testing data. Indeed, since you pre-processed the data
    using the whole dataset, some information about the test sets is
    available to the train sets. This will lead to over-estimating the
    generalization power of the estimator (you can read more in this `Kaggle
    post <https://www.kaggle.com/alexisbcook/data-leakage>`_).

    Using a pipeline for cross-validation and searching will largely protect
    you from this common pitfall, as the sketch below illustrates.
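For instance, here is a minimal sketch, reusing ``X_train`` and ``y_train``
from the example above, that searches over a pipeline instead of a bare
estimator; the parameters of a pipeline step are addressed with the
``<step name>__<parameter name>`` syntax::

  from scipy.stats import randint
  from sklearn.ensemble import RandomForestRegressor
  from sklearn.model_selection import RandomizedSearchCV
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler

  # the scaler step is only there to make the pipeline non-trivial;
  # tree-based models do not actually need feature scaling
  pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=0))

  # parameters of a pipeline step are addressed as <step name>__<parameter name>
  param_distributions = {
      "randomforestregressor__n_estimators": randint(1, 5),
      "randomforestregressor__max_depth": randint(5, 10),
  }
  search = RandomizedSearchCV(pipe, param_distributions, n_iter=5, random_state=0)
  search.fit(X_train, y_train)

Because the pre-processing step is part of the searched pipeline, it is
re-fitted on the training portion of each cross-validation split, so no
information from the validation folds leaks into pre-processing.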
Next steps
----------

We have briefly covered estimator fitting and predicting, pre-processing
steps, pipelines, cross-validation tools and automatic hyper-parameter
searches. This guide should give you an overview of some of the main
features of the library, but there is much more to ``scikit-learn``!

Please refer to our :ref:`user_guide` for details on all the tools that we
provide. You can also find an exhaustive list of the public API in the
:ref:`api_ref`.

Our numerous :ref:`examples <general_examples>` illustrate the use of
``scikit-learn`` in many different contexts.

The :ref:`tutorials <tutorial_menu>` also contain additional learning
resources.