.. _faq:

===========================
Frequently Asked Questions
===========================

.. currentmodule:: sklearn

Here we try to give some answers to questions that regularly pop up on the mailing list.

What is the project name (a lot of people get it wrong)?
--------------------------------------------------------
scikit-learn, but not scikit or SciKit nor sci-kit learn.
Also not scikits.learn or scikits-learn, which were previously used.

How do you pronounce the project name?
--------------------------------------
sy-kit learn. sci stands for science!

Why scikit?
-----------
There are multiple scikits, which are scientific toolboxes built around SciPy.
Apart from scikit-learn, another popular one is `scikit-image <https://scikit-image.org/>`_.

How can I contribute to scikit-learn?
-------------------------------------
See :ref:`contributing`. Before adding a new algorithm, which is
usually a major and lengthy undertaking, it is recommended to start with
:ref:`known issues <new_contributors>`. Please do not contact the contributors
of scikit-learn directly regarding contributing to scikit-learn.

What's the best way to get help on scikit-learn usage?
------------------------------------------------------
**For general machine learning questions**, please use
`Cross Validated <https://stats.stackexchange.com/>`_ with the ``[machine-learning]`` tag.

**For scikit-learn usage questions**, please use `Stack Overflow <https://stackoverflow.com/questions/tagged/scikit-learn>`_
with the ``[scikit-learn]`` and ``[python]`` tags. You can alternatively use the `mailing list
<https://mail.python.org/mailman/listinfo/scikit-learn>`_.
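When asking for help in any of these venues, a short, self-contained snippet goes a long way. As a purely hypothetical illustration (the estimator and data below are placeholders, not taken from a real question), a reproduction snippet might look like::

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Tiny toy dataset with a fixed seed so the behavior is reproducible
    rng = np.random.RandomState(42)
    X = rng.randn(20, 3)
    y = rng.randint(0, 2, size=20)

    clf = LogisticRegression().fit(X, y)
    print(clf.predict(X[:5]))

All imports are included, the data is generated rather than loaded from disk, and nothing unrelated to the question remains.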
Please make sure to include a minimal reproduction code snippet (ideally shorter
than 10 lines) that highlights your problem on a toy dataset (for instance from
``sklearn.datasets`` or randomly generated with functions of ``numpy.random`` with
a fixed random seed). Please remove any line of code that is not necessary to
reproduce your problem.

The problem should be reproducible by simply copy-pasting your code snippet into a Python
shell with scikit-learn installed. Do not forget to include the import statements.

More guidance on writing good reproduction code snippets can be found at:

https://stackoverflow.com/help/mcve

If your problem raises an exception that you do not understand (even after googling it),
please make sure to include the full traceback that you obtain when running the
reproduction script.

For bug reports or feature requests, please make use of the
`issue tracker on GitHub <https://github.com/scikit-learn/scikit-learn/issues>`_.

There is also a `scikit-learn Gitter channel
<https://gitter.im/scikit-learn/scikit-learn>`_ where some users and developers
might be found.

**Please do not email any authors directly to ask for assistance, report bugs,
or for any other issue related to scikit-learn.**

How should I save, export or deploy estimators for production?
--------------------------------------------------------------

See :ref:`model_persistence`.

How can I create a bunch object?
--------------------------------

Bunch objects are sometimes used as an output for functions and methods. They
extend dictionaries by enabling values to be accessed by key,
``bunch["value_key"]``, or by an attribute, ``bunch.value_key``.

They should not be used as an input; therefore you almost never need to create
a ``Bunch`` object, unless you are extending scikit-learn's API.

How can I load my own datasets into a format usable by scikit-learn?
84-------------------------------------------------------------------- 85 86Generally, scikit-learn works on any numeric data stored as numpy arrays 87or scipy sparse matrices. Other types that are convertible to numeric 88arrays such as pandas DataFrame are also acceptable. 89 90For more information on loading your data files into these usable data 91structures, please refer to :ref:`loading external datasets <external_datasets>`. 92 93.. _new_algorithms_inclusion_criteria: 94 95What are the inclusion criteria for new algorithms ? 96---------------------------------------------------- 97 98We only consider well-established algorithms for inclusion. A rule of thumb is 99at least 3 years since publication, 200+ citations, and wide use and 100usefulness. A technique that provides a clear-cut improvement (e.g. an 101enhanced data structure or a more efficient approximation technique) on 102a widely-used method will also be considered for inclusion. 103 104From the algorithms or techniques that meet the above criteria, only those 105which fit well within the current API of scikit-learn, that is a ``fit``, 106``predict/transform`` interface and ordinarily having input/output that is a 107numpy array or sparse matrix, are accepted. 108 109The contributor should support the importance of the proposed addition with 110research papers and/or implementations in other similar packages, demonstrate 111its usefulness via common use-cases/applications and corroborate performance 112improvements, if any, with benchmarks and/or plots. It is expected that the 113proposed algorithm should outperform the methods that are already implemented 114in scikit-learn at least in some areas. 
Inclusion of a new algorithm speeding up an existing model is easier if:

- it does not introduce new hyper-parameters (as it makes the library
  more future-proof),
- it is easy to document clearly when the contribution improves the speed
  and when it does not, for instance "when n_features >> n_samples",
- benchmarks clearly show a speed up.

Also, note that your implementation need not be in scikit-learn to be used
together with scikit-learn tools. You can implement your favorite algorithm
in a scikit-learn compatible way, upload it to GitHub and let us know. We
will be happy to list it under :ref:`related_projects`. If you already have
a package on GitHub following the scikit-learn API, you may also be
interested to look at `scikit-learn-contrib
<https://scikit-learn-contrib.github.io>`_.

.. _selectiveness:

Why are you so selective on what algorithms you include in scikit-learn?
------------------------------------------------------------------------
Code comes with maintenance cost, and we need to balance the amount of
code we have with the size of the team (and add to this the fact that
complexity scales non-linearly with the number of features).
The package relies on core developers using their free time to
fix bugs, maintain code and review contributions.
Any algorithm that is added needs future attention by the developers,
at which point the original author might long have lost interest.
See also :ref:`new_algorithms_inclusion_criteria`. For a great read about
long-term maintenance issues in open-source software, look at
`the Executive Summary of Roads and Bridges
<https://www.fordfoundation.org/media/2976/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure.pdf#page=8>`_.

Why did you remove HMMs from scikit-learn?
------------------------------------------
See :ref:`adding_graphical_models`.
.. _adding_graphical_models:

Will you add graphical models or sequence prediction to scikit-learn?
---------------------------------------------------------------------

Not in the foreseeable future.
scikit-learn tries to provide a unified API for the basic tasks in machine
learning, with pipelines and meta-algorithms like grid search to tie
everything together. The concepts, APIs, algorithms and
expertise required for structured learning are different from what
scikit-learn has to offer. If we started doing arbitrary structured
learning, we'd need to redesign the whole package and the project
would likely collapse under its own weight.

There are two projects with APIs similar to scikit-learn that
do structured prediction:

* `pystruct <https://pystruct.github.io/>`_ handles general structured
  learning (focuses on SSVMs on arbitrary graph structures with
  approximate inference; defines the notion of sample as an instance of
  the graph structure)

* `seqlearn <https://larsmans.github.io/seqlearn/>`_ handles sequences only
  (focuses on exact inference; has HMMs, but mostly for the sake of
  completeness; treats a feature vector as a sample and uses an offset encoding
  for the dependencies between feature vectors)

Will you add GPU support?
-------------------------

No, or at least not in the near future. The main reason is that GPU support
would introduce many software dependencies and platform-specific
issues. scikit-learn is designed to be easy to install on a wide variety of
platforms. Outside of neural networks, GPUs don't play a large role in machine
learning today, and much larger gains in speed can often be achieved by a
careful choice of algorithms.

Do you support PyPy?
--------------------

In case you didn't know, `PyPy <https://pypy.org/>`_ is an alternative
Python implementation with a built-in just-in-time compiler.
Experimental support for PyPy3-v5.10+ has been added; it requires
Numpy 1.14.0+ and scipy 1.1.0+.

How do I deal with string data (or trees, graphs...)?
-----------------------------------------------------

scikit-learn estimators assume you'll feed them real-valued feature vectors.
This assumption is hard-coded in pretty much all of the library.
However, you can feed non-numerical inputs to estimators in several ways.

If you have text documents, you can use term frequency features; see
:ref:`text_feature_extraction` for the built-in *text vectorizers*.
For more general feature extraction from any kind of data, see
:ref:`dict_feature_extraction` and :ref:`feature_hashing`.

Another common case is when you have non-numerical data and a custom distance
(or similarity) metric on these data. Examples include strings with edit
distance (aka. Levenshtein distance; e.g., DNA or RNA sequences). These can be
encoded as numbers, but doing so is painful and error-prone. Working with
distance metrics on arbitrary data can be done in two ways.

Firstly, many estimators take precomputed distance/similarity matrices, so if
the dataset is not too large, you can compute distances for all pairs of inputs.
If the dataset is large, you can use feature vectors with only one "feature",
which is an index into a separate data structure, and supply a custom metric
function that looks up the actual data in this data structure. E.g., to use
DBSCAN with Levenshtein distances::

    >>> from leven import levenshtein       # doctest: +SKIP
    >>> import numpy as np
    >>> from sklearn.cluster import dbscan
    >>> data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]
    >>> def lev_metric(x, y):
    ...     i, j = int(x[0]), int(y[0])     # extract indices
    ...     return levenshtein(data[i], data[j])
    ...
    >>> X = np.arange(len(data)).reshape(-1, 1)
    >>> X
    array([[0],
           [1],
           [2]])
    >>> # We need to specify algorithm='brute' as the default assumes
    >>> # a continuous feature space.
    >>> dbscan(X, metric=lev_metric, eps=5, min_samples=2, algorithm='brute')
    ... # doctest: +SKIP
    ([0, 1], array([ 0,  0, -1]))

(This uses the third-party edit distance package ``leven``.)

Similar tricks can be used, with some care, for tree kernels, graph kernels,
etc.

Why do I sometimes get a crash/freeze with n_jobs > 1 under OSX or Linux?
-------------------------------------------------------------------------

Several scikit-learn tools such as ``GridSearchCV`` and ``cross_val_score``
rely internally on Python's ``multiprocessing`` module to parallelize execution
onto several Python processes by passing ``n_jobs > 1`` as an argument.

The problem is that Python ``multiprocessing`` does a ``fork`` system call
without following it with an ``exec`` system call for performance reasons. Many
libraries, like (some versions of) Accelerate / vecLib under OSX, (some versions
of) MKL, the OpenMP runtime of GCC, or nvidia's Cuda (and probably many others),
manage their own internal thread pool. Upon a call to ``fork``, the thread pool
state in the child process is corrupted: the thread pool believes it has many
threads while only the main thread state has been forked. It is possible to
change the libraries to make them detect when a fork happens and reinitialize
the thread pool in that case: we did that for OpenBLAS (merged upstream in
main since 0.2.10) and we contributed a `patch
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035>`_ to GCC's OpenMP runtime
(not yet reviewed).

But in the end the real culprit is Python's ``multiprocessing`` that does
``fork`` without ``exec`` to reduce the overhead of starting and using new
Python processes for parallel computing.
Unfortunately this is a violation of
the POSIX standard and therefore some software vendors like Apple refuse to
consider the lack of fork-safety in Accelerate / vecLib as a bug.

In Python 3.4+ it is possible to configure ``multiprocessing`` to
use the 'forkserver' or 'spawn' start methods (instead of the default
'fork') to manage the process pools. To work around this issue when
using scikit-learn, you can set the ``JOBLIB_START_METHOD`` environment
variable to 'forkserver'. However, the user should be aware that using
the 'forkserver' method prevents ``joblib.Parallel`` from calling functions
interactively defined in a shell session.

If you have custom code that uses ``multiprocessing`` directly instead of using
it via joblib, you can enable the 'forkserver' mode globally for your
program: insert the following instructions in your main script::

    import multiprocessing

    # other imports, custom code, load data, define model...

    if __name__ == '__main__':
        multiprocessing.set_start_method('forkserver')

        # call scikit-learn utils with n_jobs > 1 here

You can find more details on the new start methods in the `multiprocessing
documentation <https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods>`_.

.. _faq_mkl_threading:

Why does my job use more cores than specified with n_jobs?
----------------------------------------------------------

This is because ``n_jobs`` only controls the number of jobs for
routines that are parallelized with ``joblib``, but parallel code can come
from other sources:

- some routines may be parallelized with OpenMP (for code written in C or
  Cython),
- scikit-learn relies a lot on numpy, which in turn may rely on numerical
  libraries like MKL, OpenBLAS or BLIS which can provide parallel
  implementations.

For more details, please refer to our :ref:`Parallelism notes <parallelism>`.
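One common way to cap the extra parallelism coming from these numerical libraries is to set their thread-count environment variables before the scientific stack is imported (a sketch; which variable is actually honored depends on the BLAS implementation your numpy build links against)::

    import os

    # Must be set *before* numpy / scipy / scikit-learn are first imported
    os.environ["OMP_NUM_THREADS"] = "2"        # OpenMP runtimes
    os.environ["OPENBLAS_NUM_THREADS"] = "2"   # OpenBLAS
    os.environ["MKL_NUM_THREADS"] = "2"        # Intel MKL

    import numpy as np  # the BLAS backend now uses at most 2 threads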
Why is there no support for deep or reinforcement learning / Will there be support for deep or reinforcement learning in scikit-learn?
--------------------------------------------------------------------------------------------------------------------------------------

Deep learning and reinforcement learning both require a rich vocabulary to
define an architecture, with deep learning additionally requiring
GPUs for efficient computing. However, neither of these fits within
the design constraints of scikit-learn; as a result, deep learning
and reinforcement learning are currently out of scope for what
scikit-learn seeks to achieve.

You can find more information about the addition of GPU support at
`Will you add GPU support?`_.

Note that scikit-learn currently implements a simple multilayer perceptron
in :mod:`sklearn.neural_network`. We will only accept bug fixes for this module.
If you want to implement more complex deep learning models, please turn to
popular deep learning frameworks such as
`tensorflow <https://www.tensorflow.org/>`_,
`keras <https://keras.io/>`_
and `pytorch <https://pytorch.org/>`_.

Why is my pull request not getting any attention?
-------------------------------------------------

The scikit-learn review process takes a significant amount of time, and
contributors should not be discouraged by a lack of activity or review on
their pull request. We care a lot about getting things right
the first time, as maintenance and later change comes at a high cost.
We rarely release any "experimental" code, so all of our contributions
will be subject to high use immediately and should be of the highest
quality possible initially.

Beyond that, scikit-learn is limited in its reviewing bandwidth; many of the
reviewers and core developers are working on scikit-learn on their own time.
If a review of your pull request comes slowly, it is likely because the
reviewers are busy. We ask for your understanding and request that you
not close your pull request or discontinue your work solely because of
this reason.

How do I set a ``random_state`` for an entire execution?
--------------------------------------------------------

Please refer to :ref:`randomness`.

Why do categorical variables need preprocessing in scikit-learn, compared to other tools?
-----------------------------------------------------------------------------------------

Most of scikit-learn assumes data is in NumPy arrays or SciPy sparse matrices
of a single numeric dtype. These do not explicitly represent categorical
variables at present. Thus, unlike R's data.frames or pandas.DataFrame, we
require explicit conversion of categorical features to numeric values, as
discussed in :ref:`preprocessing_categorical_features`.
See also :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py` for an
example of working with heterogeneous (e.g. categorical and numeric) data.

Why does Scikit-learn not directly work with, for example, pandas.DataFrame?
----------------------------------------------------------------------------

The homogeneous NumPy and SciPy data objects currently expected are most
efficient to process for most operations. Extensive work would also be needed
to support Pandas categorical types. Restricting input to homogeneous
types therefore reduces maintenance cost and encourages usage of efficient
data structures.

Do you plan to implement transform for target y in a pipeline?
--------------------------------------------------------------
Currently transform only works for features X in a pipeline.
There's a long-standing discussion about
not being able to transform y in a pipeline.
Follow the GitHub issue
`#4143 <https://github.com/scikit-learn/scikit-learn/issues/4143>`_.
Meanwhile, check out
:class:`~compose.TransformedTargetRegressor`,
`pipegraph <https://github.com/mcasl/PipeGraph>`_, and
`imbalanced-learn <https://github.com/scikit-learn-contrib/imbalanced-learn>`_.
Note that scikit-learn solved for the case where y
has an invertible transformation applied before training
and inverted after prediction. scikit-learn intends to solve for
use cases where y should be transformed at training time
and not at test time, for resampling and similar uses,
as in `imbalanced-learn`.
In general, these use cases can be solved
with a custom meta estimator rather than a ``Pipeline``.

Why are there so many different estimators for linear models?
-------------------------------------------------------------
Usually, there is one classifier and one regressor per model type, e.g.
:class:`~ensemble.GradientBoostingClassifier` and
:class:`~ensemble.GradientBoostingRegressor`. Both have similar options and
both have the parameter `loss`, which is especially useful in the regression
case as it enables the estimation of conditional mean as well as conditional
quantiles.

For linear models, there are many estimator classes which are very close to
each other. Let us have a look at

- :class:`~linear_model.LinearRegression`, no penalty
- :class:`~linear_model.Ridge`, L2 penalty
- :class:`~linear_model.Lasso`, L1 penalty (sparse models)
- :class:`~linear_model.ElasticNet`, L1 + L2 penalty (less sparse models)
- :class:`~linear_model.SGDRegressor` with `loss='squared_loss'`

**Maintainer perspective:**
They all do in principle the same and are different only by the penalty they
impose. This, however, has a large impact on the way the underlying
optimization problem is solved. In the end, this amounts to usage of different
methods and tricks from linear algebra.
A special case is :class:`~linear_model.SGDRegressor`, which
comprises all 4 previous models and differs by its optimization procedure.
A further side effect is that the different estimators favor different data
layouts (`X` C-contiguous or F-contiguous, sparse csr or csc). This complexity
of the seemingly simple linear models is the reason for having different
estimator classes for different penalties.

**User perspective:**
First, the current design is inspired by the scientific literature, where linear
regression models with different regularization/penalty were given different
names, e.g. *ridge regression*. Having different model classes with according
names makes it easier for users to find those regression models.
Secondly, if all 5 linear models mentioned above were unified into a single
class, there would be parameters with a lot of options, like the ``solver``
parameter. On top of that, there would be a lot of exclusive interactions
between different parameters. For example, the possible options of the
parameters ``solver``, ``precompute`` and ``selection`` would depend on the
chosen values of the penalty parameters ``alpha`` and ``l1_ratio``.
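As a small illustrative sketch of how these separate classes are used side by side (the data and penalty strengths here are arbitrary), one can fit several of the estimators above on the same toy problem and compare the coefficients they recover::

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

    rng = np.random.RandomState(0)
    X = rng.randn(100, 3)
    # True coefficients: the third feature is irrelevant
    y = X @ np.array([1.0, 2.0, 0.0]) + 0.1 * rng.randn(100)

    for model in (LinearRegression(), Ridge(alpha=1.0),
                  Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
        coef = model.fit(X, y).coef_
        print(type(model).__name__, np.round(coef, 2))

The L1-penalized models (``Lasso`` and ``ElasticNet``) will typically shrink the coefficient of the irrelevant third feature all the way to zero, while ``LinearRegression`` and ``Ridge`` keep it small but nonzero.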