.. _faq:

===========================
Frequently Asked Questions
===========================

.. currentmodule:: sklearn

Here we try to give some answers to questions that regularly pop up on the mailing list.

What is the project name (a lot of people get it wrong)?
--------------------------------------------------------
scikit-learn, but not scikit or SciKit nor sci-kit learn.
Also not scikits.learn or scikits-learn, which were previously used.

How do you pronounce the project name?
--------------------------------------
sy-kit learn. sci stands for science!

Why scikit?
-----------
There are multiple scikits, which are scientific toolboxes built around SciPy.
Apart from scikit-learn, another popular one is `scikit-image <https://scikit-image.org/>`_.

How can I contribute to scikit-learn?
-------------------------------------
See :ref:`contributing`. Adding a new algorithm is usually a major and lengthy
undertaking, so it is recommended to start with
:ref:`known issues <new_contributors>`. Please do not contact the contributors
of scikit-learn directly regarding contributing to scikit-learn.

What's the best way to get help on scikit-learn usage?
------------------------------------------------------
**For general machine learning questions**, please use
`Cross Validated <https://stats.stackexchange.com/>`_ with the ``[machine-learning]`` tag.

**For scikit-learn usage questions**, please use `Stack Overflow <https://stackoverflow.com/questions/tagged/scikit-learn>`_
with the ``[scikit-learn]`` and ``[python]`` tags. You can alternatively use the `mailing list
<https://mail.python.org/mailman/listinfo/scikit-learn>`_.

Please make sure to include a minimal reproduction code snippet (ideally shorter
than 10 lines) that highlights your problem on a toy dataset (for instance from
``sklearn.datasets`` or randomly generated with functions of ``numpy.random`` with
a fixed random seed). Please remove any line of code that is not necessary to
reproduce your problem.

The problem should be reproducible by simply copy-pasting your code snippet into a Python
shell with scikit-learn installed. Do not forget to include the import statements.
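For example, a minimal, self-contained snippet along these lines would be ideal
(the estimator choice here is purely illustrative; the data is randomly
generated with a fixed seed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)          # fixed seed for reproducibility
X = rng.rand(20, 3)                     # toy data: 20 samples, 3 features
y = rng.randint(0, 2, size=20)          # binary labels

clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))
```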

More guidance on writing good reproduction code snippets can be found at:

https://stackoverflow.com/help/mcve

If your problem raises an exception that you do not understand (even after googling it),
please make sure to include the full traceback that you obtain when running the
reproduction script.

For bug reports or feature requests, please make use of the
`issue tracker on GitHub <https://github.com/scikit-learn/scikit-learn/issues>`_.

There is also a `scikit-learn Gitter channel
<https://gitter.im/scikit-learn/scikit-learn>`_ where some users and developers
might be found.

**Please do not email any authors directly to ask for assistance, report bugs,
or for any other issue related to scikit-learn.**

How should I save, export or deploy estimators for production?
--------------------------------------------------------------

See :ref:`model_persistence`.

How can I create a bunch object?
--------------------------------

``Bunch`` objects are sometimes used as an output for functions and methods. They
extend dictionaries by enabling values to be accessed either by key,
``bunch["value_key"]``, or by attribute, ``bunch.value_key``.

They should not be used as an input; therefore you almost never need to create
a ``Bunch`` object, unless you are extending scikit-learn's API.
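For example, the built-in dataset loaders return a ``Bunch``, and both access
styles refer to the same values:

```python
from sklearn.datasets import load_iris

iris = load_iris()  # load_iris returns a Bunch
# Key access and attribute access are equivalent:
print(iris["target_names"])
print(iris.target_names)
```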

How can I load my own datasets into a format usable by scikit-learn?
--------------------------------------------------------------------

Generally, scikit-learn works on any numeric data stored as numpy arrays
or scipy sparse matrices. Other types that are convertible to numeric
arrays, such as pandas DataFrames, are also acceptable.

For more information on loading your data files into these usable data
structures, please refer to :ref:`loading external datasets <external_datasets>`.
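As a minimal sketch (the CSV content below is made up for illustration),
numeric data can be loaded into a NumPy array and then passed to any estimator
that accepts numeric arrays:

```python
import numpy as np
from io import StringIO

# Stand-in for a real CSV file on disk
csv_data = StringIO("1.0,2.0\n3.0,4.0\n5.0,6.0\n")
X = np.loadtxt(csv_data, delimiter=",")
print(X.shape)  # a (3, 2) numeric array, directly usable by estimators
```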

.. _new_algorithms_inclusion_criteria:

What are the inclusion criteria for new algorithms?
---------------------------------------------------

We only consider well-established algorithms for inclusion. A rule of thumb is
at least 3 years since publication, 200+ citations, and wide use and
usefulness. A technique that provides a clear-cut improvement (e.g. an
enhanced data structure or a more efficient approximation technique) on
a widely-used method will also be considered for inclusion.

From the algorithms or techniques that meet the above criteria, only those
which fit well within the current API of scikit-learn, that is a ``fit``,
``predict/transform`` interface and ordinarily having input/output that is a
numpy array or sparse matrix, are accepted.

The contributor should support the importance of the proposed addition with
research papers and/or implementations in other similar packages, demonstrate
its usefulness via common use-cases/applications and corroborate performance
improvements, if any, with benchmarks and/or plots. It is expected that the
proposed algorithm should outperform the methods that are already implemented
in scikit-learn at least in some areas.

Inclusion of a new algorithm speeding up an existing model is easier if:

- it does not introduce new hyper-parameters (as it makes the library
  more future-proof),
- it is easy to document clearly when the contribution improves the speed
  and when it does not, for instance "when n_features >> n_samples",
- benchmarks clearly show a speed up.

Also, note that your implementation need not be in scikit-learn to be used
together with scikit-learn tools. You can implement your favorite algorithm
in a scikit-learn compatible way, upload it to GitHub and let us know. We
will be happy to list it under :ref:`related_projects`. If you already have
a package on GitHub following the scikit-learn API, you may also be
interested to look at `scikit-learn-contrib
<https://scikit-learn-contrib.github.io>`_.

.. _selectiveness:

Why are you so selective on what algorithms you include in scikit-learn?
------------------------------------------------------------------------
Code comes with maintenance cost, and we need to balance the amount of
code we have with the size of the team (and add to this the fact that
complexity scales non-linearly with the number of features).
The package relies on core developers using their free time to
fix bugs, maintain code and review contributions.
Any algorithm that is added needs future attention by the developers,
at which point the original author might long have lost interest.
See also :ref:`new_algorithms_inclusion_criteria`. For a great read about
long-term maintenance issues in open-source software, look at
`the Executive Summary of Roads and Bridges
<https://www.fordfoundation.org/media/2976/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure.pdf#page=8>`_.

Why did you remove HMMs from scikit-learn?
------------------------------------------
See :ref:`adding_graphical_models`.

.. _adding_graphical_models:

Will you add graphical models or sequence prediction to scikit-learn?
---------------------------------------------------------------------

Not in the foreseeable future.
scikit-learn tries to provide a unified API for the basic tasks in machine
learning, with pipelines and meta-algorithms like grid search to tie
everything together. The concepts, APIs, algorithms and
expertise required for structured learning are different from what
scikit-learn has to offer. If we started doing arbitrary structured
learning, we'd need to redesign the whole package and the project
would likely collapse under its own weight.

There are two projects with APIs similar to scikit-learn that
do structured prediction:

* `pystruct <https://pystruct.github.io/>`_ handles general structured
  learning (focuses on SSVMs on arbitrary graph structures with
  approximate inference; defines the notion of sample as an instance of
  the graph structure)

* `seqlearn <https://larsmans.github.io/seqlearn/>`_ handles sequences only
  (focuses on exact inference; has HMMs, but mostly for the sake of
  completeness; treats a feature vector as a sample and uses an offset encoding
  for the dependencies between feature vectors)

Will you add GPU support?
-------------------------

No, or at least not in the near future. The main reason is that GPU support
would introduce many software dependencies and platform-specific
issues. scikit-learn is designed to be easy to install on a wide variety of
platforms. Outside of neural networks, GPUs don't play a large role in machine
learning today, and much larger gains in speed can often be achieved by a
careful choice of algorithms.

Do you support PyPy?
--------------------

In case you didn't know, `PyPy <https://pypy.org/>`_ is an alternative
Python implementation with a built-in just-in-time compiler. Experimental
support for PyPy3-v5.10+ has been added, which requires Numpy 1.14.0+
and scipy 1.1.0+.

How do I deal with string data (or trees, graphs...)?
-----------------------------------------------------

scikit-learn estimators assume you'll feed them real-valued feature vectors.
This assumption is hard-coded in pretty much all of the library.
However, you can feed non-numerical inputs to estimators in several ways.

If you have text documents, you can use term frequency features; see
:ref:`text_feature_extraction` for the built-in *text vectorizers*.
For more general feature extraction from any kind of data, see
:ref:`dict_feature_extraction` and :ref:`feature_hashing`.

Another common case is when you have non-numerical data and a custom distance
(or similarity) metric on these data. Examples include strings with edit
distance (a.k.a. Levenshtein distance; e.g., DNA or RNA sequences). These can be
encoded as numbers, but doing so is painful and error-prone. Working with
distance metrics on arbitrary data can be done in two ways.

Firstly, many estimators take precomputed distance/similarity matrices, so if
the dataset is not too large, you can compute distances for all pairs of inputs.
If the dataset is large, you can use feature vectors with only one "feature",
which is an index into a separate data structure, and supply a custom metric
function that looks up the actual data in this data structure. E.g., to use
DBSCAN with Levenshtein distances::

    >>> from leven import levenshtein       # doctest: +SKIP
    >>> import numpy as np
    >>> from sklearn.cluster import dbscan
    >>> data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]
    >>> def lev_metric(x, y):
    ...     i, j = int(x[0]), int(y[0])     # extract indices
    ...     return levenshtein(data[i], data[j])
    ...
    >>> X = np.arange(len(data)).reshape(-1, 1)
    >>> X
    array([[0],
           [1],
           [2]])
    >>> # We need to specify algorithm='brute' as the default assumes
    >>> # a continuous feature space.
    >>> dbscan(X, metric=lev_metric, eps=5, min_samples=2, algorithm='brute')
    ... # doctest: +SKIP
    ([0, 1], array([ 0,  0, -1]))

(This uses the third-party edit distance package ``leven``.)
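The first option, passing a precomputed distance matrix, can be sketched as
follows (the pairwise distances below are made-up values for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up pairwise distance matrix for three samples
D = np.array([[0.0, 2.0, 9.0],
              [2.0, 0.0, 8.0],
              [9.0, 8.0, 0.0]])
# metric='precomputed' tells DBSCAN to read D as distances, not features
labels = DBSCAN(eps=5, min_samples=2, metric="precomputed").fit_predict(D)
print(labels)  # the two nearby samples cluster together; the third is noise
```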

Similar tricks can be used, with some care, for tree kernels, graph kernels,
etc.

Why do I sometimes get a crash/freeze with n_jobs > 1 under OSX or Linux?
-------------------------------------------------------------------------

Several scikit-learn tools such as ``GridSearchCV`` and ``cross_val_score``
rely internally on Python's ``multiprocessing`` module to parallelize execution
onto several Python processes by passing ``n_jobs > 1`` as an argument.

The problem is that Python ``multiprocessing`` does a ``fork`` system call
without following it with an ``exec`` system call for performance reasons. Many
libraries like (some versions of) Accelerate / vecLib under OSX, (some versions
of) MKL, the OpenMP runtime of GCC, nvidia's Cuda (and probably many others)
manage their own internal thread pool. Upon a call to ``fork``, the thread pool
state in the child process is corrupted: the thread pool believes it has many
threads while only the main thread state has been forked. It is possible to
change the libraries to make them detect when a fork happens and reinitialize
the thread pool in that case: we did that for OpenBLAS (merged upstream in
main since 0.2.10) and we contributed a `patch
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035>`_ to GCC's OpenMP runtime
(not yet reviewed).

But in the end the real culprit is Python's ``multiprocessing`` that does
``fork`` without ``exec`` to reduce the overhead of starting and using new
Python processes for parallel computing. Unfortunately this is a violation of
the POSIX standard and therefore some software editors like Apple refuse to
consider the lack of fork-safety in Accelerate / vecLib as a bug.

In Python 3.4+ it is now possible to configure ``multiprocessing`` to
use the 'forkserver' or 'spawn' start methods (instead of the default
'fork') to manage the process pools. To work around this issue when
using scikit-learn, you can set the ``JOBLIB_START_METHOD`` environment
variable to 'forkserver'. However the user should be aware that using
the 'forkserver' method prevents ``joblib.Parallel`` from calling functions
interactively defined in a shell session.

If you have custom code that uses ``multiprocessing`` directly instead of using
it via joblib, you can enable the 'forkserver' mode globally for your
program: insert the following instructions in your main script::

    import multiprocessing

    # other imports, custom code, load data, define model...

    if __name__ == '__main__':
        multiprocessing.set_start_method('forkserver')

        # call scikit-learn utils with n_jobs > 1 here

You can find more details on the start methods in the `multiprocessing
documentation <https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods>`_.

.. _faq_mkl_threading:

Why does my job use more cores than specified with n_jobs?
----------------------------------------------------------

This is because ``n_jobs`` only controls the number of jobs for
routines that are parallelized with ``joblib``, but parallel code can come
from other sources:

- some routines may be parallelized with OpenMP (for code written in C or
  Cython),
- scikit-learn relies a lot on numpy, which in turn may rely on numerical
  libraries like MKL, OpenBLAS or BLIS which can provide parallel
  implementations.

For more details, please refer to our :ref:`Parallelism notes <parallelism>`.
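As a rough sketch, those extra thread pools can usually be capped through
environment variables, set before numpy (and thus scikit-learn) is imported.
The exact variable that applies depends on which BLAS/OpenMP runtime your
installation links against, so treat the names below as common examples rather
than an exhaustive list:

```python
import os

# Cap the thread pools of common BLAS/OpenMP runtimes.
# This must run before numpy is imported to take effect.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # BLAS operations now use a single thread
```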


Why is there no support for deep or reinforcement learning / Will there be support for deep or reinforcement learning in scikit-learn?
--------------------------------------------------------------------------------------------------------------------------------------

Deep learning and reinforcement learning both require a rich vocabulary to
define an architecture, with deep learning additionally requiring
GPUs for efficient computing. However, neither of these fits within
the design constraints of scikit-learn; as a result, deep learning
and reinforcement learning are currently out of scope for what
scikit-learn seeks to achieve.

You can find more information about the addition of GPU support at
`Will you add GPU support?`_.

Note that scikit-learn currently implements a simple multilayer perceptron
in :mod:`sklearn.neural_network`. We will only accept bug fixes for this module.
If you want to implement more complex deep learning models, please turn to
popular deep learning frameworks such as
`tensorflow <https://www.tensorflow.org/>`_,
`keras <https://keras.io/>`_
and `pytorch <https://pytorch.org/>`_.

Why is my pull request not getting any attention?
-------------------------------------------------

The scikit-learn review process takes a significant amount of time, and
contributors should not be discouraged by a lack of activity or review on
their pull request. We care a lot about getting things right
the first time, as maintenance and later change come at a high cost.
We rarely release any "experimental" code, so all of our contributions
will be subject to high use immediately and should be of the highest
quality possible initially.

Beyond that, scikit-learn is limited in its reviewing bandwidth; many of the
reviewers and core developers are working on scikit-learn on their own time.
If a review of your pull request comes slowly, it is likely because the
reviewers are busy. We ask for your understanding and request that you
not close your pull request or discontinue your work solely because of
this reason.

How do I set a ``random_state`` for an entire execution?
--------------------------------------------------------

Please refer to :ref:`randomness`.

Why do categorical variables need preprocessing in scikit-learn, compared to other tools?
-----------------------------------------------------------------------------------------

Most of scikit-learn assumes data is in NumPy arrays or SciPy sparse matrices
of a single numeric dtype. These do not explicitly represent categorical
variables at present. Thus, unlike R's data.frames or pandas.DataFrame, we
require explicit conversion of categorical features to numeric values, as
discussed in :ref:`preprocessing_categorical_features`.
See also :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py` for an
example of working with heterogeneous (e.g. categorical and numeric) data.
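For example, a nominal string feature can be one-hot encoded before being fed
to an estimator (the color values below are made up for illustration):

```python
from sklearn.preprocessing import OneHotEncoder

X = [["red"], ["green"], ["blue"]]
enc = OneHotEncoder()
# Categories are sorted alphabetically: blue, green, red
encoded = enc.fit_transform(X).toarray()
print(encoded)
```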

Why does Scikit-learn not directly work with, for example, pandas.DataFrame?
----------------------------------------------------------------------------

The homogeneous NumPy and SciPy data objects currently expected are the most
efficient to process for most operations. Extensive work would also be needed
to support Pandas categorical types. Restricting input to homogeneous
types therefore reduces maintenance cost and encourages usage of efficient
data structures.

Do you plan to implement transform for target y in a pipeline?
--------------------------------------------------------------
Currently transform only works for features X in a pipeline.
There's a long-standing discussion about
not being able to transform y in a pipeline.
Follow the GitHub issue
`#4143 <https://github.com/scikit-learn/scikit-learn/issues/4143>`_.
Meanwhile, check out
:class:`~compose.TransformedTargetRegressor`,
`pipegraph <https://github.com/mcasl/PipeGraph>`_, and
`imbalanced-learn <https://github.com/scikit-learn-contrib/imbalanced-learn>`_.
Note that scikit-learn has solved the case where y
has an invertible transformation applied before training
and inverted after prediction. scikit-learn intends to solve
use cases where y should be transformed at training time
and not at test time, for resampling and similar uses,
as in `imbalanced-learn`.
In general, these use cases can be solved
with a custom meta-estimator rather than a ``Pipeline``.
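The invertible-transformation case can be sketched with
:class:`~compose.TransformedTargetRegressor` (the data below is synthetic and
purely illustrative):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.exp(0.3 * X.ravel())  # target is exactly log-linear in X

# np.log is applied to y before fitting; np.exp after predicting
reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log, inverse_func=np.exp)
reg.fit(X, y)
preds = reg.predict(X)
print(preds[:2])  # predictions come back on the original scale of y
```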

Why are there so many different estimators for linear models?
-------------------------------------------------------------
Usually, there is one classifier and one regressor per model type, e.g.
:class:`~ensemble.GradientBoostingClassifier` and
:class:`~ensemble.GradientBoostingRegressor`. Both have similar options and
both have the parameter `loss`, which is especially useful in the regression
case as it enables the estimation of the conditional mean as well as of
conditional quantiles.
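For instance, estimating the conditional 90% quantile can be sketched as
follows (the data is synthetic, and the hyper-parameters are illustrative
defaults rather than a tuned setup):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = X.ravel() + rng.normal(scale=0.1, size=200)

# loss='quantile' with alpha=0.9 targets the conditional 90% quantile,
# so most training targets should fall below the predictions
reg = GradientBoostingRegressor(loss="quantile", alpha=0.9)
reg.fit(X, y)
pred = reg.predict(X)
```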

For linear models, there are many estimator classes which are very close to
each other. Let us have a look at

- :class:`~linear_model.LinearRegression`, no penalty
- :class:`~linear_model.Ridge`, L2 penalty
- :class:`~linear_model.Lasso`, L1 penalty (sparse models)
- :class:`~linear_model.ElasticNet`, L1 + L2 penalty (less sparse models)
- :class:`~linear_model.SGDRegressor` with `loss='squared_loss'`

**Maintainer perspective:**
They all do in principle the same thing and differ only by the penalty they
impose. This, however, has a large impact on the way the underlying
optimization problem is solved. In the end, this amounts to usage of different
methods and tricks from linear algebra. A special case is `SGDRegressor`, which
comprises all 4 previous models and differs by its optimization procedure.
A further side effect is that the different estimators favor different data
layouts (`X` C-contiguous or F-contiguous, sparse csr or csc). This complexity
of the seemingly simple linear models is the reason for having different
estimator classes for different penalties.

**User perspective:**
First, the current design is inspired by the scientific literature, where linear
regression models with different regularization/penalty were given different
names, e.g. *ridge regression*. Having different model classes with according
names makes it easier for users to find those regression models.
Secondly, if all 5 above-mentioned linear models were unified into a single
class, there would be parameters with a lot of options like the ``solver``
parameter. On top of that, there would be a lot of exclusive interactions
between different parameters. For example, the possible options of the
parameters ``solver``, ``precompute`` and ``selection`` would depend on the
chosen values of the penalty parameters ``alpha`` and ``l1_ratio``.
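As a small sketch of how the penalties behave differently (the data is
synthetic and the `alpha` values are arbitrary): an L2 penalty shrinks
coefficients towards zero, while an L1 penalty drives many of them exactly to
zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
# Only the first feature actually matters
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print(np.sum(ridge.coef_ == 0.0))  # Ridge: coefficients shrunk, not zeroed
print(np.sum(lasso.coef_ == 0.0))  # Lasso: most irrelevant features zeroed out
```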