.. Places parent toc into the sidebar

:parenttoc: True

.. _loading_other_datasets:

Loading other datasets
======================

.. currentmodule:: sklearn.datasets

.. _sample_images:

Sample images
-------------

Scikit-learn also embeds a couple of sample JPEG images published under a Creative
Commons license by their authors. Those images can be useful to test algorithms
and pipelines on 2D data.

.. autosummary::

   load_sample_images
   load_sample_image

.. image:: ../auto_examples/cluster/images/sphx_glr_plot_color_quantization_001.png
   :target: ../auto_examples/cluster/plot_color_quantization.html
   :scale: 30
   :align: right


.. warning::

  The default encoding of images is based on the ``uint8`` dtype to
  save memory. Machine learning algorithms often work best if the
  input is converted to a floating point representation first. Also,
  if you plan to use ``matplotlib.pyplot.imshow``, don't forget to scale to the
  range 0 - 1 as done in the following example.
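
As a minimal sketch of that conversion, one of the bundled images can be loaded
and rescaled with plain numpy (no plotting pipeline shown here):

```python
import numpy as np
from sklearn.datasets import load_sample_image

# Load one of the bundled sample images; pixels are uint8 in [0, 255].
china = load_sample_image("china.jpg")

# Convert to float64 and scale to [0, 1], e.g. before matplotlib.pyplot.imshow.
china_float = china.astype(np.float64) / 255
```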

.. topic:: Examples:

    * :ref:`sphx_glr_auto_examples_cluster_plot_color_quantization.py`

.. _libsvm_loader:

Datasets in svmlight / libsvm format
------------------------------------

scikit-learn includes utility functions for loading
datasets in the svmlight / libsvm format. In this format, each line
takes the form ``<label> <feature-id>:<feature-value>
<feature-id>:<feature-value> ...``. This format is especially suitable for sparse datasets.
In this module, scipy sparse CSR matrices are used for ``X`` and numpy arrays are used for ``y``.

You may load a dataset as follows::

  >>> from sklearn.datasets import load_svmlight_file
  >>> X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
  ...                                                         # doctest: +SKIP

You may also load two (or more) datasets at once::

  >>> from sklearn.datasets import load_svmlight_files
  >>> X_train, y_train, X_test, y_test = load_svmlight_files(
  ...     ("/path/to/train_dataset.txt", "/path/to/test_dataset.txt"))
  ...                                                         # doctest: +SKIP

In this case, ``X_train`` and ``X_test`` are guaranteed to have the same number
of features. Another way to achieve the same result is to fix the number of
features::

  >>> X_test, y_test = load_svmlight_file(
  ...     "/path/to/test_dataset.txt", n_features=X_train.shape[1])
  ...                                                         # doctest: +SKIP
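
Conversely, arrays already in memory can be written out in this format with the
companion :func:`dump_svmlight_file`. A minimal round trip might look as
follows (the file path here is just a temporary location, not a real dataset):

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# A tiny sparse dataset: two samples, three features.
X = csr_matrix(np.array([[0.0, 2.0, 0.0], [1.0, 0.0, 3.0]]))
y = np.array([0, 1])

# Write the data in svmlight / libsvm format, then read it back.
path = os.path.join(tempfile.mkdtemp(), "tiny.svmlight")
dump_svmlight_file(X, y, path)
X_loaded, y_loaded = load_svmlight_file(path)

# X_loaded is a scipy sparse CSR matrix, y_loaded a numpy array.
```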

.. topic:: Related links:

 _`Public datasets in svmlight / libsvm format`: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

 _`Faster API-compatible implementation`: https://github.com/mblondel/svmlight-loader

..
    For doctests:

    >>> import numpy as np
    >>> import os
.. _openml:

Downloading datasets from the openml.org repository
---------------------------------------------------

`openml.org <https://openml.org>`_ is a public repository for machine learning
data and experiments that allows everybody to upload open datasets.

The ``sklearn.datasets`` package is able to download datasets
from the repository using the function
:func:`sklearn.datasets.fetch_openml`.

For example, to download a dataset of gene expressions in mice brains::

  >>> from sklearn.datasets import fetch_openml
  >>> mice = fetch_openml(name='miceprotein', version=4)

To fully specify a dataset, you need to provide a name and a version, though
the version is optional; see :ref:`openml_versions` below.
The dataset contains a total of 1080 examples belonging to 8 different
classes::

  >>> mice.data.shape
  (1080, 77)
  >>> mice.target.shape
  (1080,)
  >>> np.unique(mice.target)
  array(['c-CS-m', 'c-CS-s', 'c-SC-m', 'c-SC-s', 't-CS-m', 't-CS-s', 't-SC-m', 't-SC-s'], dtype=object)

You can get more information on the dataset by looking at the ``DESCR``
and ``details`` attributes::

  >>> print(mice.DESCR) # doctest: +SKIP
  **Author**: Clara Higuera, Katheleen J. Gardiner, Krzysztof J. Cios
  **Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) - 2015
  **Please cite**: Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing
  Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down
  Syndrome. PLoS ONE 10(6): e0129126...

  >>> mice.details # doctest: +SKIP
  {'id': '40966', 'name': 'MiceProtein', 'version': '4', 'format': 'ARFF',
  'upload_date': '2017-11-08T16:00:15', 'licence': 'Public',
  'url': 'https://www.openml.org/data/v1/download/17928620/MiceProtein.arff',
  'file_id': '17928620', 'default_target_attribute': 'class',
  'row_id_attribute': 'MouseID',
  'ignore_attribute': ['Genotype', 'Treatment', 'Behavior'],
  'tag': ['OpenML-CC18', 'study_135', 'study_98', 'study_99'],
  'visibility': 'public', 'status': 'active',
  'md5_checksum': '3c479a6885bfa0438971388283a1ce32'}


The ``DESCR`` contains a free-text description of the data, while ``details``
contains a dictionary of meta-data stored by openml, like the dataset id.
For more details, see the `OpenML documentation
<https://docs.openml.org/#data>`_. The ``data_id`` of the mice protein dataset
is 40966, and you can use this (or the name) to get more information on the
dataset on the openml website::

  >>> mice.url
  'https://www.openml.org/d/40966'

The ``data_id`` also uniquely identifies a dataset from OpenML::

  >>> mice = fetch_openml(data_id=40966)
  >>> mice.details # doctest: +SKIP
  {'id': '4550', 'name': 'MiceProtein', 'version': '1', 'format': 'ARFF',
  'creator': ...,
  'upload_date': '2016-02-17T14:32:49', 'licence': 'Public', 'url':
  'https://www.openml.org/data/v1/download/1804243/MiceProtein.ARFF', 'file_id':
  '1804243', 'default_target_attribute': 'class', 'citation': 'Higuera C,
  Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins
  Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6):
  e0129126. [Web Link] journal.pone.0129126', 'tag': ['OpenML100', 'study_14',
  'study_34'], 'visibility': 'public', 'status': 'active', 'md5_checksum':
  '3c479a6885bfa0438971388283a1ce32'}

.. _openml_versions:

Dataset Versions
~~~~~~~~~~~~~~~~

A dataset is uniquely specified by its ``data_id``, but not necessarily by its
name. Several different "versions" of a dataset with the same name can exist,
and these versions can contain entirely different data.
If a particular version of a dataset has been found to contain significant
issues, it might be deactivated. Using a name to specify a dataset will yield
the earliest version of the dataset that is still active. That means that
``fetch_openml(name="miceprotein")`` can yield different results at different
times if earlier versions become inactive.
You can see that the dataset with ``data_id`` 40966 that we fetched above is
the first version of the "miceprotein" dataset::

  >>> mice.details['version']  #doctest: +SKIP
  '1'

In fact, this dataset only has one version. The iris dataset on the other hand
has multiple versions::

  >>> iris = fetch_openml(name="iris")
  >>> iris.details['version']  #doctest: +SKIP
  '1'
  >>> iris.details['id']  #doctest: +SKIP
  '61'

  >>> iris_61 = fetch_openml(data_id=61)
  >>> iris_61.details['version']
  '1'
  >>> iris_61.details['id']
  '61'

  >>> iris_969 = fetch_openml(data_id=969)
  >>> iris_969.details['version']
  '3'
  >>> iris_969.details['id']
  '969'

Specifying the dataset by the name "iris" yields the lowest version, version 1,
with the ``data_id`` 61. To make sure you always get this exact dataset, it is
safest to specify it by the dataset ``data_id``. The other dataset, with
``data_id`` 969, is version 3 (version 2 has become inactive), and contains a
binarized version of the data::

  >>> np.unique(iris_969.target)
  array(['N', 'P'], dtype=object)

You can also specify both the name and the version, which also uniquely
identifies the dataset::

  >>> iris_version_3 = fetch_openml(name="iris", version=3)
  >>> iris_version_3.details['version']
  '3'
  >>> iris_version_3.details['id']
  '969'


.. topic:: References:

 * Vanschoren, van Rijn, Bischl and Torgo
   `"OpenML: networked science in machine learning"
   <https://arxiv.org/pdf/1407.7722.pdf>`_,
   ACM SIGKDD Explorations Newsletter, 15(2), 49-60, 2014.

.. _external_datasets:

Loading from external datasets
------------------------------

scikit-learn works on any numeric data stored as numpy arrays or scipy sparse
matrices. Other types that are convertible to numeric arrays, such as a pandas
``DataFrame``, are also acceptable.

Here are some recommended ways to load standard columnar data into a
format usable by scikit-learn:

* `pandas.io <https://pandas.pydata.org/pandas-docs/stable/io.html>`_
  provides tools to read data from common formats including CSV, Excel, JSON
  and SQL. DataFrames may also be constructed from lists of tuples or dicts.
  Pandas handles heterogeneous data smoothly and provides tools for
  manipulation and conversion into a numeric array suitable for scikit-learn.
* `scipy.io <https://docs.scipy.org/doc/scipy/reference/io.html>`_
  specializes in binary formats often used in a scientific computing
  context, such as ``.mat`` and ``.arff``.
* `numpy/routines.io <https://docs.scipy.org/doc/numpy/reference/routines.io.html>`_
  for standard loading of columnar data into numpy arrays
* scikit-learn's :func:`datasets.load_svmlight_file` for the svmlight or libSVM
  sparse format
* scikit-learn's :func:`datasets.load_files` for directories of text files where
  the name of each directory is the name of each category and each file inside
  of each directory corresponds to one sample from that category
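
As a small illustration of the pandas route, a CSV with mixed columns can be
read and split into a numeric feature matrix ``X`` and a target ``y``. The
column names and values below are made up for the example (an in-memory
``StringIO`` stands in for a file on disk):

```python
from io import StringIO

import pandas as pd

# An inline CSV standing in for a file path; columns are hypothetical.
csv_data = StringIO(
    "sepal_length,sepal_width,species\n"
    "5.1,3.5,setosa\n"
    "6.2,2.9,versicolor\n"
)
df = pd.read_csv(csv_data)

# Numeric columns become the feature matrix, the string column the target.
X = df[["sepal_length", "sepal_width"]].to_numpy()
y = df["species"].to_numpy()
```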

For some miscellaneous data such as images, videos, and audio, you may wish to
refer to:

* `skimage.io <https://scikit-image.org/docs/dev/api/skimage.io.html>`_ or
  `Imageio <https://imageio.readthedocs.io/en/latest/userapi.html>`_
  for loading images and videos into numpy arrays
* `scipy.io.wavfile.read
  <https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.io.wavfile.read.html>`_
  for reading WAV files into a numpy array

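For example, ``scipy.io.wavfile`` can round trip a short signal through a WAV
file. This sketch writes a synthetic tone to a temporary path and reads it back
as a numpy array (the path and signal are invented for the example):

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

# One second of a 440 Hz tone at an 8 kHz sampling rate, as 16-bit PCM.
rate = 8000
t = np.arange(rate) / rate
tone = (np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

# Write to a temporary WAV file and read it back into a numpy array.
path = os.path.join(tempfile.mkdtemp(), "tone.wav")
wavfile.write(path, rate, tone)
rate_read, data = wavfile.read(path)
```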
Categorical (or nominal) features stored as strings (common in pandas DataFrames)
will need to be converted to numerical features using :class:`~sklearn.preprocessing.OneHotEncoder`,
:class:`~sklearn.preprocessing.OrdinalEncoder`, or similar.
See :ref:`preprocessing`.
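
As a brief sketch with made-up category values, ``OrdinalEncoder`` maps each
string category to an integer code, while ``OneHotEncoder`` expands it into
one indicator column per category:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# A single string-valued feature with three made-up categories.
X = np.array([["red"], ["green"], ["blue"], ["green"]])

# OrdinalEncoder: one integer code per category (sorted alphabetically
# by default, so blue -> 0, green -> 1, red -> 2).
X_ord = OrdinalEncoder().fit_transform(X)

# OneHotEncoder: one 0/1 column per category (dense via toarray()).
X_hot = OneHotEncoder().fit_transform(X).toarray()
```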

Note: if you manage your own numerical data, it is recommended to use an
optimized file format such as HDF5 to reduce data load times. Various libraries
such as H5Py, PyTables and pandas provide a Python interface for reading and
writing data in that format.
