.. Places parent toc into the sidebar

:parenttoc: True

.. _loading_other_datasets:

Loading other datasets
======================

.. currentmodule:: sklearn.datasets

.. _sample_images:

Sample images
-------------

Scikit-learn also embeds a couple of sample JPEG images published under a
Creative Commons license by their authors. Those images can be useful to test
algorithms and pipelines on 2D data.

.. autosummary::

   load_sample_images
   load_sample_image

.. image:: ../auto_examples/cluster/images/sphx_glr_plot_color_quantization_001.png
   :target: ../auto_examples/cluster/plot_color_quantization.html
   :scale: 30
   :align: right


.. warning::

  The default coding of images is based on the ``uint8`` dtype to
  spare memory. Often machine learning algorithms work best if the
  input is converted to a floating point representation first. Also,
  if you plan to use ``matplotlib.pyplot.imshow``, don't forget to scale
  to the range 0 - 1 as done in the following example.

.. topic:: Examples:

    * :ref:`sphx_glr_auto_examples_cluster_plot_color_quantization.py`

.. _libsvm_loader:

Datasets in svmlight / libsvm format
------------------------------------

scikit-learn includes utility functions for loading
datasets in the svmlight / libsvm format. In this format, each line
takes the form ``<label> <feature-id>:<feature-value>
<feature-id>:<feature-value> ...``. This format is especially suitable for
sparse datasets. In this module, scipy sparse CSR matrices are used for
``X`` and numpy arrays are used for ``y``.

You may load a dataset as follows::

  >>> from sklearn.datasets import load_svmlight_file
  >>> X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
  ... # doctest: +SKIP

You may also load two (or more) datasets at once::

  >>> X_train, y_train, X_test, y_test = load_svmlight_files(
  ...     ("/path/to/train_dataset.txt", "/path/to/test_dataset.txt"))
  ... # doctest: +SKIP

In this case, ``X_train`` and ``X_test`` are guaranteed to have the same number
of features. Another way to achieve the same result is to fix the number of
features::

  >>> X_test, y_test = load_svmlight_file(
  ...     "/path/to/test_dataset.txt", n_features=X_train.shape[1])
  ... # doctest: +SKIP

.. topic:: Related links:

  _`Public datasets in svmlight / libsvm format`: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

  _`Faster API-compatible implementation`: https://github.com/mblondel/svmlight-loader

..
    For doctests:

    >>> import numpy as np
    >>> import os

.. _openml:

Downloading datasets from the openml.org repository
---------------------------------------------------

`openml.org <https://openml.org>`_ is a public repository for machine learning
data and experiments that allows everybody to upload open datasets.

The ``sklearn.datasets`` package is able to download datasets
from the repository using the function
:func:`sklearn.datasets.fetch_openml`.

For example, to download a dataset of gene expressions in mice brains::

  >>> from sklearn.datasets import fetch_openml
  >>> mice = fetch_openml(name='miceprotein', version=4)

To fully specify a dataset, you need to provide a name and a version, though
the version is optional, see :ref:`openml_versions` below.
The dataset contains a total of 1080 examples belonging to 8 different
classes::

  >>> mice.data.shape
  (1080, 77)
  >>> mice.target.shape
  (1080,)
  >>> np.unique(mice.target)
  array(['c-CS-m', 'c-CS-s', 'c-SC-m', 'c-SC-s', 't-CS-m', 't-CS-s', 't-SC-m', 't-SC-s'], dtype=object)

You can get more information on the dataset by looking at the ``DESCR``
and ``details`` attributes::

  >>> print(mice.DESCR) # doctest: +SKIP
  **Author**: Clara Higuera, Katheleen J. Gardiner, Krzysztof J.
  Cios
  **Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) - 2015
  **Please cite**: Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing
  Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down
  Syndrome. PLoS ONE 10(6): e0129126...

  >>> mice.details # doctest: +SKIP
  {'id': '40966', 'name': 'MiceProtein', 'version': '4', 'format': 'ARFF',
  'upload_date': '2017-11-08T16:00:15', 'licence': 'Public',
  'url': 'https://www.openml.org/data/v1/download/17928620/MiceProtein.arff',
  'file_id': '17928620', 'default_target_attribute': 'class',
  'row_id_attribute': 'MouseID',
  'ignore_attribute': ['Genotype', 'Treatment', 'Behavior'],
  'tag': ['OpenML-CC18', 'study_135', 'study_98', 'study_99'],
  'visibility': 'public', 'status': 'active',
  'md5_checksum': '3c479a6885bfa0438971388283a1ce32'}


The ``DESCR`` contains a free-text description of the data, while ``details``
contains a dictionary of meta-data stored by openml, like the dataset id.
For more details, see the `OpenML documentation
<https://docs.openml.org/#data>`_. The ``data_id`` of the mice protein dataset
is 40966, and you can use this (or the name) to get more information on the
dataset on the openml website::

  >>> mice.url
  'https://www.openml.org/d/40966'

The ``data_id`` also uniquely identifies a dataset from OpenML::

  >>> mice = fetch_openml(data_id=40966)
  >>> mice.details # doctest: +SKIP
  {'id': '4550', 'name': 'MiceProtein', 'version': '1', 'format': 'ARFF',
  'creator': ...,
  'upload_date': '2016-02-17T14:32:49', 'licence': 'Public', 'url':
  'https://www.openml.org/data/v1/download/1804243/MiceProtein.ARFF', 'file_id':
  '1804243', 'default_target_attribute': 'class', 'citation': 'Higuera C,
  Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins
  Critical to Learning in a Mouse Model of Down Syndrome.
  PLoS ONE 10(6): e0129126. [Web Link] journal.pone.0129126', 'tag': ['OpenML100', 'study_14',
  'study_34'], 'visibility': 'public', 'status': 'active', 'md5_checksum':
  '3c479a6885bfa0438971388283a1ce32'}

.. _openml_versions:

Dataset Versions
~~~~~~~~~~~~~~~~

A dataset is uniquely specified by its ``data_id``, but not necessarily by its
name. Several different "versions" of a dataset with the same name can exist
which can contain entirely different datasets.
If a particular version of a dataset has been found to contain significant
issues, it might be deactivated. Using a name to specify a dataset will yield
the earliest version of a dataset that is still active. That means that
``fetch_openml(name="miceprotein")`` can yield different results at different
times if earlier versions become inactive.
You can see that the dataset with ``data_id`` 40966 that we fetched above is
the first version of the "miceprotein" dataset::

  >>> mice.details['version']  # doctest: +SKIP
  '1'

In fact, this dataset only has one version. The iris dataset, on the other
hand, has multiple versions::

  >>> iris = fetch_openml(name="iris")
  >>> iris.details['version']  # doctest: +SKIP
  '1'
  >>> iris.details['id']  # doctest: +SKIP
  '61'

  >>> iris_61 = fetch_openml(data_id=61)
  >>> iris_61.details['version']
  '1'
  >>> iris_61.details['id']
  '61'

  >>> iris_969 = fetch_openml(data_id=969)
  >>> iris_969.details['version']
  '3'
  >>> iris_969.details['id']
  '969'

Specifying the dataset by the name "iris" yields the lowest version, version 1,
with the ``data_id`` 61. To make sure you always get this exact dataset, it is
safest to specify it by the dataset ``data_id``.
The other dataset, with ``data_id`` 969, is version 3 (version 2 has become
inactive), and contains a binarized version of the data::

  >>> np.unique(iris_969.target)
  array(['N', 'P'], dtype=object)

You can also specify both the name and the version, which also uniquely
identifies the dataset::

  >>> iris_version_3 = fetch_openml(name="iris", version=3)
  >>> iris_version_3.details['version']
  '3'
  >>> iris_version_3.details['id']
  '969'


.. topic:: References:

  * Vanschoren, van Rijn, Bischl and Torgo
    `"OpenML: networked science in machine learning"
    <https://arxiv.org/pdf/1407.7722.pdf>`_,
    ACM SIGKDD Explorations Newsletter, 15(2), 49-60, 2014.

.. _external_datasets:

Loading from external datasets
------------------------------

scikit-learn works on any numeric data stored as numpy arrays or scipy sparse
matrices. Other types that are convertible to numeric arrays, such as pandas
DataFrames, are also acceptable.

Here are some recommended ways to load standard columnar data into a
format usable by scikit-learn:

* `pandas.io <https://pandas.pydata.org/pandas-docs/stable/io.html>`_
  provides tools to read data from common formats including CSV, Excel, JSON
  and SQL. DataFrames may also be constructed from lists of tuples or dicts.
  Pandas handles heterogeneous data smoothly and provides tools for
  manipulation and conversion into a numeric array suitable for scikit-learn.
* `scipy.io <https://docs.scipy.org/doc/scipy/reference/io.html>`_
  specializes in binary formats often used in scientific computing
  contexts, such as .mat and .arff.
* `numpy/routines.io <https://docs.scipy.org/doc/numpy/reference/routines.io.html>`_
  for standard loading of columnar data into numpy arrays.
* scikit-learn's :func:`datasets.load_svmlight_file` for the svmlight or libSVM
  sparse format.
* scikit-learn's :func:`datasets.load_files` for directories of text files where
  the name of each directory is the name of each category and each file inside
  of each directory corresponds to one sample from that category.

For some miscellaneous data such as images, videos, and audio, you may wish to
refer to:

* `skimage.io <https://scikit-image.org/docs/dev/api/skimage.io.html>`_ or
  `Imageio <https://imageio.readthedocs.io/en/latest/userapi.html>`_
  for loading images and videos into numpy arrays.
* `scipy.io.wavfile.read
  <https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.io.wavfile.read.html>`_
  for reading WAV files into a numpy array.

Categorical (or nominal) features stored as strings (common in pandas
DataFrames) will need to be converted to numerical features using
:class:`~sklearn.preprocessing.OneHotEncoder` or
:class:`~sklearn.preprocessing.OrdinalEncoder` or similar.
See :ref:`preprocessing`.

Note: if you manage your own numerical data, it is recommended to use an
optimized file format such as HDF5 to reduce data load times. Various libraries
such as H5Py, PyTables and pandas provide a Python interface for reading and
writing data in that format.
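As a minimal sketch of that HDF5 round-trip using ``h5py``, the file name
``data.h5`` and the dataset key ``"X"`` below are arbitrary choices for
illustration, not names required by any library:

```python
import numpy as np
import h5py

# A small numeric array standing in for a feature matrix.
X = np.arange(20.0).reshape(4, 5)

# Write the array to an HDF5 file under the key "X".
with h5py.File("data.h5", "w") as f:
    f.create_dataset("X", data=X)

# Read it back; the result is a regular numpy array that can be
# passed directly to a scikit-learn estimator.
with h5py.File("data.h5", "r") as f:
    X_loaded = f["X"][:]
```

Note that HDF5 datasets also support partial reads (e.g. ``f["X"][:100]``),
which avoids loading a large file into memory all at once.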