.. _random_projection:

==================
Random Projection
==================
.. currentmodule:: sklearn.random_projection

The :mod:`sklearn.random_projection` module implements a simple and
computationally efficient way to reduce the dimensionality of the data by
trading a controlled amount of accuracy (as additional variance) for faster
processing times and smaller model sizes. This module implements two types
of unstructured random matrices:
:ref:`Gaussian random matrices <gaussian_random_matrix>` and
:ref:`sparse random matrices <sparse_random_matrix>`.

The dimensions and distribution of random projection matrices are
controlled so as to preserve the pairwise distances between any two
samples of the dataset. Thus random projection is a suitable approximation
technique for distance-based methods.
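
For instance, the following sketch checks empirically that pairwise
Euclidean distances are approximately preserved after a Gaussian random
projection (the dataset sizes, ``n_components`` and the tolerance below are
illustrative choices only)::

  >>> import numpy as np
  >>> from sklearn.metrics import euclidean_distances
  >>> from sklearn.random_projection import GaussianRandomProjection
  >>> X = np.random.RandomState(42).rand(50, 10000)
  >>> proj = GaussianRandomProjection(n_components=1000, random_state=42)
  >>> X_new = proj.fit_transform(X)
  >>> iu = np.triu_indices(50, k=1)  # every distinct pair of samples
  >>> ratios = euclidean_distances(X_new)[iu] / euclidean_distances(X)[iu]
  >>> bool(np.all(np.abs(ratios - 1) < 0.2))  # distances barely distorted
  True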


.. topic:: References:

 * Sanjoy Dasgupta. 2000.
   `Experiments with random projection. <https://cseweb.ucsd.edu/~dasgupta/papers/randomf.pdf>`_
   In Proceedings of the Sixteenth conference on Uncertainty in artificial
   intelligence (UAI'00), Craig Boutilier and Moisés Goldszmidt (Eds.). Morgan
   Kaufmann Publishers Inc., San Francisco, CA, USA, 143-151.

 * Ella Bingham and Heikki Mannila. 2001.
   `Random projection in dimensionality reduction: applications to image and text data. <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.5135&rep=rep1&type=pdf>`_
   In Proceedings of the seventh ACM SIGKDD international conference on
   Knowledge discovery and data mining (KDD '01). ACM, New York, NY, USA,
   245-250.

.. _johnson_lindenstrauss:

The Johnson-Lindenstrauss lemma
===============================

The main theoretical result behind the efficiency of random projection is the
`Johnson-Lindenstrauss lemma (quoting Wikipedia)
<https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma>`_:

  In mathematics, the Johnson-Lindenstrauss lemma is a result
  concerning low-distortion embeddings of points from high-dimensional
  into low-dimensional Euclidean space. The lemma states that a small set
  of points in a high-dimensional space can be embedded into a space of
  much lower dimension in such a way that distances between the points are
  nearly preserved. The map used for the embedding is at least Lipschitz,
  and can even be taken to be an orthogonal projection.

Knowing only the number of samples, the
:func:`johnson_lindenstrauss_min_dim` function conservatively estimates the
minimal size of the random subspace needed to guarantee a bounded distortion
introduced by the random projection::

  >>> from sklearn.random_projection import johnson_lindenstrauss_min_dim
  >>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=0.5)
  663
  >>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=[0.5, 0.1, 0.01])
  array([    663,   11841, 1112658])
  >>> johnson_lindenstrauss_min_dim(n_samples=[1e4, 1e5, 1e6], eps=0.1)
  array([ 7894,  9868, 11841])
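
The values above come from the conservative bound evaluated by
:func:`johnson_lindenstrauss_min_dim`, stated here for reference (see the
example linked below for its derivation):

.. math::

  n_{\text{components}} \geq \frac{4 \log(n_{\text{samples}})}{\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}}

where :math:`\epsilon` (``eps``) is the maximum relative distortion allowed
on the pairwise distances. Note that the required dimension grows only
logarithmically with the number of samples and does not depend on the
original number of features.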

.. figure:: ../auto_examples/miscellaneous/images/sphx_glr_plot_johnson_lindenstrauss_bound_001.png
   :target: ../auto_examples/miscellaneous/plot_johnson_lindenstrauss_bound.html
   :scale: 75
   :align: center

.. figure:: ../auto_examples/miscellaneous/images/sphx_glr_plot_johnson_lindenstrauss_bound_002.png
   :target: ../auto_examples/miscellaneous/plot_johnson_lindenstrauss_bound.html
   :scale: 75
   :align: center

.. topic:: Example:

  * See :ref:`sphx_glr_auto_examples_miscellaneous_plot_johnson_lindenstrauss_bound.py`
    for a theoretical explication of the Johnson-Lindenstrauss lemma and an
    empirical validation using sparse random matrices.

.. topic:: References:

  * Sanjoy Dasgupta and Anupam Gupta, 1999.
    `An elementary proof of the Johnson-Lindenstrauss Lemma.
    <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39.3334&rep=rep1&type=pdf>`_

.. _gaussian_random_matrix:

Gaussian random projection
==========================
The :class:`GaussianRandomProjection` reduces the
dimensionality by projecting the original input space onto a randomly
generated matrix whose components are drawn from the following distribution
:math:`N(0, \frac{1}{n_{components}})`.

Here is a small excerpt illustrating how to use the Gaussian random
projection transformer::

  >>> import numpy as np
  >>> from sklearn import random_projection
  >>> X = np.random.rand(100, 10000)
  >>> transformer = random_projection.GaussianRandomProjection()
  >>> X_new = transformer.fit_transform(X)
  >>> X_new.shape
  (100, 3947)
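
Continuing the excerpt, the fitted random matrix is available as the
``components_`` attribute. As an illustrative sanity check (an addition to
the original excerpt), its entries should have a standard deviation close to
:math:`1 / \sqrt{n_{components}}`::

  >>> transformer.components_.shape
  (3947, 10000)
  >>> std = transformer.components_.std()
  >>> bool(np.isclose(std, 1 / np.sqrt(3947), rtol=1e-2))
  True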


.. _sparse_random_matrix:

Sparse random projection
========================
The :class:`SparseRandomProjection` reduces the
dimensionality by projecting the original input space using a sparse
random matrix.

Sparse random matrices are an alternative to dense Gaussian random
projection matrices: they guarantee similar embedding quality while being
much more memory efficient and allowing faster computation of the projected
data.

If we define ``s = 1 / density``, the elements of the random matrix
are drawn from

.. math::

  \left\{
  \begin{array}{c c l}
  -\sqrt{\frac{s}{n_{\text{components}}}} & & 1 / 2s\\
  0 &\text{with probability} & 1 - 1 / s \\
  +\sqrt{\frac{s}{n_{\text{components}}}} & & 1 / 2s\\
  \end{array}
  \right.

where :math:`n_{\text{components}}` is the size of the projected subspace.
By default the density of non-zero elements is set to the minimum density
recommended by Ping Li et al.: :math:`1 / \sqrt{n_{\text{features}}}`.
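
For illustration only, such a matrix could be sampled directly with NumPy.
This is a minimal sketch of the distribution above, not how
:class:`SparseRandomProjection` builds its matrix internally (which uses a
sparse representation rather than this dense one)::

  import numpy as np

  rng = np.random.RandomState(42)
  n_components, n_features = 100, 10000
  density = 1 / np.sqrt(n_features)  # minimum density of Ping Li et al.
  s = 1 / density

  # each element is -sqrt(s / n_components) or +sqrt(s / n_components)
  # with probability 1 / (2s) each, and 0 with probability 1 - 1 / s
  components = rng.choice(
      [-np.sqrt(s / n_components), 0.0, np.sqrt(s / n_components)],
      size=(n_components, n_features),
      p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)],
  )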

Here is a small excerpt illustrating how to use the sparse random
projection transformer::

  >>> import numpy as np
  >>> from sklearn import random_projection
  >>> X = np.random.rand(100, 10000)
  >>> transformer = random_projection.SparseRandomProjection()
  >>> X_new = transformer.fit_transform(X)
  >>> X_new.shape
  (100, 3947)
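
Continuing the excerpt, the sparse projection matrix is stored in the
``components_`` attribute (as a SciPy sparse matrix), and the density
actually used is exposed as ``density_``, here
:math:`1 / \sqrt{10000} = 0.01`::

  >>> float(transformer.density_)
  0.01
  >>> transformer.components_.shape
  (3947, 10000)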


.. topic:: References:

 * D. Achlioptas. 2003.
   `Database-friendly random projections: Johnson-Lindenstrauss with binary
   coins <http://www.cs.ucsc.edu/~optas/papers/jl.pdf>`_.
   Journal of Computer and System Sciences 66 (2003) 671-687.

 * Ping Li, Trevor J. Hastie, and Kenneth W. Church. 2006.
   `Very sparse random projections. <https://web.stanford.edu/~hastie/Papers/Ping/KDD06_rp.pdf>`_
   In Proceedings of the 12th ACM SIGKDD international conference on
   Knowledge discovery and data mining (KDD '06). ACM, New York, NY, USA,
   287-296.