.. _random_projection:

==================
Random Projection
==================
.. currentmodule:: sklearn.random_projection

The :mod:`sklearn.random_projection` module implements a simple and
computationally efficient way to reduce the dimensionality of the data by
trading a controlled amount of accuracy (as additional variance) for faster
processing times and smaller model sizes. This module implements two types of
unstructured random matrices:
:ref:`Gaussian random matrix <gaussian_random_matrix>` and
:ref:`sparse random matrix <sparse_random_matrix>`.

The dimensions and distribution of random projection matrices are
controlled so as to preserve the pairwise distances between any two
samples of the dataset. Random projection is therefore a suitable
approximation technique for distance-based methods.


.. topic:: References:

  * Sanjoy Dasgupta. 2000.
    `Experiments with random projection. <https://cseweb.ucsd.edu/~dasgupta/papers/randomf.pdf>`_
    In Proceedings of the Sixteenth conference on Uncertainty in artificial
    intelligence (UAI'00), Craig Boutilier and Moisés Goldszmidt (Eds.). Morgan
    Kaufmann Publishers Inc., San Francisco, CA, USA, 143-151.

  * Ella Bingham and Heikki Mannila. 2001.
    `Random projection in dimensionality reduction: applications to image and text data. <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.5135&rep=rep1&type=pdf>`_
    In Proceedings of the seventh ACM SIGKDD international conference on
    Knowledge discovery and data mining (KDD '01). ACM, New York, NY, USA,
    245-250.


.. _johnson_lindenstrauss:

The Johnson-Lindenstrauss lemma
===============================

The main theoretical result behind the efficiency of random projection is the
`Johnson-Lindenstrauss lemma (quoting Wikipedia)
<https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma>`_:

  In mathematics, the Johnson-Lindenstrauss lemma is a result
  concerning low-distortion embeddings of points from high-dimensional
  into low-dimensional Euclidean space. The lemma states that a small set
  of points in a high-dimensional space can be embedded into a space of
  much lower dimension in such a way that distances between the points are
  nearly preserved. The map used for the embedding is at least Lipschitz,
  and can even be taken to be an orthogonal projection.

Knowing only the number of samples, the
:func:`johnson_lindenstrauss_min_dim` function conservatively estimates the
minimal size of the random subspace needed to guarantee a bounded distortion
introduced by the random projection::

  >>> from sklearn.random_projection import johnson_lindenstrauss_min_dim
  >>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=0.5)
  663
  >>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=[0.5, 0.1, 0.01])
  array([    663,   11841, 1112658])
  >>> johnson_lindenstrauss_min_dim(n_samples=[1e4, 1e5, 1e6], eps=0.1)
  array([ 7894,  9868, 11841])

.. figure:: ../auto_examples/miscellaneous/images/sphx_glr_plot_johnson_lindenstrauss_bound_001.png
   :target: ../auto_examples/miscellaneous/plot_johnson_lindenstrauss_bound.html
   :scale: 75
   :align: center

.. figure:: ../auto_examples/miscellaneous/images/sphx_glr_plot_johnson_lindenstrauss_bound_002.png
   :target: ../auto_examples/miscellaneous/plot_johnson_lindenstrauss_bound.html
   :scale: 75
   :align: center
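
The bound can also be checked empirically. Below is a minimal sketch (an
illustration, not one of the library's shipped examples) that projects
random data with :class:`GaussianRandomProjection` and compares the squared
pairwise distances before and after the projection::

    import numpy as np
    from sklearn.metrics.pairwise import euclidean_distances
    from sklearn.random_projection import (
        GaussianRandomProjection,
        johnson_lindenstrauss_min_dim,
    )

    rng = np.random.RandomState(42)
    X = rng.rand(100, 10000)

    # Conservative target dimension for 100 samples and eps = 0.5.
    n_components = johnson_lindenstrauss_min_dim(n_samples=100, eps=0.5)
    transformer = GaussianRandomProjection(n_components=n_components,
                                           random_state=42)
    X_new = transformer.fit_transform(X)

    # The lemma bounds the ratio of projected to original squared
    # distances within [1 - eps, 1 + eps] with high probability.
    dist_orig = euclidean_distances(X, squared=True)
    dist_new = euclidean_distances(X_new, squared=True)
    nonzero = dist_orig != 0
    ratios = dist_new[nonzero] / dist_orig[nonzero]
    print(ratios.min(), ratios.max())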

.. topic:: Example:

  * See :ref:`sphx_glr_auto_examples_miscellaneous_plot_johnson_lindenstrauss_bound.py`
    for a theoretical explication of the Johnson-Lindenstrauss lemma and an
    empirical validation using sparse random matrices.

.. topic:: References:

  * Sanjoy Dasgupta and Anupam Gupta, 1999.
    `An elementary proof of the Johnson-Lindenstrauss Lemma.
    <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39.3334&rep=rep1&type=pdf>`_

.. _gaussian_random_matrix:

Gaussian random projection
==========================
The :class:`GaussianRandomProjection` reduces the
dimensionality by projecting the original input space onto a randomly
generated matrix whose components are drawn from the distribution
:math:`N(0, \frac{1}{n_{components}})`.

Here is a small excerpt which illustrates how to use the Gaussian random
projection transformer::

  >>> import numpy as np
  >>> from sklearn import random_projection
  >>> X = np.random.rand(100, 10000)
  >>> transformer = random_projection.GaussianRandomProjection()
  >>> X_new = transformer.fit_transform(X)
  >>> X_new.shape
  (100, 3947)


.. _sparse_random_matrix:

Sparse random projection
========================
The :class:`SparseRandomProjection` reduces the
dimensionality by projecting the original input space using a sparse
random matrix.

Sparse random matrices are an alternative to dense Gaussian random
projection matrices: they guarantee similar embedding quality while being
much more memory efficient and allowing faster computation of the projected
data.

If we define ``s = 1 / density``, the elements of the random matrix
are drawn from

.. math::

  \left\{
  \begin{array}{c c l}
  -\sqrt{\frac{s}{n_{\text{components}}}} & & 1 / 2s\\
  0 &\text{with probability} & 1 - 1 / s \\
  +\sqrt{\frac{s}{n_{\text{components}}}} & & 1 / 2s\\
  \end{array}
  \right.

where :math:`n_{\text{components}}` is the size of the projected subspace.
By default the density of non-zero elements is set to the minimum density
recommended by Ping Li et al.: :math:`1 / \sqrt{n_{\text{features}}}` (a way
to verify this on a fitted transformer is sketched after the references
below).

Here is a small excerpt which illustrates how to use the sparse random
projection transformer::

  >>> import numpy as np
  >>> from sklearn import random_projection
  >>> X = np.random.rand(100, 10000)
  >>> transformer = random_projection.SparseRandomProjection()
  >>> X_new = transformer.fit_transform(X)
  >>> X_new.shape
  (100, 3947)


.. topic:: References:

  * D. Achlioptas. 2003.
    `Database-friendly random projections: Johnson-Lindenstrauss with binary
    coins <http://www.cs.ucsc.edu/~optas/papers/jl.pdf>`_.
    Journal of Computer and System Sciences 66 (2003) 671–687.

  * Ping Li, Trevor J. Hastie, and Kenneth W. Church. 2006.
    `Very sparse random projections. <https://web.stanford.edu/~hastie/Papers/Ping/KDD06_rp.pdf>`_
    In Proceedings of the 12th ACM SIGKDD international conference on
    Knowledge discovery and data mining (KDD '06). ACM, New York, NY, USA,
    287-296.
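
As mentioned above, the concrete density chosen for the projection matrix
can be verified on a fitted transformer. Below is a minimal sketch (an
illustration, not one of the library's shipped examples) checking that the
default ``density='auto'`` setting matches the
:math:`1 / \sqrt{n_{\text{features}}}` recommendation::

    import numpy as np
    from sklearn.random_projection import SparseRandomProjection

    rng = np.random.RandomState(42)
    X = rng.rand(100, 10000)

    transformer = SparseRandomProjection(random_state=42)
    X_new = transformer.fit_transform(X)

    # The fitted projection matrix is stored as a scipy sparse matrix.
    components = transformer.components_
    fraction_nonzero = components.nnz / np.prod(components.shape)

    # With the default density='auto', both values should be close to
    # 1 / sqrt(n_features) = 1 / sqrt(10000) = 0.01.
    print(transformer.density_, fraction_nonzero)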