---
layout: global
title: Clustering
displayTitle: Clustering
---

This page describes clustering algorithms in MLlib.
The [guide for clustering in the RDD-based API](mllib-clustering.html) also has relevant information
about these algorithms.

**Table of Contents**

* This will become a table of contents (this text will be scraped).
{:toc}

## K-means

[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the
most commonly used clustering algorithms; it partitions the data points into a
predefined number of clusters. The MLlib implementation includes a parallelized
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
called [k-means||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).

`KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model.

### Input Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>featuresCol</td>
      <td>Vector</td>
      <td>"features"</td>
      <td>Feature vector</td>
    </tr>
  </tbody>
</table>

### Output Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>predictionCol</td>
      <td>Int</td>
      <td>"prediction"</td>
      <td>Index of the predicted cluster</td>
    </tr>
  </tbody>
</table>

### Example

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details.

{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans) for more details.

{% include_example python/ml/kmeans_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.kmeans.html) for more details.

{% include_example r/ml/kmeans.R %}
</div>

</div>

## Latent Dirichlet allocation (LDA)

`LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`,
and generates an `LDAModel` as the base model. Expert users may cast an `LDAModel` generated by
`EMLDAOptimizer` to a `DistributedLDAModel` if needed.

<div class="codetabs">

<div data-lang="scala" markdown="1">

Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.LDA) for more details.

{% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}
</div>

<div data-lang="java" markdown="1">

Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/LDA.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %}
</div>

<div data-lang="python" markdown="1">

Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.LDA) for more details.

{% include_example python/ml/lda_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.lda.html) for more details.

{% include_example r/ml/lda.R %}
</div>

</div>

## Bisecting k-means

Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) using a
divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one
moves down the hierarchy.

Bisecting k-means can often be much faster than regular k-means, but it will generally produce a different clustering.

`BisectingKMeans` is implemented as an `Estimator` and generates a `BisectingKMeansModel` as the base model.

### Example

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.BisectingKMeans) for more details.

{% include_example scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans) for more details.

{% include_example python/ml/bisecting_k_means_example.py %}
</div>
</div>

## Gaussian Mixture Model (GMM)

A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
each with its own probability. The `spark.ml` implementation uses the
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
algorithm to induce the maximum-likelihood model given a set of samples.
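
To make the expectation-maximization idea concrete, the following is a minimal pure-Python sketch of EM for a
one-dimensional Gaussian mixture. It is an illustration under simplifying assumptions, not the `spark.ml`
implementation, which works on multivariate feature vectors and runs distributed. The names `em_gmm_1d`,
`responsibilities`, and `predict`, the initialization from caller-supplied means, the fixed iteration count, and the
`min_var` floor are all hypothetical choices made for this example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """Posterior probability of each component given the point x."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    if total == 0.0:
        # Every density underflowed to zero; fall back to a uniform assignment.
        return [1.0 / len(joint)] * len(joint)
    return [p / total for p in joint]


def em_gmm_1d(data, initial_means, num_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM (illustrative sketch only).

    Returns a tuple (weights, means, variances), one entry per component.
    """
    k = len(initial_means)
    n = len(data)
    means = list(initial_means)
    weights = [1.0 / k] * k
    # Assumption: every component starts with the overall sample variance.
    overall_mean = sum(data) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in data) / n, min_var)
    variances = [overall_var] * k

    for _ in range(num_iter):
        # E-step: compute the responsibility of each component for each point.
        resp = [responsibilities(x, weights, means, variances) for x in data]

        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue  # Component received no mass; keep its old parameters.
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # Floor guards against collapse.

    return weights, means, variances


def predict(x, weights, means, variances):
    """Return (index of the most likely component, membership probabilities)."""
    probs = responsibilities(x, weights, means, variances)
    return probs.index(max(probs)), probs


if __name__ == "__main__":
    samples = [-0.2, -0.1, 0.0, 0.1, 0.2, 9.8, 9.9, 10.0, 10.1, 10.2]
    w, m, v = em_gmm_1d(samples, initial_means=[1.0, 9.0])
    print("weights:", w)
    print("means:", m)
    print("variances:", v)
    print("prediction for 0.05:", predict(0.05, w, m, v))
```

The pair returned by `predict` plays the same role as the prediction and probability outputs of a fitted mixture
model: the first element is a cluster index, and the second is the vector of per-cluster membership probabilities.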

`GaussianMixture` is implemented as an `Estimator` and generates a `GaussianMixtureModel` as the base
model.

### Input Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>featuresCol</td>
      <td>Vector</td>
      <td>"features"</td>
      <td>Feature vector</td>
    </tr>
  </tbody>
</table>

### Output Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>predictionCol</td>
      <td>Int</td>
      <td>"prediction"</td>
      <td>Index of the predicted cluster</td>
    </tr>
    <tr>
      <td>probabilityCol</td>
      <td>Vector</td>
      <td>"probability"</td>
      <td>Probability of each cluster</td>
    </tr>
  </tbody>
</table>

### Example

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.GaussianMixture) for more details.

{% include_example scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/GaussianMixture.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaGaussianMixtureExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.GaussianMixture) for more details.

{% include_example python/ml/gaussian_mixture_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.

{% include_example r/ml/gaussianMixture.R %}
</div>

</div>