---
layout: global
title: Clustering
displayTitle: Clustering
---

This page describes clustering algorithms in MLlib.
The [guide for clustering in the RDD-based API](mllib-clustering.html) also has relevant information
about these algorithms.

**Table of Contents**

* This will become a table of contents (this text will be scraped).
{:toc}

## K-means

[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the
most commonly used clustering algorithms. It partitions the data points into a
predefined number of clusters. The MLlib implementation includes a parallelized
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
called [k-means||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).

`KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model.
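
To make the clustering step concrete, here is a minimal pure-Python sketch of standard Lloyd iterations, the assign-then-recompute loop that k-means is built on. It is not MLlib's implementation: it runs on a single machine and seeds the centers by plain random sampling rather than k-means||. The names `closest`, `kmeans`, `max_iter`, and `seed` are made up for this illustration.

```python
import math
import random


def closest(point, centers):
    """Return the index of the center nearest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def kmeans(points, k, max_iter=20, seed=1):
    """Plain Lloyd iterations with random seeding (an assumption, not k-means||)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: group the points by their nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its group.
        # An empty group keeps its previous center.
        new_centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers


points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers = kmeans(points, k=2)
predictions = [closest(p, centers) for p in points]
```

In this sketch, `predictions` holds one integer per input point: the index of the cluster that the point is assigned to.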

### Input Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>featuresCol</td>
      <td>Vector</td>
      <td>"features"</td>
      <td>Feature vector</td>
    </tr>
  </tbody>
</table>

### Output Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>predictionCol</td>
      <td>Int</td>
      <td>"prediction"</td>
      <td>Index of the predicted cluster</td>
    </tr>
  </tbody>
</table>

### Example

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details.

{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans) for more details.

{% include_example python/ml/kmeans_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.kmeans.html) for more details.

{% include_example r/ml/kmeans.R %}
</div>

</div>

## Latent Dirichlet allocation (LDA)

`LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`,
and generates an `LDAModel` as the base model. Expert users may cast an `LDAModel` generated by
`EMLDAOptimizer` to a `DistributedLDAModel` if needed.

<div class="codetabs">

<div data-lang="scala" markdown="1">

Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.LDA) for more details.

{% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}
</div>

<div data-lang="java" markdown="1">

Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/LDA.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %}
</div>

<div data-lang="python" markdown="1">

Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.LDA) for more details.

{% include_example python/ml/lda_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.lda.html) for more details.

{% include_example r/ml/lda.R %}
</div>

</div>

## Bisecting k-means

Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) using a
divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one
moves down the hierarchy.

Bisecting k-means can often be much faster than regular k-means, but it will generally produce a different clustering.

`BisectingKMeans` is implemented as an `Estimator` and generates a `BisectingKMeansModel` as the base model.
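
The divisive procedure described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the `BisectingKMeans` implementation: it always splits the cluster with the largest sum of squared errors, seeds each split with two far-apart points, and runs on a single machine. The helper names (`mean`, `sse`, `bisect`, `bisecting_kmeans`) are hypothetical.

```python
import math


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean(points)
    return sum(math.dist(p, center) ** 2 for p in points)


def bisect(points, max_iter=20):
    """Split one cluster into two with a small 2-means run."""
    # Assumed seeding: the point farthest from the mean, then the point
    # farthest from that one.
    center = mean(points)
    first = max(points, key=lambda p: math.dist(p, center))
    second = max(points, key=lambda p: math.dist(p, first))
    centers = [first, second]
    left, right = list(points), []
    for _ in range(max_iter):
        left = [p for p in points
                if math.dist(p, centers[0]) <= math.dist(p, centers[1])]
        right = [p for p in points
                 if math.dist(p, centers[0]) > math.dist(p, centers[1])]
        if not left or not right:
            break  # the cluster cannot be split (for example, identical points)
        new_centers = [mean(left), mean(right)]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return left, right


def bisecting_kmeans(points, k):
    """Start with one cluster and split until there are k clusters."""
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break
        # Assumed policy: split the cluster with the largest error.
        target = max(candidates, key=sse)
        left, right = bisect(target)
        if not left or not right:
            break
        clusters.remove(target)
        clusters += [left, right]
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0), (20.0, 0.0), (20.0, 1.0)]
clusters = bisecting_kmeans(points, k=3)
```

On this toy data the first split separates the distant pair near `x = 20` from the rest, and the second split divides the remaining four points into the pairs at `x = 0` and `x = 4`, which is the top-down hierarchy the text describes.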

### Example

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.BisectingKMeans) for more details.

{% include_example scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans) for more details.

{% include_example python/ml/bisecting_k_means_example.py %}
</div>
</div>

## Gaussian Mixture Model (GMM)

A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
each with its own probability. The `spark.ml` implementation uses the
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
algorithm to estimate the maximum-likelihood model given a set of samples.

`GaussianMixture` is implemented as an `Estimator` and generates a `GaussianMixtureModel` as the base
model.
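
As a rough illustration of expectation-maximization for a mixture, the sketch below fits a one-dimensional Gaussian mixture in pure Python. It is a simplified stand-in, not the `GaussianMixture` implementation: it handles univariate data only, starts from caller-supplied means with unit variances and equal weights, and runs a fixed number of iterations. The function names and the `min_var` guard are assumptions made for this example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """E-step for one point: posterior probability of each component."""
    joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


def fit_gmm_1d(xs, init_means, n_iter=50, min_var=1e-6):
    """Fit a k-component 1-D mixture with EM, starting from `init_means`."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: soft-assign every point to every component.
        resp = [responsibilities(x, weights, means, variances) for x in xs]
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, min_var)  # guard against a collapsing component
    return weights, means, variances


xs = [-0.2, 0.0, 0.2, 4.8, 5.0, 5.2]
weights, means, variances = fit_gmm_1d(xs, init_means=[1.0, 4.0])

# Per-point outputs: a probability for each component and the most likely one.
probability = responsibilities(0.1, weights, means, variances)
prediction = max(range(len(probability)), key=probability.__getitem__)
```

The two per-point values at the end mirror the idea of a probability vector over clusters and a single predicted cluster index taken as its arg max.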

### Input Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>featuresCol</td>
      <td>Vector</td>
      <td>"features"</td>
      <td>Feature vector</td>
    </tr>
  </tbody>
</table>

### Output Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>predictionCol</td>
      <td>Int</td>
      <td>"prediction"</td>
      <td>Index of the predicted cluster</td>
    </tr>
    <tr>
      <td>probabilityCol</td>
      <td>Vector</td>
      <td>"probability"</td>
      <td>Probability of the point belonging to each cluster</td>
    </tr>
  </tbody>
</table>

### Example

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.GaussianMixture) for more details.

{% include_example scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/GaussianMixture.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaGaussianMixtureExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.GaussianMixture) for more details.

{% include_example python/ml/gaussian_mixture_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.

{% include_example r/ml/gaussianMixture.R %}
</div>

</div>
