🪴Digital Garden

Search

❯

artificial intelligence

❯

machine learning

❯

❯

K-means Clustering

K-means Clustering

Jul 19, 20241 min read

Algorithm

select k clusters
initialize a random k points as cluster means
- randomize, or
- k uniformly distributed within the range of distribution, or
- perform k-means on a subset of data and use the result as the initial means
assign each point to the nearest mean
recompute the mean of each cluster
repeat till convergence
- when the means stop moving, or
- change in means < some threshold

How to pick K?

Elbow method: Plot the cost (sum of squared distances from each point to its cluster centroid) for different values of k. The optimal k is the “elbow” point where the cost starts decreasing linearly
Silhouette Analysis: measures how well it fits into its assigned cluster compared to other clusters
Calinski-Harabasz Index: ratio of the between-cluster dispersion and the within-cluster dispersion. Scales better, as each pair distance is not needed.
Davies-Bouldin Index: measures the average similarity between each cluster and its most similar one. Scales better, as each pair distance is not needed.
Dunn Index: calculates the ratio of the smallest inter-cluster distance to the largest intra-cluster distance

Cons

need to pick k
sensitive to initialization
sensitive to outliers

Scale

MiniBatchKMeans

Graph View

Algorithm
How to pick K?
Cons
Scale

Backlinks

Image Segmentation
Clustering

Created with Quartz v4.2.3 © 2024

GitHub
Discord Community