
Overview of Clustering
Clustering techniques are a type of unsupervised machine learning algorithm. As the name suggests, they are used to group similar data points together. A good grouping is one where the distance between the different groups is high and the distance between the data points inside each cluster is low. These are known as the inter-class distance (between the different groups) and the intra-class distance (within each group).
Clustering techniques determine similarity through distance, which can be measured in a number of ways, such as with the Manhattan, Euclidean, or Minkowski distance.
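To make those three concrete, here is a minimal sketch using SciPy's distance functions (the two sample points are made up purely for illustration):

# Manhattan, Euclidean, and Minkowski distances via SciPy
from scipy.spatial import distance

a = [1.0, 2.0]
b = [4.0, 6.0]

print(distance.cityblock(a, b))       # Manhattan: |1-4| + |2-6| = 7.0
print(distance.euclidean(a, b))       # Euclidean: sqrt(3^2 + 4^2) = 5.0
print(distance.minkowski(a, b, p=3))  # Minkowski; p=1 and p=2 recover the two above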
There are many types of clustering techniques that you can employ. The distinction I want to make in this blog is between hierarchical and nonhierarchical algorithms.
Looking into hierarchical algorithms, there are two types that you can use: agglomerative and divisive. The difference is that the former starts with n clusters, where n is the number of observations, and repeatedly combines the two most similar clusters, while the latter starts with one giant cluster and repeatedly divides it until the desired number of clusters is reached. A sketch of the agglomerative approach is shown below.
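As a rough sketch, scikit-learn ships an agglomerative implementation (divisive clustering is less commonly found in libraries); the sample points here are made up for illustration:

# Agglomerative clustering: merge the most similar clusters until 2 remain
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1.0, 1.0], [1.5, 1.0], [0.5, 1.5], [8.0, 8.0], [8.0, 8.5]])

model = AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1] -- the two tight groups emerge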
A nonhierarchical algorithm requires you to choose the number of clusters up front. The center of each cluster is typically initialized at random, and the centers then move iteratively until optimal positions are reached for the k clusters.
A closer look at K-means
K-Means Clustering is a type of nonhierarchical algorithm. This means we need to decide on the number of clusters we want before we employ the technique. In many scenarios this is a hard decision, as you often won't know how many clusters your data should contain. To help you evaluate how many clusters, or which value of k, to settle on, there are several metrics you can look at before finalizing your decision. Among them are the Calinski-Harabasz score, the elbow plot, and the silhouette score.
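Before looking at the metrics, here is a minimal K-Means run using scikit-learn, on toy data generated with make_blobs (all parameters here are illustrative, not a recommendation):

# Fit K-Means with a fixed choice of k on toy data
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)  # final positions of the 3 centers
print(kmeans.labels_[:10])      # cluster assignments for the first 10 points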
Calinski-Harabasz Score
The Calinski-Harabasz score evaluates a clustering based on the ratio of the between-cluster sum of squares to the within-cluster sum of squares. The higher the score, the better the clustering fits your data. This rings true with what we talked about earlier: you want to maximize the distance between different clusters and minimize the distance between data points within each cluster.
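In scikit-learn this is a one-liner; here is a sketch that compares the score across candidate values of k (the toy data and the range of k are illustrative):

# Compare Calinski-Harabasz scores across candidate k values; higher is better
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, calinski_harabasz_score(X, labels))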

Silhouette Score
The silhouette score is calculated from the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample: the per-sample silhouette is (b - a) / max(a, b), averaged over all samples. Values range from -1 to 1, where the higher the value, the better the clustering fit is.
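As with the previous metric, scikit-learn provides this directly; the same illustrative setup works here:

# Compare silhouette scores across candidate k values; closer to 1 is better
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))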

Elbow Plot
An elbow plot is a general term for a plot with a clustering metric (commonly inertia, the within-cluster sum of squares) on the y-axis and the value of k on the x-axis. Its purpose is to find the point of diminishing returns for naturally occurring clusters in a data set. The value of k is read off the point where the plot bends like an arm's elbow.
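One common version plots K-Means inertia against k; here is a sketch with the same illustrative toy data:

# Elbow plot: inertia versus k; look for the bend in the curve
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.title("Elbow plot")
plt.show()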

Conclusion
Determining the optimal number of clusters in your data depends on the circumstances; there is no single right answer that fits every problem. However, with these three metrics, you will have more information for deciding what number of clusters makes the most sense for you, your data, and the story you are trying to tell.