Introduction
Clustering algorithms belong to the family of unsupervised machine learning methods. As there is no target variable, the model learns from the input variables alone to discover intrinsic groups, or clusters, in the data.
Because we don’t have labels for the data, these groups are formed based on similarity between data points. This tutorial covers clustering concepts, techniques, and applications across domains like healthcare, retail, and manufacturing.
We’ll also walk through examples in R, using real-world data from a water treatment plant to apply our knowledge practically.
Table of Contents
- Types of Clustering Techniques
- Distance Calculation for Clustering
- K-Means Clustering
- Choosing the Best K in K-Means
- Hierarchical Clustering
- Evaluation Methods in Cluster Analysis
- Clustering in R – Water Treatment Plants
Types of Clustering Techniques
Common clustering algorithms include K-Means, Fuzzy C-Means, and Hierarchical Clustering. The right choice depends on the data type (numeric, categorical, or mixed). Clustering techniques can be classified as:
- Soft Clustering – Each observation is assigned to every cluster with a membership probability (see the sketch below).
- Hard Clustering – Observations belong to only one cluster.
We’ll focus on K-Means and Hierarchical Clustering in this guide.
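To see the soft/hard distinction concretely, here is a minimal sketch using fuzzy c-means from the e1071 package on toy iris data (the package and toy data are assumptions of this example, not part of the tutorial's dataset):
# Soft vs. hard assignments (assumes the e1071 package is installed)
library(e1071)
x <- scale(iris[, 1:4])             # toy numeric data
fcm <- cmeans(x, centers = 3)       # soft: fuzzy c-means
head(round(fcm$membership, 2))      # each row: membership probabilities across clusters
km <- kmeans(x, centers = 3)        # hard: each row gets exactly one label
head(km$cluster)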
Distance Calculation for Clustering
Distance metrics quantify how similar, or dissimilar, two data points are. Common metrics include (a short R sketch follows the list):
- Euclidean Distance – Suitable for numeric variables.
- Manhattan Distance – Sum of the absolute differences along each dimension.
- Hamming Distance – Used for categorical variables.
- Gower Distance – Handles mixed variable types.
- Cosine Similarity – Common in text analysis.
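As a minimal sketch, base R's dist() covers the Euclidean and Manhattan cases, and daisy() from the cluster package computes Gower distance for mixed variable types (iris is just toy data here):
# Distance computation sketch (base R plus the cluster package)
x <- scale(iris[, 1:4])
d_euc <- dist(x, method = "euclidean")   # numeric variables
d_man <- dist(x, method = "manhattan")   # sum of absolute differences
library(cluster)
d_gow <- daisy(iris, metric = "gower")   # mixed numeric and categorical columns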
K-Means Clustering
K-Means partitions data into k non-overlapping clusters. The algorithm proceeds as follows:
- Randomly initialize k centroids.
- Assign each observation to its nearest centroid.
- Recalculate each centroid as the mean of its assigned observations.
- Repeat the assignment and update steps until the assignments stop changing.
The objective being minimized is the total within-cluster variation, measured as the sum of squared Euclidean distances between each observation and its cluster centroid; the sketch below walks through one run of this loop.
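To make these steps concrete, here is a minimal base-R sketch of the k-means loop on toy 2-D data (illustrative only: it ignores edge cases such as empty clusters, and in practice you would call kmeans()):
# Minimal k-means loop on toy data (use kmeans() in practice)
set.seed(42)
x <- matrix(rnorm(100), ncol = 2)                # 50 toy observations
k <- 3
centroids <- x[sample(nrow(x), k), ]             # 1. random initial centroids
for (iter in 1:20) {
  # 2. assign each observation to its nearest centroid (squared Euclidean)
  d2 <- sapply(1:k, function(j) colSums((t(x) - centroids[j, ])^2))
  clusters <- max.col(-d2)
  # 3. recompute each centroid as the mean of its assigned points
  centroids_new <- t(sapply(1:k, function(j) colMeans(x[clusters == j, , drop = FALSE])))
  # 4. stop when the centroids no longer move
  if (all(abs(centroids_new - centroids) < 1e-8)) break
  centroids <- centroids_new
}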
Choosing the Best K in K-Means
Methods for selecting k include (the Elbow Method is sketched after this list):
- Cross Validation
- Elbow Method
- Silhouette Method
- X-Means Clustering
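As a minimal sketch of the Elbow Method (again on toy iris data rather than the water treatment set), plot total within-cluster sum of squares against k and look for the bend:
# Elbow method sketch: total within-cluster SS for k = 1..10
x <- scale(iris[, 1:4])
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")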
Hierarchical Clustering
This method creates a nested sequence of clusters using two approaches:
- Agglomerative – Bottom-up merging.
- Divisive – Top-down splitting.
Dendrograms visualize the clustering hierarchy. A horizontal cut across the dendrogram reveals the number of clusters.
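The R walkthrough later in this guide uses agglomerative clustering via hclust(); for the divisive direction, a minimal sketch with diana() from the cluster package (toy iris data and 3 clusters are assumptions of the example) looks like this:
# Divisive (top-down) clustering sketch with cluster::diana
library(cluster)
x <- scale(iris[, 1:4])
dv <- diana(x, metric = "euclidean")
pltree(dv)                                 # plot the clustering tree (dendrogram)
groups_dv <- cutree(as.hclust(dv), k = 3)  # horizontal cut into 3 clusters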
Evaluation Methods in Cluster Analysis
Clustering evaluation methods fall into two families (a silhouette example follows the list):
- Internal Measures – Based on compactness and separation (e.g., SSE, Scatter Criteria).
- External Measures – Based on known labels (e.g., Rand Index, Precision-Recall).
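As one concrete internal measure, this sketch computes the average silhouette width with the cluster package (the toy data and the 3-cluster k-means solution are assumptions of the example):
# Internal evaluation sketch: average silhouette width
library(cluster)
x <- scale(iris[, 1:4])
km <- kmeans(x, centers = 3, nstart = 10)
sil <- silhouette(km$cluster, dist(x))
summary(sil)$avg.width   # values near 1 indicate compact, well-separated clusters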
Clustering in R – Water Treatment Plants
We use the Water Treatment Plant dataset from the UCI Machine Learning Repository to demonstrate both hierarchical and k-means clustering.
# Load and preprocess data
library(data.table)
library(ggplot2)
library(fpc)
water_data <- read.table("water-treatment.data.txt", sep = ",", header = FALSE, na.strings = "?")
setDT(water_data)
# Impute missing values
for (i in colnames(water_data)[-1]) {
  set(water_data, which(is.na(water_data[[i]])), i, median(water_data[[i]], na.rm = TRUE))
}
# Scale numeric features
scaled_wd <- scale(water_data[,-1, with = FALSE])
Next, we perform hierarchical clustering with Euclidean distance and Ward's linkage. We plot the dendrogram, pick the number of clusters with a horizontal cut, and then use PCA to visualize the resulting clusters.
# Hierarchical clustering
d <- dist(scaled_wd, method = "euclidean")
h_clust <- hclust(d, method = "ward.D2")
plot(h_clust, labels = water_data$V1)
# Cut the dendrogram into 4 clusters
rect.hclust(h_clust, k = 4)       # draw boxes around the 4 clusters on the plot
groups <- cutree(h_clust, k = 4)  # extract the cluster assignment of each observation
Principal components are used for cluster visualization:
# PCA for visualization
pcmp <- princomp(scaled_wd)
pred_pc <- predict(pcmp)[,1:2]
comp_dt <- cbind(as.data.table(pred_pc), cluster = as.factor(groups), Labels = water_data$V1)
ggplot(comp_dt, aes(Comp.1, Comp.2)) +
  geom_point(aes(color = cluster), size = 3)
Finally, k-means clustering is applied and visualized on the same PCA components. The two solutions look visually consistent; the cross-tabulation after the code makes the comparison explicit.
# K-means clustering (seed set here only so the random initialization is reproducible)
set.seed(123)
kclust <- kmeans(scaled_wd, centers = 4, iter.max = 100, nstart = 25)  # nstart: multiple random starts for a stabler solution
ggplot(comp_dt, aes(Comp.1, Comp.2)) +
  geom_point(aes(color = as.factor(kclust$cluster)), size = 3)
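Beyond the visual check, a quick cross-tabulation of the two solutions makes the agreement explicit (cluster labels are arbitrary, so agreement shows up as one dominant count per row):
# Compare hierarchical and k-means assignments
table(hierarchical = groups, kmeans = kclust$cluster)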