
Chapter 4: Clustering

Topic 4: Finding Hidden Groups — Clustering Methods

ISOM 2600 — Business Statistics with Python

Professor Xuhu Wan · HKUST Business School


From Supervised to Unsupervised Learning

In the first three chapters of this course you built a predictive machine: you gathered features \(X_1, X_2, \ldots, X_p\), attached a label \(Y\), and trained a model that maps inputs to outputs. That is supervised learning — the algorithm learns under the supervision of known answers.

This chapter marks a conceptual turning point. There is no \(Y\). No house-price target, no return to predict, no label telling you whether a customer will churn. Instead, the question becomes: are there natural groups hiding in the data? This is the domain of unsupervised learning, and clustering is its most widely deployed tool.

A Brief History

The mathematics of clustering has surprisingly deep roots. Euclidean distance — the straight-line metric that underlies almost every clustering algorithm — traces directly to ancient Greek geometry and Pythagoras’s theorem. But the computational story begins in earnest in the mid-20th century.

K-Means was invented by Stuart Lloyd at Bell Labs in 1957. Lloyd was working on signal quantization — how to compress a continuous signal into a small number of discrete levels while minimizing distortion. His iterative assign-and-update algorithm turned out to be a universal tool for partitioning any numerical data. The method stayed internal to Bell Labs for two decades before being published in 1982. In the meantime, James MacQueen rediscovered and named the method “K-Means” in 1967, and that name stuck. Today K-Means is arguably the most widely used clustering algorithm in industry, valued for its speed and interpretability.

Hierarchical clustering has a parallel lineage. Ward’s method — which minimizes within-cluster variance at each merge step and remains the recommended default — was published by Joe Ward in 1963. Stephen Johnson formalized the full agglomerative framework in 1967, the same year MacQueen named K-Means. The dendrogram visualization, which makes hierarchical clustering’s output so readable, was already standard in biological taxonomy long before statisticians adopted it.

Why Clustering Matters in Business

The business case for clustering is straightforward: heterogeneous populations require heterogeneous strategies. The phrase “the average customer” is usually meaningless — averages erase the variation that matters. A retailer whose customers range from bargain-hunters to luxury enthusiasts should not send the same promotion to everyone.

Customer segmentation is the canonical application. By grouping customers into clusters with similar behavioral and demographic profiles, a company can tailor its messaging, pricing, product assortment, and channel strategy to each segment. The impact can be dramatic: segment-specific campaigns routinely outperform generic mass marketing by 20–50% on conversion rates.

The technique extends far beyond marketing. Netflix groups its 200+ million subscribers into approximately 2,000 “taste clusters” based on viewing patterns; each cluster gets a different thumbnail, title, and recommendation sequence. Spotify’s Discover Weekly playlist — one of the company’s most-loved features — is built on clustering listeners by listening history and then surfacing tracks that similar listeners enjoyed. Banks use clustering for credit-risk segmentation: Basel III explicitly encourages banks to segment their retail portfolios into homogeneous risk groups for capital allocation. Hospitals cluster patient phenotypes to identify subgroups that respond differently to the same treatment, enabling personalized medicine at scale. HR departments cluster employees by performance and engagement metrics to identify flight-risk segments before they resign.

What You Will Learn — and Why It Matters for Your Career

This chapter covers two foundational clustering methods:

  1. K-Means — the workhorse of industry clustering; fast, interpretable, scalable
  2. Hierarchical Clustering — slower but richer; produces a dendrogram that reveals the full merging structure without requiring you to specify the number of groups in advance

Along the way you will master distance measures, standardization, the elbow method for choosing \(K\), and the art of interpreting cluster profiles. You will apply both methods to two real datasets: Forbes financial data (can the algorithm discover industries from financial ratios alone?) and mall customer data (can we segment shoppers into actionable groups?).

Course weight vs. industry leverage

Topic 4 carries 10% of the course grade — lighter than regression. But in industry, clustering is among the most frequently deployed analytics tools, precisely because it requires no labels and produces outputs that non-technical stakeholders can immediately understand and act on. The ability to segment customers, products, markets, or employees into coherent groups is a core data science skill that appears in virtually every analytics job description.

A conceptual shift from Topics 1–3: we are moving from supervised to unsupervised learning. No more \(Y\) variable — we discover structure in data without labels.

Outline

  1. Why Clustering?
  2. How Do We Measure Similarity?
  3. K-Means Clustering
  4. Hierarchical Clustering
  5. Case Study: Forbes Financial Data
  6. Case Study: Customer Segmentation
  7. Summary

Why Clustering?

Why Clustering? — Business Motivation

Clustering is everywhere in business:

  • 💻 Netflix segments viewers into taste groups to recommend movies
  • 💻 Spotify creates playlist groups from listening behaviour
  • 💻 Banks segment credit-risk customers for pricing and marketing
  • 💻 Retailers group products by purchasing patterns

Imagine a scatter plot with three clearly separated point clouds — Group A, Group B, Group C — sitting in different regions of (Feature 1, Feature 2) space.

Why it matters

In supervised learning (regression, Topics 1–3) we had labels \(Y\). In clustering we discover natural groups with no labels — this is unsupervised learning!

The Clustering Workflow

The five-step pipeline:

1. Collect Data → 2. Standardize Features → 3. Choose \(K\) → 4. Run Clustering → 5. Interpret Clusters

  • Step 2 is critical — variables on different scales distort distances
  • Step 3 uses the Elbow Plot (K-Means) or Dendrogram (Hierarchical)
  • Step 5 is where business value lives — naming and acting on segments

Key takeaway

Clustering = finding natural groups in data without a target variable. Today we learn two methods: K-Means and Hierarchical Clustering.

Clustering in Everyday Life

You already cluster things every day:

  • 🎵 Music playlist: You sort songs into “Workout,” “Chill,” “Study” — songs in the same playlist sound similar to each other
  • 👕 Sorting laundry: Whites, darks, colours — you group by a feature (colour) without anyone telling you the groups
  • 🍽️ Grocery store aisles: Dairy, produce, snacks — items that “belong together” are placed together

In each case, you look at features (tempo, colour, food type) and group items that are similar.

Three laundry baskets — Whites, Darks, Colours — illustrate that no one labels the items; you discover the groups yourself.

Key takeaway

Clustering is the same idea applied to data: the computer looks at numbers (features) and groups data points that are close to each other. The key question is: how do we measure “closeness”?

How Do We Measure Similarity?

Background: Measuring Similarity

Every clustering algorithm rests on a single foundational question: how do we define “similar”? Before any algorithm can group points, it needs a ruler — a distance measure that assigns a number to every pair of observations. Smaller number means more similar; larger number means more different. The choice of ruler profoundly shapes the clusters you obtain.

Euclidean distance is the oldest and most natural metric. It is precisely Pythagoras’s theorem generalized to \(p\) dimensions: the distance between two points is the square root of the sum of squared coordinate differences. For two customers differing only in age, Euclidean distance reduces to the absolute age difference. For two customers differing in age, income, and spending score simultaneously, it combines all three differences into a single number via the Pythagorean formula.

Manhattan distance (also called the L1 norm or taxicab distance) was formalized by Hermann Minkowski in the early 20th century as part of a broader family of \(L_p\) norms. The name “taxicab” reflects the geometry of a grid city: a taxi can only travel along streets, not diagonally, so the distance between two intersections is the sum of the horizontal and vertical displacements, not the straight-line flight distance. Manhattan distance is less sensitive to large individual differences on one feature — it does not square the deviations — making it more robust to outliers in high-dimensional settings.

Correlation distance (\(1 - \text{corr}(A,B)\)) is conceptually different: it measures whether two observations have the same shape or profile across features, regardless of their absolute magnitude. Two companies with identical PE ratios and growth rates will have a correlation distance of zero even if one is twice the size of the other. This is particularly valuable in financial analysis where you care about pattern similarity rather than absolute level.

Standardization is not itself a distance measure but a pre-processing step that makes distance measures meaningful. Without it, features measured in dollars (income = 40,000–140,000) will numerically dominate features measured in single digits (spending score = 1–100), simply because the scale is larger. The algorithm will effectively ignore the small-scale features, even if they are economically important. Standardizing every feature to mean zero and unit variance — the z-score transform — gives each feature an equal vote in the distance computation.

Warning

The single biggest clustering mistake in practice is skipping standardization. If you cluster Age (20–70) and Annual Income ($15,000–$140,000) without standardizing, the income variable dominates the distance calculation by a factor of roughly 1,000. The algorithm will produce clusters that are almost entirely determined by income, as if age does not exist. Always standardize first.

Euclidean Distance

For two points \(A=(a_1,\dots,a_p)\) and \(B=(b_1,\dots,b_p)\):

\[ d(A,B) = \sqrt{(a_1-b_1)^2 + (a_2-b_2)^2 + \cdots + (a_p-b_p)^2} \]

Example (2 features):

  • Customer \(A\): Age = 35, Income = 60k
  • Customer \(B\): Age = 28, Income = 80k

\[ d = \sqrt{(35-28)^2 + (60-80)^2} = \sqrt{49+400} = \sqrt{449} \approx 21.2 \]

Interpretation: The distance of 21.2 is a raw number measured in the same mixed units as your features — part years of age, part thousands of dollars of income. By itself, 21.2 tells you nothing useful until you compare it to the distances between other pairs of customers. If most customer pairs have distances between 5 and 15, then 21.2 is a relatively large gap and these two customers are quite different. If most pairs have distances between 50 and 200, then 21.2 means they are actually quite similar. This is precisely why Euclidean distance is most informative after standardization: once all features are on a common z-score scale, a distance of 1.0 always means “roughly one standard deviation apart across all features,” giving you a calibrated ruler.

Warning

Pitfall: Do not interpret raw Euclidean distances as meaningful numbers without first understanding the scale of your features. A distance of 21.2 when Age is in years and Income is in thousands is dominated almost entirely by the income dimension (\((60-80)^2 = 400\) vs. \((35-28)^2 = 49\)). Standardize first.

Euclidean Distance: Step-by-Step Hand Calculation

Let us slow down and compute the distance between two customers, one step at a time.

Age Income ($k) Spending Score
Customer A 25 40 70
Customer B 30 55 50

Step 1: Subtract each feature of \(B\) from \(A\): \[(25-30),\quad (40-55),\quad (70-50) = -5,\quad -15,\quad 20\]

Step 2: Square each difference: \[(-5)^2,\quad (-15)^2,\quad (20)^2 = 25,\quad 225,\quad 400\]

Step 3: Add the squared differences: \(25 + 225 + 400 = 650\)

Step 4: Square root: \(d(A,B) = \sqrt{650} \approx 25.5\)

Important

Notice that Spending Score (difference of 20) and Income (difference of 15) contribute much more than Age (difference of 5). Features with larger numbers dominate the distance — this is why we need standardization!
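The four steps above can be checked in a few lines of NumPy (a sketch; the values come from the table of Customers A and B):

```python
import numpy as np

# Customers A and B: (Age, Income in $k, Spending Score)
a = np.array([25, 40, 70])
b = np.array([30, 55, 50])

diff = a - b           # Step 1: [-5, -15, 20]
squared = diff ** 2    # Step 2: [25, 225, 400]
total = squared.sum()  # Step 3: 650
d = np.sqrt(total)     # Step 4: ~25.5

print(d)               # same result as np.linalg.norm(a - b)
```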

Other Distance Measures

Distance Formula When to use
Euclidean \(\sqrt{\sum (a_i - b_i)^2}\) Default; continuous features
Manhattan \(\sum \lvert a_i - b_i \rvert\) Robust to outliers; grid-like data
Correlation \(1 - \text{corr}(A,B)\) When shape matters, not magnitude

Important

Different distance metrics can produce very different clusters! Always think about what “similar” means for your business problem.
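All three metrics in the table are available in scipy.spatial.distance. A quick sketch comparing them on the same two customers from the hand calculation:

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, correlation

a = np.array([25.0, 40.0, 70.0])
b = np.array([30.0, 55.0, 50.0])

print(euclidean(a, b))    # sqrt(25 + 225 + 400) ~ 25.5
print(cityblock(a, b))    # |-5| + |-15| + |20| = 40  (Manhattan / L1)
print(correlation(a, b))  # 1 - corr(a, b): 0 when the two profiles have identical shape
```

Note that the three numbers are not comparable to each other — each metric is its own ruler, so pick one and use it consistently within an analysis.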

Why Standardize?

A simple two-feature example is the clearest motivation. The raw data has two real clusters separated top vs. bottom — a horizontal cut would correctly divide them. But the \(x\)-axis is much wider than the \(y\)-axis, so a small horizontal gap of \(\sim 0.4\) contributes far more to Euclidean distance than the larger vertical gap of \(\sim 0.5\). K-Means follows the dominant axis and cuts the data left vs. right with a vertical boundary — exactly the wrong split. After standardizing each feature to the same scale, the vertical gap finally counts as much as the horizontal one, and K-Means recovers the correct top/bottom split.

Problem: Without standardizing, the axis with larger numbers dominates K-Means.

Solution: \(z_i = \dfrac{x_i - \bar{x}}{s_x}\) (mean \(=0\), std \(=1\))

Interpretation: After StandardScaler, the mean of each column is 0.00 and the standard deviation of each column is 1.00 (up to numerical precision). The customer who was 45 years old is now +1.28 standard deviations above the mean age. The customer who earned $40,000 is now −1.25 standard deviations below the mean income. Now when you compute Euclidean distance between two customers, each feature contributes proportionally to how unusual the difference is — not proportionally to the raw units. A 5-year age difference and a $12,000 income difference each contribute roughly the same amount to the total distance, which is a far more defensible definition of “similar.”

In practice

StandardScaler is the default choice for clustering. An alternative is MinMaxScaler, which maps each feature to the range \([0, 1]\) instead of standardizing to zero mean and unit variance. Min-max scaling is sensitive to outliers (one extreme value compresses everything else into a small range), so z-score standardization is generally preferred for clustering unless you have a specific reason to prefer bounded features — for example, when feeding data to a neural network with bounded activation functions.
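A minimal sketch of the z-score transform with StandardScaler (the four-customer dataset here is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: (Age in years, Annual Income in $)
X = np.array([[25,  40_000],
              [35,  75_000],
              [45,  60_000],
              [55, 120_000]])

X_std = StandardScaler().fit_transform(X)

print(X_std.mean(axis=0))  # each column ~0.00 after scaling
print(X_std.std(axis=0))   # each column ~1.00 after scaling
```

This is exactly \(z_i = (x_i - \bar{x})/s_x\) applied column by column, so both features now get an equal vote in any distance computation.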

Key takeaway

Always standardize before clustering so every feature gets an equal vote.

K-Means Clustering

Background: The K-Means Algorithm

K-Means is deceptively simple. The entire algorithm reduces to two alternating steps repeated until nothing changes: assign each point to its nearest centroid, then update each centroid to the mean of its members. This assign-update cycle is guaranteed to terminate — the objective function (WCSS, or Within-Cluster Sum of Squares) strictly decreases at each step and is bounded below by zero, so the algorithm must eventually converge.

The convergence guarantee comes with an important caveat: K-Means converges to a local minimum, not necessarily the global one. The final clustering depends on which \(K\) points were chosen as initial centroids. Two runs of the algorithm on the same data with different random seeds can produce different final clusters, each locally optimal but potentially far from each other in quality. This is why scikit-learn’s default is n_init=10: run the algorithm 10 times with different random initializations and keep the best result (the one with the lowest WCSS). For important analyses, increase n_init to 20 or 50.

A more sophisticated initialization strategy called K-Means++ (Arthur and Vassilvitskii, 2007) chooses initial centroids that are spread far apart from each other, dramatically reducing the probability of landing in a poor local minimum. Scikit-learn uses K-Means++ initialization by default (init='k-means++'), which is why you rarely need to increase n_init beyond 10 in practice.

Why does K-Means always converge? Each iteration can only change the cluster assignment of a point if a different centroid is now closer. Every such reassignment reduces WCSS (or leaves it unchanged). Since there are only finitely many possible partitions of \(n\) points into \(K\) groups, the algorithm cannot cycle and must stop. The formal proof uses the fact that WCSS is a Lyapunov function for the algorithm — a non-negative quantity that never increases — which is a standard tool in optimization theory.

The connection to Gaussian Mixture Models: K-Means is actually a special (hard-assignment, equal-variance) case of the Expectation-Maximization (EM) algorithm for fitting Gaussian Mixture Models (GMMs). Where K-Means assigns each point to exactly one cluster (hard assignment), a GMM assigns each point a probability of belonging to each cluster (soft assignment). If you need soft boundaries between clusters — for example, customers who are ambiguously between two segments — GMMs are the natural extension.

In practice

For most business applications with \(n < 100{,}000\) observations, K-Means with n_init=10 and K-Means++ initialization (sklearn’s default) is robust and fast. For very large datasets (\(n > 1{,}000{,}000\)), Mini-Batch K-Means (sklearn.cluster.MiniBatchKMeans) processes random subsets at each iteration, achieving comparable quality at a fraction of the computational cost.

K-Means: The Algorithm

  1. Choose \(K\) initial centroids (randomly)
  2. Assign each point to the nearest centroid
  3. Recompute centroids as cluster means
  4. Reassign points to new nearest centroids
  5. Repeat steps 2–4 until no points change cluster

The algorithm alternates between assignment and update. It always converges, though possibly to a local optimum.

What Is a Centroid?

A centroid is simply the average position of all points in a cluster — the “center of gravity.”

Analogy: Imagine balancing a cardboard cutout on your fingertip. The balance point is the centroid.

Numerical example:

Age Income
Customer 1 20 30
Customer 2 30 50
Customer 3 40 40
Centroid 30 40

\[\text{Centroid} = \left(\frac{20+30+40}{3},\; \frac{30+50+40}{3}\right) = (30, 40)\]

Key takeaway

The centroid is the mean of all points in the cluster. K-Means works by repeatedly moving centroids to better represent their members.
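The centroid calculation above is a single column-wise mean (a sketch using the three customers in the table):

```python
import numpy as np

# (Age, Income) for Customers 1-3
cluster = np.array([[20, 30],
                    [30, 50],
                    [40, 40]])

centroid = cluster.mean(axis=0)  # average each feature across members
print(centroid)                  # [30. 40.]
```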

K-Means Step 1: Initialize Random Centroids

How it starts:

  1. We choose \(K\) (e.g., \(K=3\))
  2. The algorithm picks 3 random points as initial “centroids”
  3. All data points are currently unassigned

Important

Different random starting positions can lead to different final clusters! That is why n_init=10 runs the algorithm 10 times and keeps the best result.

K-Means Step 2: Assign Each Point to Nearest Centroid

Assignment rule: For each data point, compute the Euclidean distance to every centroid. Assign the point to the closest one.

Example: Point \(P=(1.0, 4.0)\)

  • \(d(P, \mu_1) = \sqrt{0 + 1} = 1.0\)
  • \(d(P, \mu_2) = \sqrt{4 + 4} = 2.8\)
  • \(d(P, \mu_3) = \sqrt{2.25 + 0.25} = 1.6\)

\(\Rightarrow\) Closest is \(\mu_1\), so \(P\) joins the red cluster.

After assigning all points, each one is coloured by its cluster.
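The assignment rule is an argmin over centroid distances. A sketch, using hypothetical centroid coordinates chosen to reproduce the three distances in the example (\(\mu_1 = (1.0, 3.0)\); \(\mu_2\) and \(\mu_3\) are illustrative guesses consistent with the squared differences shown):

```python
import numpy as np

P = np.array([1.0, 4.0])

# Hypothetical centroids that reproduce the distances above
centroids = np.array([[1.0, 3.0],   # mu_1
                      [3.0, 2.0],   # mu_2
                      [2.5, 4.5]])  # mu_3

dists = np.linalg.norm(centroids - P, axis=1)
print(dists)           # ~[1.0, 2.83, 1.58]
print(dists.argmin())  # 0 -> P joins the mu_1 (red) cluster
```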

K-Means Step 3: Recompute Centroids (Cluster Averages)

Update rule: compute the mean of all member points \(\rightarrow\) new centroid.

Red cluster: \((0.5,1.5),\;(1.0,2.0),\;(0.8,1.2),\;(1.5,1.8),\;(1.2,1.0)\)

\[\mu_1 = \left(\frac{0.5+1.0+0.8+1.5+1.2}{5},\;\frac{1.5+2.0+1.2+1.8+1.0}{5}\right) = (1.0,\,1.5)\]

Centroid moved from \((1.0, 3.0)\) to \((1.0, 1.5)\)!

⭐ Key: This is why it is called K-Means — centroids are the mean of their members.

K-Means Step 4: Repeat Until Convergence

Now repeat Steps 2–3:

  1. Re-assign each point to the nearest new centroid
  2. Recompute centroids again
  3. Repeat until no points change cluster

Typically converges in 10–20 iterations.

⭐ The full K-Means loop:

Assign \(\rightarrow\) Update \(\rightarrow\) Assign \(\rightarrow\) Update \(\rightarrow\) … until nothing changes.

K-Means in Python

  • n_clusters: number of clusters \(K\) (you choose)
  • random_state: reproducibility seed
  • n_init=10: run algorithm 10 times, keep best
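A sketch of the fit itself, on simulated data with three equal groups (the make_blobs parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Simulate three equal groups of 50 points each
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=42)
X_std = StandardScaler().fit_transform(X)  # always standardize first

km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X_std)

print(np.bincount(labels))  # cluster sizes, roughly 50 each
print(km.inertia_)          # WCSS — useful only when compared across K
```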

Interpretation: The output shows cluster sizes (approximately 50 members each, since we simulated three equal groups) and a WCSS value. The cluster sizes tell you whether K-Means found balanced groups or highly uneven ones — very uneven sizes (e.g., one cluster with 140 members and two with 5 each) are a warning sign that \(K\) may be wrong, or that the data has outliers pulling centroids. The WCSS value on its own is not informative; it only becomes useful when compared across different values of \(K\) in the Elbow Plot.

Warning

Pitfall: comparing cluster labels across runs. The cluster numbers K-Means assigns (0, 1, 2, …) are arbitrary. If you run K-Means twice on the same data, “Cluster 0” in run 1 might correspond to “Cluster 2” in run 2. Never compare cluster labels across different runs without first aligning them. This also means you cannot interpret cluster numbers as an ordering (Cluster 1 is not “better” than Cluster 2 in any sense).

Warning

Pitfall: outliers dragging centroids. A single extreme customer with income $2,000,000 will pull the centroid of whichever cluster it joins far away from the other members, producing a large but sparse cluster. Always check for outliers before clustering and consider whether to remove or Winsorize them.

Choosing \(K\): The Elbow Plot

WCSS (Within-Cluster Sum of Squares): \[\text{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2\]

  • WCSS always decreases as \(K\) increases
  • Look for the “elbow” — the point where adding more clusters gives diminishing returns

Interpretation: The elbow plot shows WCSS on the vertical axis and \(K\) on the horizontal axis. As \(K\) increases from 1 to 10, WCSS always decreases — with more centroids, every point is closer to its nearest center. The useful information is in the rate of decrease. From \(K=1\) to \(K=2\), WCSS drops steeply because splitting one large cluster into two captures a real structure in the data. From \(K=2\) to \(K=3\), another big drop. From \(K=3\) to \(K=4\), the drop is much smaller. The “elbow” at \(K=3\) is the point of diminishing returns — adding a fourth cluster only marginally reduces within-cluster variance, which is not worth the added complexity.

When the elbow is ambiguous — the curve looks smooth with no sharp bend — consider using the silhouette score (higher is better; measures how much tighter a point is to its own cluster than to its nearest neighbor cluster) or the gap statistic (compares WCSS to a null reference distribution) as supplementary criteria. In practice, domain knowledge often resolves ambiguity: if you have a marketing budget for four campaigns, \(K=4\) is the natural choice even if the elbow is at \(K=3\).
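Producing the elbow plot is just a loop over \(K\) collecting inertia_ (scikit-learn's name for WCSS); a sketch on the same kind of simulated three-group data:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
X_std = StandardScaler().fit_transform(X)

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_std).inertia_
        for k in ks]

# WCSS always falls as K grows; look for the bend where the drop flattens
plt.plot(ks, wcss, marker="o")
plt.xlabel("K"); plt.ylabel("WCSS"); plt.title("Elbow Plot")
plt.show()
```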

Warning

Pitfall: picking \(K\) to match aesthetic preference. A common mistake is to pick \(K=5\) because “five segments feels like a round number” or \(K=3\) because “three is the classic Good/Better/Best framework.” Always ground your \(K\) choice in the elbow plot (or silhouette score) first, then verify the chosen \(K\) makes business sense. The algorithm and the business judgment should converge on the same answer.

WCSS: What Does It Actually Measure?

WCSS stands for Within-Cluster Sum of Squares. In plain language:

  1. For each cluster, measure how far each point is from the cluster’s centroid
  2. Square those distances (so bigger gaps count more)
  3. Add them all up across every cluster

Lower WCSS = tighter clusters = better fit.

Think of it like this: WCSS measures how “spread out” the points are within their assigned groups. If every point sat exactly on its centroid, WCSS would be zero (perfect, but unrealistic).

Imagine a centroid star with dashed lines connecting it to each member point. WCSS is the sum of the squared lengths of all those dashed lines.

Why it matters

The Elbow Plot shows WCSS for different values of \(K\). As you add more clusters, WCSS always goes down (more centroids = shorter distances). The “elbow” is where the improvement slows dramatically — adding more clusters is not worth the extra complexity.

Try It! — K-Means on Customer Data

Try it!

Challenge: You have a dataset with 200 mall customers (Age, Annual Income, Spending Score).

  1. Standardize the features using StandardScaler
  2. Create an Elbow Plot for \(K = 1, 2, \ldots, 10\)
  3. Apply K-Means with your chosen \(K\)
  4. Add the cluster labels to the DataFrame and create a scatter plot of Income vs. Spending Score, coloured by cluster

Hint: Look at the notebook Section 5 for the full walkthrough!

K-Means: Pitfalls and Limitations

1. Sensitive to initialization Different random starts → different clusters. Fix: n_init=10 (run 10 times, keep best).

2. Sensitive to outliers Extreme points pull centroids away from true centers.

3. Assumes spherical clusters Fails on elongated or crescent-shaped groups.

4. Must choose \(K\) in advance Elbow plot helps, but it is a judgment call.

K-Means struggles with: (i) two interlocking crescent-shaped clusters, (ii) an extreme outlier that drags the centroid toward itself.

Key takeaway

⚠️ Always visualise your clusters and check if they make business sense!

Hierarchical Clustering

Background: Hierarchical Clustering

K-Means is a greedy partitioner: you tell it \(K\), it divides the data into \(K\) groups and stops. If you want to see what the data looks like with \(K=3\) and \(K=5\) and \(K=7\), you must run it three separate times. Hierarchical clustering takes a different philosophy: run the algorithm once and obtain a complete tree of all possible partitions, from \(N\) singletons (every point in its own cluster) to 1 giant group (everything in one cluster). After fitting, you choose \(K\) by cutting the tree at any height.

The agglomerative (bottom-up) approach starts with \(N\) clusters and merges them one pair at a time. At each step, the algorithm finds the two closest clusters and merges them. This merge is recorded as a node in the dendrogram at a height equal to the distance between the two clusters. After \(N-1\) merges, all points are in one cluster and the tree is complete.

The divisive (top-down) approach works in the opposite direction: start with all points in one cluster and recursively split. Divisive clustering is less commonly used in practice — it is computationally expensive and the splits are harder to define — so this chapter focuses on agglomerative methods.

Linkage methods define how you measure the distance between two clusters (not two points). This choice matters enormously:

  • Single linkage uses the minimum distance between any two points across the two clusters. It is fast to compute but produces the “chaining” problem: a cluster can grow by absorbing one point at a time, creating long, stringy chains rather than compact groups.
  • Complete linkage uses the maximum distance. It produces compact, roughly equal-sized clusters but is sensitive to outliers (one outlier in a cluster inflates the inter-cluster distance).
  • Average linkage uses the mean of all pairwise distances — a reasonable compromise.
  • Ward’s method (Joe Ward, 1963) asks: if we merge these two clusters, how much does total WCSS increase? It merges the pair that produces the smallest WCSS increase. Ward’s is the default for most business applications because it produces the most balanced, well-separated clusters and corresponds to the same objective function as K-Means.

The dendrogram is the visual output of hierarchical clustering. Each leaf at the bottom represents one data point. Moving up the tree, you see merges happening at increasing heights. The height of a merge reflects how dissimilar the two clusters were when they joined. Tall vertical lines before a merge mean the two groups were far apart — a clear separation. Short lines mean the groups were similar and the split at that level may not be meaningful.

In practice

Hierarchical clustering scales as \(O(n^2 \log n)\) in time and \(O(n^2)\) in memory (it must store the full distance matrix). For datasets with \(n > 10{,}000\) observations, K-Means is typically more practical. Hierarchical clustering is the preferred choice for \(n < 1{,}000\) when you want to explore the full cluster structure without committing to a specific \(K\) in advance — common in exploratory research, financial analysis, and genomics.

Hierarchical Clustering: Bottom Up

Picture three stages:

  • Start: \(N\) clusters — each point A, B, C, D is its own cluster
  • Merge closest — A and B join into one cluster
  • Merge again — {A, B} joins with C into {A, B, C}; D remains separate

Agglomerative algorithm:

  1. Start with each data point as its own cluster (\(N\) clusters)
  2. Compute distance between all pairs of clusters
  3. Merge the two closest clusters
  4. Repeat steps 2–3 until only 1 cluster remains

Why it matters

Unlike K-Means, hierarchical clustering does not require you to pre-specify \(K\). You choose \(K\) afterwards by cutting the dendrogram.

How to Measure Cluster Distance? — Linkage Methods

Linkage Distance between clusters Characteristics
Single Min distance between any two points Can create long “chaining” clusters
Complete Max distance between any two points Tends to produce compact clusters
Average Mean distance between all point pairs Compromise between single & complete
Ward Increase in total within-cluster variance Most balanced, compact clusters

Important

Ward linkage tends to give the most balanced, well-separated clusters. It is the recommended default for most business applications.

Computing Average Linkage: A Worked Example

Suppose we want to merge cluster \(\{A, B\}\) with cluster \(\{C, D\}\) using Average Linkage.

Average linkage = mean of all pairwise distances between the two clusters.

Pair Distance
\(d(A, C)\) 5
\(d(A, D)\) 9
\(d(B, C)\) 4
\(d(B, D)\) 7
Average \(\frac{5+9+4+7}{4} = \mathbf{6.25}\)

There are \(2 \times 2 = 4\) pairs between the two clusters, so we average all 4 distances.

Compare with other linkages:

  • Single = \(\min(5,9,4,7) = 4\)
  • Complete = \(\max(5,9,4,7) = 9\)
  • Average = \(6.25\) (in between!)
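All three linkage numbers fall out of the same four pairwise distances (a sketch):

```python
import numpy as np

# All pairwise distances between cluster {A, B} and cluster {C, D}
pair_dists = np.array([5, 9, 4, 7])  # d(A,C), d(A,D), d(B,C), d(B,D)

print(pair_dists.min())   # single linkage   = 4
print(pair_dists.max())   # complete linkage = 9
print(pair_dists.mean())  # average linkage  = 6.25
```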

Hierarchical Clustering: Merge Example with 5 Points

Distance matrix (Euclidean):

A B C D E
A 0 2 6 10 9
B 2 0 5 9 8
C 6 5 0 4 5
D 10 9 4 0 3
E 9 8 5 3 0

Step-by-step merges (single linkage — use minimum distance):

  1. Merge A & B (distance = 2, the smallest)
  2. Merge D & E (distance = 3, next smallest)
  3. Merge C with {D,E} (distance = min(5,4,5) = 4)
  4. Merge {A,B} with {C,D,E} (distance = min(6,5,…) = 5)

The resulting mini-dendrogram has merges at heights 2, 3, 4, and 5.
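scipy reproduces these four merges directly from the condensed distance matrix (the upper triangle of the table above, read row by row):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Condensed upper triangle for points A, B, C, D, E:
# d(A,B), d(A,C), d(A,D), d(A,E), d(B,C), d(B,D), d(B,E), d(C,D), d(C,E), d(D,E)
condensed = np.array([2, 6, 10, 9, 5, 9, 8, 4, 5, 3], dtype=float)

Z = linkage(condensed, method="single")
print(Z[:, 2])  # merge heights: [2. 3. 4. 5.]
```

Column 2 of the linkage matrix Z holds the height of each merge — exactly the four steps worked out by hand.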

The Dendrogram: Reading the Tree

A dendrogram tree:

  • A and B merge at height 1
  • D and E merge at height 1.5
  • {A,B} joins C at height 3
  • {A,B,C} joins {D,E} at height 5.5
  • A red dashed cut line at height 2.5 gives K = 3

How to read:

  • Leaves (bottom) = individual data points
  • Height of a merge = distance at which two clusters joined
  • Cut the tree horizontally to get \(K\) clusters
  • Cutting at height 2.5 gives 3 clusters: {A,B}, {C}, {D,E}

Key takeaway

Tall vertical lines before a merge mean those clusters are well separated. Short lines mean the groups are similar.

Dendrogram in Python

scipy gives you the dendrogram visualization; sklearn gives you the cluster labels directly.
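A minimal sketch of both tools side by side, using `make_blobs` to simulate three well-separated groups (the sample size and parameters are illustrative, not the book's dataset):

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Simulate three well-separated groups
X, _ = make_blobs(n_samples=30, centers=3, cluster_std=0.8, random_state=42)

# scipy: build the linkage matrix and draw the dendrogram
Z = linkage(X, method="ward")
dendrogram(Z)
plt.axhline(y=10, color="red", linestyle="--")   # example cut line
plt.title("Ward dendrogram")

# sklearn: get the cluster labels directly
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
print(labels)
```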

Interpretation: The dendrogram height where two branches merge tells you how different those groups were when they joined. In this simulated dataset (three well-separated groups), you should see two early merges at low heights — points within the same true cluster joining each other — followed by a large gap before the inter-group merges at much higher heights. The correct \(K\) is the number of branches below the largest gap: if a horizontal cut placed in that gap crosses three vertical branches, then \(K=3\). The red cut line shows where you slice the tree horizontally to obtain your chosen number of clusters.

The AgglomerativeClustering output (cluster labels 0, 1, 2 …) is directly analogous to K-Means labels — you can add them to a DataFrame and use groupby to profile each cluster exactly as in Section 2.

Warning

Pitfall: choosing the wrong linkage for your data’s shape. Single linkage is prone to chaining — it will happily merge two compact groups through a thin bridge of outlier points. Ward linkage assumes roughly spherical, equal-variance clusters. For data with elongated clusters or very unequal cluster sizes, average linkage or complete linkage may be more appropriate. Visualize the dendrogram and the resulting scatter plot to check that the clusters look sensible.

Try It! — Hierarchical vs. K-Means

Try it!

Challenge: Using the same customer dataset:

  1. Build a dendrogram using Ward linkage
  2. Cut it to get 5 clusters
  3. Compare the cluster assignments to those from K-Means (\(K=5\))
  4. Do the two methods agree? On how many customers do they disagree?

Hint: Use scipy.cluster.hierarchy.fcluster(Z, t=5, criterion='maxclust')
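One way the comparison could be coded (sketched on simulated stand-in data rather than the course's customer file; `adjusted_rand_score` compares the two partitions without worrying that the label numbers differ between methods):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Stand-in for the standardized customer data: five separated groups
centers = [[0, 0], [8, 0], [0, 8], [8, 8], [4, 16]]   # illustrative
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=1.0,
                  random_state=0)

# Steps 1-2: Ward linkage, cut the tree into 5 clusters
Z = linkage(X, method="ward")
hier_labels = fcluster(Z, t=5, criterion="maxclust")

# Step 3: K-Means with K=5
km_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Step 4: compare the partitions; ARI = 1 means perfect agreement,
# regardless of how each method numbers its clusters
print("Adjusted Rand Index:", adjusted_rand_score(hier_labels, km_labels))
```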

K-Means vs. Hierarchical: When to Use Which?

Decision flowchart:

  • Know \(K\)?
    • No → Hierarchical (explore tree)
    • Yes → next question
  • Large data (\(n > 1000\))?
    • Yes → K-Means (faster)
    • No → Try both!

Practical decision rules:

  • K-Means: fast, simple, works well with round, evenly sized clusters
  • Hierarchical: see the full merging structure; decide \(K\) after
  • When in doubt: run both and compare — if they agree, more confidence in your clusters

⭐ Bottom line: No single “correct” method. Try both, compare, and let the data guide you.

Case Study: Forbes Financial Data

Background: Clustering for Industry Classification

One of the most compelling validations of clustering in finance is that financial ratios encode industry membership. Companies in the same industry face similar competitive dynamics, capital structures, growth opportunities, and accounting conventions — and these similarities show up systematically in the numbers. A chemical company and a pharmaceutical company both have high capital expenditures and research spending, but the chemical company has lower margins and a more cyclical revenue profile. A grocery retailer has thin margins, high asset turnover, and a very different payout ratio from a healthcare firm.

If clustering can recover industry groups from financial data alone — without ever seeing the industry labels — that validates two things simultaneously: (1) the clustering algorithm is working correctly, and (2) the financial features we chose actually carry industry-relevant information. This is a form of external validation: we compare the algorithm’s output to a ground-truth labeling we withheld.

The Forbes financial dataset is a classic teaching example for this reason. It contains 25 companies from three industries (Chemical, Healthcare, Grocery) described by seven financial metrics. The question is: given only the numbers, can Ward’s hierarchical clustering recover the three industries? The answer is: mostly yes, with the interesting exceptions being the “financially anomalous” companies that are outliers within their own industry — they look like a different industry on paper.

This exercise also introduces an important idea: clustering is not always about discovering unknown structure. Sometimes you already know the groups and you use clustering to ask, “are these groups financially distinguishable?” The cross-tabulation of cluster assignments vs. known industry labels is a rigorous answer to that question.

In practice

Financial institutions use clustering for industry classification as a complement to SIC (Standard Industrial Classification) and NAICS codes. Official industry codes are self-reported and sometimes outdated; financial-ratio clustering can reveal that a “conglomerate” has a financial profile closer to a utilities company, or that a “technology” company actually behaves like a services firm. This has implications for peer group selection in equity valuation, benchmark construction, and risk factor attribution.

Forbes Data: Overview

Dataset: 25 Forbes-listed companies

Variables (7 financial metrics):

  • PE ratio, 5-year return on revenue
  • Debt-to-equity ratio
  • 5-year sales growth, 5-year EPS growth
  • Net profit margin, payout ratio

Known industries: Chemical (14), Healthcare (5), Grocery (6)

The question

Can the algorithm discover industry groups without knowing the labels?

```python
# Example: load a real CSV from the book's data folder
# (left commented out because the file ships with the course materials)
# import pandas as pd
# from sklearn.preprocessing import StandardScaler
#
# Forbes = pd.read_csv("data/forbes_companies.csv")
# X = Forbes.iloc[:, 3:]        # numeric columns only
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X)
```

Forbes: Building the Dendrogram

  • We use Ward linkage (minimises within-cluster variance) — the recommended default
  • The red dashed line shows where we cut to get \(K=3\) clusters
  • We then compare the algorithm’s clusters to the known industry labels
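Since the Forbes CSV isn't loaded here, the following sketch runs the same Ward-plus-crosstab workflow on simulated stand-in data (25 "companies", 7 "metrics", 3 "industries"; all names are illustrative):

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Stand-in for 25 companies x 7 financial metrics in 3 industries
X, idx = make_blobs(n_samples=25, n_features=7, centers=3,
                    cluster_std=1.0, random_state=1)
industry = np.array(["Chemical", "Healthcare", "Grocery"])[idx]

# Standardize, cluster with Ward, cut at K=3
X_scaled = StandardScaler().fit_transform(X)
Z = linkage(X_scaled, method="ward")
clusters = fcluster(Z, t=3, criterion="maxclust")

# External validation: compare clusters to the withheld labels
print(pd.crosstab(industry, clusters))
```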

Interpretation: The cross-tabulation (pd.crosstab) is the key output. A perfect recovery would show each cluster mapping exactly to one industry. In practice, you expect near-perfect recovery for the most financially distinctive industry (Healthcare, with its uniquely high PE ratios and margins) and some overlap between the two more similar industries (Chemical and Grocery). “Misclassified” companies are not errors — they are the most interesting data points. A chemical company that clusters with grocers has a financial profile more characteristic of a low-margin, high-turnover business. That is a finding worth investigating.

The dendrogram also reveals within-industry structure. Among the 14 chemical companies, some merge early (at low height) into tight sub-clusters of 2–3 firms, while others are more isolated. These sub-clusters may correspond to sub-sectors (e.g., specialty chemicals vs. commodity chemicals) that the algorithm discovered without any labels.

Warning

Pitfall: clustering on irrelevant features. The Forbes analysis uses seven financial metrics that are known to be informative about industry membership. If you added irrelevant features — say, the CEO’s age, or the stock price on a random day — you would introduce noise that distorts distances and degrades cluster quality. Feature selection matters as much in clustering as in supervised learning. Include only features that you believe carry information about the natural groupings you are trying to discover.

Warning

Pitfall: assuming clusters are “correct”. The K=3 clustering of Forbes companies is useful — it tells us Healthcare is financially distinct from Chemical and Grocery — but it is not “true” in any deep sense. A K=4 solution might split Chemical into specialty and commodity sub-groups, which could be equally valid. Clustering is a tool for description and exploration, not for establishing ground truth. The business analyst’s judgment about which solution is most actionable is part of the method.

Forbes: The Dendrogram

Reading the Forbes dendrogram (from the previous slide’s plot):

  • Companies that merge at low height are very similar (e.g., chemical firms with comparable PE and payout ratios)
  • The red dashed line cuts the tree into 3 clusters
  • Read bottom to top: which companies group together?

Trace specific company codes (C1, C2, …, H1, H2, …, G1, G2, …) and note that the healthcare firms (H*) tend to form their own branch high up — they look very different from the chemicals and groceries on financial metrics alone.

Forbes: Interpretation and Cross-Tab

Cluster vs. Industry cross-tabulation:

| Industry | C1 | C2 | C3 |
|---|---|---|---|
| Chemical (14) | 0 | 11 | 3 |
| Grocery (6) | 0 | 4 | 2 |
| Healthcare (5) | 4 | 0 | 1 |

Reading the table:

  • C1 = almost pure Healthcare (4 of 5)
  • C2 = mostly Chemical + Grocery
  • C3 = mixed (3 Chem + 2 Groc + 1 Heal)

Key insights:

  • Healthcare is the most distinct — high PE, high growth, high margins
  • Chemical and Grocery overlap — similar margins and payout ratios
  • “Misclassified” companies have financial profiles resembling another industry

⭐ Takeaway: Clustering recovered meaningful industry groups from financial data alone — companies in the same industry share “financial DNA.”

Case Study: Customer Segmentation

Background: The Marketing Science of Segmentation

Customer segmentation is the oldest application of clustering in business. Long before the term “data science” existed, marketers used demographic surveys and purchase records to classify customers into groups and tailor campaigns accordingly. What has changed is scale and precision: modern companies have behavioral data on millions of customers, and clustering algorithms can extract segments that are far more nuanced than the traditional demographic buckets (age group, income bracket, gender).

The foundational insight is the 80/20 rule (Pareto principle): in most retail businesses, roughly 20% of customers generate 80% of revenue. Knowing which 20% — and more importantly, what they look like so you can find more like them — is enormously valuable. Clustering is one of the primary tools for answering this question.

But the value of segmentation extends far beyond identifying the top customers. Each segment has its own price sensitivity, preferred communication channel, motivational triggers, and lifetime value trajectory. A “Budget Shopper” who visits frequently but spends little per visit may have enormous lifetime value if you can convert them to a higher-spending segment with the right loyalty program. An “Impulsive Buyer” with a high spending score but modest income may be vulnerable to economic downturns — a credit risk for BNPL (Buy Now Pay Later) programs.

The three-feature mall customer dataset (Age, Annual Income, Spending Score) is deliberately simplified for teaching purposes. In practice, a retail bank’s customer segmentation model might include hundreds of behavioral features: transaction frequency, average transaction size, product categories purchased, days since last purchase, response rate to past campaigns, customer service contact history, digital engagement metrics. The mechanics are identical — standardize, cluster, profile — but the interpretation requires deeper domain expertise.

In practice

Telecom customer churn segmentation is a canonical industry application. A telecom company clusters customers by usage patterns, contract tenure, complaint history, and plan type. One segment — typically younger, month-to-month contract, heavy data user — has 3–4× the churn rate of the overall population. By identifying this segment proactively, the retention team can offer targeted upgrades or discounts before the customer decides to leave, converting a reactive “save” into a proactive retention. Studies suggest proactive retention of high-churn segments costs 5–10× less than acquiring an equivalent new customer.

Mall Customer Segmentation: Business Context

Scenario: A shopping mall wants to understand its customers to target marketing.

Variables:

  • Age (18–70)
  • Annual Income ($15k–$140k)
  • Spending Score (1–100)

Goal: Segment 200 customers into meaningful groups for targeted marketing campaigns.

A 2D map of (Income, Spending) showing five blobs in different corners — high-income/high-spending VIPs, high-income/low-spending careful spenders, low-income/high-spending impulsive buyers, low-income/low-spending budget shoppers, and middle-of-the-pack average customers.

Mall Customers: Elbow Plot

The elbow at \(K=5\) suggests five natural customer segments.

Interpretation: The steep drop from \(K=1\) to \(K=5\) and then the much flatter curve beyond \(K=5\) confirms that five is the natural number of segments in this data. From \(K=5\) to \(K=6\), WCSS drops by only a small amount relative to earlier steps — the sixth cluster is subdividing an already well-defined group, not capturing a new natural grouping. In a real dataset the elbow might be less sharp. When the elbow plot is ambiguous between \(K=4\) and \(K=5\), a useful sanity check is to run both and ask: do the two extra segments in \(K=5\) have meaningfully different profiles and require different marketing strategies? If yes, use 5. If the two extra segments look essentially the same, use 4.
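An elbow plot like the one described comes from a loop over candidate \(K\) values; `inertia_` is scikit-learn's attribute for WCSS. A sketch on simulated five-blob data (parameters are illustrative):

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in for the standardized customer features: five blobs
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.8, random_state=42)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)   # inertia_ = within-cluster sum of squares

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("K")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow plot")
```

On well-separated data, the drop in WCSS should be steep up to \(K=5\) and nearly flat afterward.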

Mall Customers: K-Means with K=5

The plot shows five clearly separated groups in (Income, Spending) space — one in each “corner” plus one in the middle.

Interpretation: The scatter plot immediately reveals the business-relevant structure: the five segments occupy distinct regions of (Income, Spending Score) space. This geometric separation is what makes the clustering actionable — each segment occupies a different “quadrant” of the customer behavior space. Notice that the cluster in the upper-right corner (high income, high spending) is the smallest by count but highest in value per customer. The cluster in the lower-left (low income, low spending) is probably the largest by count but lowest in value. The “average” cluster in the middle has the most variance — it is the catch-all for customers who do not fit neatly into any corner.

Warning

Pitfall: forgetting that clustering is for description, not prediction. The cluster labels assigned by K-Means do not predict whether a new customer will be a “VIP” — they describe the current customer base. To predict segment membership for new customers, you need a supervised classifier (e.g., logistic regression or decision tree) trained on the cluster labels as \(Y\). Clustering produces the labels; supervised learning uses them.

Mall Customers: Naming the Segments

| Cluster | Name | Income | Score | Marketing Strategy |
|---|---|---|---|---|
| 1 | VIP Clients | High | High | Loyalty programs, exclusive events |
| 3 | Impulsive Buyers | Low | High | Affordable premium, BNPL |
| 0 | Average Customers | Mid | Mid | Newsletters, general promos |
| 2 | Careful Spenders | High | Low | Personalized outreach, rewards |
| 4 | Budget Shoppers | Low | Low | Discount bundles, value packs |

Key takeaway

Clustering transforms raw data into actionable business segments. Each cluster gets a tailored strategy — that is the real value of unsupervised learning.

Profiling Clusters with groupby

How do we know what each cluster “looks like”? Use groupby to compute the mean of each feature per cluster — the same pandas skill from Topic 2!
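A sketch of the profiling step, assuming a DataFrame named `mall` with a `Cluster` column from K-Means (here the data is simulated and the column names are illustrative):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Simulated stand-in for the mall customer data
X, _ = make_blobs(n_samples=200, centers=5, n_features=3, random_state=42)
mall = pd.DataFrame(X, columns=["Age", "Income", "SpendingScore"])

# Standardize, cluster, and attach the labels to the DataFrame
X_scaled = StandardScaler().fit_transform(mall)
mall["Cluster"] = KMeans(n_clusters=5, n_init=10,
                         random_state=42).fit_predict(X_scaled)

# The profile table: mean of each feature per cluster
print(mall.groupby("Cluster").mean().round(1))
```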

Example output (numbers vary slightly by seed):

| Cluster | Avg Age | Avg Income ($k) | Avg Spending Score |
|---|---|---|---|
| 0 | 42.7 | 55.3 | 49.5 |
| 1 | 32.7 | 86.5 | 82.1 |
| 2 | 41.1 | 88.2 | 17.1 |
| 3 | 25.3 | 25.7 | 79.4 |
| 4 | 45.2 | 26.3 | 20.9 |

Why it matters

The cluster profile table is how you name your segments. Cluster 1 has high income and high spending \(\Rightarrow\) “VIP.” Cluster 4 has low income and low spending \(\Rightarrow\) “Budget.” The numbers tell the story!

Interpretation: The profile table is the bridge from algorithm output to business action. Reading across each row, you reconstruct a “persona” for each cluster. Cluster 1 (high income \(\approx \$87k\), high spending score \(\approx 82\), relatively young \(\approx 33\)) is the VIP segment: affluent, engaged, and likely already loyal. The marketing strategy is retention: premium loyalty rewards, early access to new products, exclusive events. Cluster 4 (low income \(\approx \$26k\), low spending score \(\approx 21\), older \(\approx 45\)) is the Budget segment: value-sensitive, infrequent high-value purchases. The strategy is conversion: discount bundles, value-pack promotions, savings messaging.

The centroid coordinates (the cluster means) are the mathematical definition of the “average member” of each segment. They are the most useful single summary of a cluster for business communication: instead of describing a distribution, you describe a representative persona.

In practice

Naming segments is as much art as science. Common industry naming conventions: Transactors / Revolvers / Dormant (banking); Champions / Loyal / At-Risk / Lost (RFM marketing); Prime / Standard / Subprime (credit); Early Adopters / Mainstream / Laggards (technology adoption). The names should be memorable, non-pejorative (avoid calling any segment “bad”), and actionable — the name itself should suggest the strategy.

Try it!

Extension: Try mall.groupby('Cluster').describe() to see not just the mean, but also min, max, and standard deviation for each cluster. Which cluster has the most variation?

In practice

Hierarchical vs. K-Means agreement. In the Try It! challenge from Section 3, you compared hierarchical clustering (Ward, K=5) to K-Means (K=5) on the same customer data. When both methods agree on the assignment of most customers, you have strong evidence that the clusters are real structure in the data, not artifacts of one algorithm. When they disagree on many customers, the cluster boundary in that region is ambiguous — worth investigating whether those customers belong to a distinct segment or simply live in an overlapping zone between two segments.

Summary

Chapter Summary: What You Have Learned

This chapter introduced the two foundational algorithms of unsupervised learning — K-Means and hierarchical clustering — and demonstrated their application to two real-world business problems. Here is what you should take away.

The unsupervised learning landscape. Clustering is the most widely deployed unsupervised method, but it sits within a broader family:

  • K-Means and K-Medoids: partition-based; fast; require \(K\) in advance
  • Hierarchical (agglomerative): tree-based; no \(K\) required upfront; produces dendrogram
  • DBSCAN (Density-Based Spatial Clustering): finds clusters of arbitrary shape; handles noise/outliers explicitly; particularly useful for geographic data
  • Gaussian Mixture Models (GMM): probabilistic, soft-assignment version of K-Means; useful when cluster boundaries are fuzzy
  • Spectral Clustering: uses the eigenstructure of the similarity matrix; handles non-convex clusters; computationally expensive for large \(n\)

For most business applications, K-Means or Ward hierarchical clustering will get you 90% of the way there. Start simple.
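All of these methods are available in scikit-learn behind a near-identical interface, so trying alternatives is cheap. A sketch (the parameter values, such as `eps`, are illustrative):

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=7)

# Hard-assignment methods share the fit_predict interface
for name, model in [
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=7)),
    ("ward", AgglomerativeClustering(n_clusters=3, linkage="ward")),
    ("dbscan", DBSCAN(eps=1.0, min_samples=5)),   # eps is illustrative
]:
    labels = model.fit_predict(X)
    print(name, "found", len(set(labels)), "groups")

# GMM is probabilistic: predict_proba gives soft memberships
gmm = GaussianMixture(n_components=3, random_state=7).fit(X)
print("gmm soft memberships:", gmm.predict_proba(X).shape)   # (300, 3)
```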

The 5-step clustering workflow:

  1. Collect — gather relevant features that you believe encode group membership
  2. Standardize — apply StandardScaler so every feature has equal weight in distance calculations
  3. Choose \(K\) — use the Elbow Plot for K-Means; use the dendrogram height gap for hierarchical
  4. Cluster — fit the algorithm; for K-Means, use n_init=10 and a fixed random_state
  5. Interpret — use groupby to profile clusters; name segments; connect to business actions

Step 5 is where most of the value is created — and where most of the effort should go. The algorithm does the math; you provide the judgment.

Decision flowchart: which method?

  • \(n > 10{,}000\) and speed matters → K-Means
  • \(n < 1{,}000\) and you want to explore the full structure → Hierarchical (Ward)
  • You do not know \(K\) and want to discover it → Hierarchical (cut the dendrogram)
  • You want soft cluster memberships → Gaussian Mixture Model
  • Clusters may be non-spherical or of very different densities → DBSCAN
  • In doubt → run both K-Means and hierarchical; compare results

What’s next. Clustering operates in the original feature space. When you have many features (say, 50 financial ratios), the distance calculations become noisy — in high dimensions, all points tend to be roughly equidistant from each other (the “curse of dimensionality”). The solution is dimensionality reduction: project the data into a lower-dimensional space that preserves the essential variation. Principal Component Analysis (PCA) is the linear version — it finds the directions of maximum variance and projects the data onto the first \(k\) principal components. t-SNE and UMAP are nonlinear alternatives that preserve local neighborhood structure, commonly used for visualizing high-dimensional clusters. These topics are the natural next step after mastering the algorithms in this chapter.

Further reading. The canonical references for the material in this chapter are:

  • The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman, 2009) — Chapter 14 covers unsupervised methods rigorously, including K-Means, hierarchical clustering, and principal components. Freely available at https://web.stanford.edu/~hastie/ElemStatLearn/
  • An Introduction to Statistical Learning (James, Witten, Hastie, and Tibshirani, 2021) — Chapter 12 covers the same material at a more accessible level, with R code examples. Freely available at https://www.statlearning.com/

Final reflection. Every algorithm in this chapter is a mathematical object: it optimizes a criterion, converges to a fixed point, produces a partition. But the hard work — deciding which features to cluster on, choosing \(K\), naming the segments, and designing the marketing strategy for each — requires human judgment, domain knowledge, and creativity that no algorithm possesses. The machine finds the pattern; you decide what it means and what to do about it. That is the fundamental reason why data science skills amplify rather than replace business judgment.

Key takeaway

Clustering is where mathematics meets business judgment. The algorithm finds the groups; you name them and act on them. Neither step works without the other.

Supervised vs. Unsupervised Learning

|   | Supervised (Topics 1–3) | Unsupervised (Topic 4) |
|---|---|---|
| Goal | Predict \(Y\) | Find groups |
| Labels? | Yes (need \(Y\)) | No (no target) |
| Methods | Regression, classification | Clustering, PCA |
| Evaluation | \(R^2\), MSE, accuracy | Inertia, silhouette, domain expertise |
| Example | Predict house price | Segment customers |

K-Means vs. Hierarchical Clustering

|   | K-Means | Hierarchical |
|---|---|---|
| Choose \(K\)? | Before running | After (cut dendrogram) |
| Speed | Fast (large datasets) | Slower (\(O(n^2)\) memory) |
| Visualization | Scatter plot | Dendrogram |
| Shape | Spherical clusters | Any shape |
| Reproducibility | Depends on init | Deterministic |
| Best for | Large \(n\), clear groups | Small \(n\), exploring structure |

Key takeaway

Use K-Means for speed and simplicity. Use Hierarchical for richer exploration and when you are unsure about \(K\). In practice, try both and compare!

End of Topic 4


Prof. Xuhu Wan · HKUST ISOM · Intro to Business Analytics