Chapter 4: Clustering
Outline
- Why Clustering?
- How Do We Measure Similarity?
- K-Means Clustering
- Hierarchical Clustering
- Case Study: Forbes Financial Data
- Case Study: Customer Segmentation
- Summary
Why Clustering?
Why Clustering? — Business Motivation
Clustering is everywhere in business:
- 💻 Netflix segments viewers into taste groups to recommend movies
- 💻 Spotify creates playlist groups from listening behaviour
- 💻 Banks segment credit-risk customers for pricing and marketing
- 💻 Retailers group products by purchasing patterns
Imagine a scatter plot with three clearly separated point clouds — Group A, Group B, Group C — sitting in different regions of (Feature 1, Feature 2) space.
In supervised learning (regression, Topics 1–3) we had labels \(Y\). In clustering we discover natural groups with no labels — this is unsupervised learning!
The Clustering Workflow
The five-step pipeline:
1. Collect Data → 2. Standardize Features → 3. Choose \(K\) → 4. Run Clustering → 5. Interpret Clusters
- Step 2 is critical — variables on different scales distort distances
- Step 3 uses the Elbow Plot (K-Means) or Dendrogram (Hierarchical)
- Step 5 is where business value lives — naming and acting on segments
Clustering = finding natural groups in data without a target variable. Today we learn two methods: K-Means and Hierarchical Clustering.
Clustering in Everyday Life
You already cluster things every day:
- 🎵 Music playlist: You sort songs into “Workout,” “Chill,” “Study” — songs in the same playlist sound similar to each other
- 👕 Sorting laundry: Whites, darks, colours — you group by a feature (colour) without anyone telling you the groups
- 🍽️ Grocery store aisles: Dairy, produce, snacks — items that “belong together” are placed together
In each case, you look at features (tempo, colour, food type) and group items that are similar.
Three laundry baskets — Whites, Darks, Colours — illustrate that no one labels the items; you discover the groups yourself.
Clustering is the same idea applied to data: the computer looks at numbers (features) and groups data points that are close to each other. The key question is: how do we measure “closeness”?
How Do We Measure Similarity?
Background: Measuring Similarity
Every clustering algorithm rests on a single foundational question: how do we define “similar”? Before any algorithm can group points, it needs a ruler — a distance measure that assigns a number to every pair of observations. Smaller number means more similar; larger number means more different. The choice of ruler profoundly shapes the clusters you obtain.
Euclidean distance is the oldest and most natural metric. It is precisely Pythagoras’s theorem generalized to \(p\) dimensions: the distance between two points is the square root of the sum of squared coordinate differences. For two customers differing only in age, Euclidean distance reduces to the absolute age difference. For two customers differing in age, income, and spending score simultaneously, it combines all three differences into a single number via the Pythagorean formula.
Manhattan distance (also called the L1 norm or taxicab distance) was formalized by Hermann Minkowski in the early 20th century as part of a broader family of \(L_p\) norms. The name “taxicab” reflects the geometry of a grid city: a taxi can only travel along streets, not diagonally, so the distance between two intersections is the sum of the horizontal and vertical displacements, not the straight-line flight distance. Manhattan distance is less sensitive to large individual differences on one feature — it does not square the deviations — making it more robust to outliers in high-dimensional settings.
Correlation distance (\(1 - \text{corr}(A,B)\)) is conceptually different: it measures whether two observations have the same shape or profile across features, regardless of their absolute magnitude. Two companies with identical PE ratios and growth rates will have a correlation distance of zero even if one is twice the size of the other. This is particularly valuable in financial analysis where you care about pattern similarity rather than absolute level.
Standardization is not itself a distance measure but a pre-processing step that makes distance measures meaningful. Without it, features measured in dollars (income = 40,000–140,000) will numerically dominate features measured in single digits (spending score = 1–100), simply because the scale is larger. The algorithm will effectively ignore the small-scale features, even if they are economically important. Standardizing every feature to mean zero and unit variance — the z-score transform — gives each feature an equal vote in the distance computation.
The single biggest clustering mistake in practice is skipping standardization. If you cluster Age (20–70) and Annual Income ($15,000–$140,000) without standardizing, the income variable dominates the distance calculation by a factor of roughly 1,000. The algorithm will produce clusters that are almost entirely determined by income, as if age does not exist. Always standardize first.
Euclidean Distance
For two points \(A=(a_1,\dots,a_p)\) and \(B=(b_1,\dots,b_p)\):
\[ d(A,B) = \sqrt{(a_1-b_1)^2 + (a_2-b_2)^2 + \cdots + (a_p-b_p)^2} \]
Example (2 features):
- Customer \(A\): Age = 35, Income = 60k
- Customer \(B\): Age = 28, Income = 80k
\[ d = \sqrt{(35-28)^2 + (60-80)^2} = \sqrt{49+400} = \sqrt{449} \approx 21.2 \]
Interpretation: The distance of 21.2 is a raw number measured in the same mixed units as your features — part years of age, part thousands of dollars of income. By itself, 21.2 tells you nothing useful until you compare it to the distances between other pairs of customers. If most customer pairs have distances between 5 and 15, then 21.2 is a relatively large gap and these two customers are quite different. If most pairs have distances between 50 and 200, then 21.2 means they are actually quite similar. This is precisely why Euclidean distance is most informative after standardization: once all features are on a common z-score scale, a distance of 1.0 always means “roughly one standard deviation apart across all features,” giving you a calibrated ruler.
Pitfall: Do not interpret raw Euclidean distances as meaningful numbers without first understanding the scale of your features. A distance of 21.2 when Age is in years and Income is in thousands is dominated almost entirely by the income dimension (\((60-80)^2 = 400\) vs. \((35-28)^2 = 49\)). Standardize first.
Euclidean Distance: Step-by-Step Hand Calculation
Let us slow down and compute the distance between two customers, one step at a time.
| | Age | Income ($k) | Spending Score |
|---|---|---|---|
| Customer A | 25 | 40 | 70 |
| Customer B | 30 | 55 | 50 |
Step 1: Subtract each feature of \(B\) from \(A\): \[(25-30),\quad (40-55),\quad (70-50) = -5,\quad -15,\quad 20\]
Step 2: Square each difference: \[(-5)^2,\quad (-15)^2,\quad (20)^2 = 25,\quad 225,\quad 400\]
Step 3: Add the squared differences: \(25 + 225 + 400 = 650\)
Step 4: Square root: \(d(A,B) = \sqrt{650} \approx 25.5\)
Notice that Spending Score (difference of 20) and Income (difference of 15) contribute much more than Age (difference of 5). Features with larger numbers dominate the distance — this is why we need standardization!
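If you want to check the arithmetic in code, a few lines of NumPy reproduce the four steps above (a minimal sketch; the values come straight from the table):

```python
import numpy as np

# Customer A and B: (Age, Income in $k, Spending Score)
A = np.array([25, 40, 70])
B = np.array([30, 55, 50])

diff = A - B                      # Step 1: differences -> [-5, -15, 20]
squared = diff ** 2               # Step 2: squares     -> [25, 225, 400]
total = squared.sum()             # Step 3: sum         -> 650
dist = np.sqrt(total)             # Step 4: square root -> ~25.5

print(dist)                       # 25.495...
print(np.linalg.norm(A - B))      # same result in a single call
```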
Other Distance Measures
| Distance | Formula | When to use |
|---|---|---|
| Euclidean | \(\sqrt{\sum (a_i - b_i)^2}\) | Default; continuous features |
| Manhattan | \(\sum \lvert a_i - b_i \rvert\) | Robust to outliers; grid-like data |
| Correlation | \(1 - \text{corr}(A,B)\) | When shape matters, not magnitude |
Different distance metrics can produce very different clusters! Always think about what “similar” means for your business problem.
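To see how the choice of metric changes the number, SciPy ships implementations of all three. A small sketch using the same two customers as above (the function names euclidean, cityblock, and correlation are SciPy's):

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, correlation

A = np.array([25, 40, 70])
B = np.array([30, 55, 50])

print(euclidean(A, B))     # ~25.5  (straight-line distance)
print(cityblock(A, B))     # 40     (|-5| + |-15| + |20| = Manhattan / L1)
print(correlation(A, B))   # ~0.38  (1 - corr(A, B): small when the two profiles share the same shape)
```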
Why Standardize?
Picture the clearest motivating example: a scatter plot whose two real clusters are separated top vs. bottom, so a horizontal cut would divide them correctly. But the \(x\)-axis spans a much larger numeric range than the \(y\)-axis, so even a visually small horizontal gap contributes far more to Euclidean distance than the visually larger vertical gap. K-Means follows the dominant axis and cuts the data left vs. right with a vertical boundary — exactly the wrong split. After standardizing each feature to the same scale, the vertical gap finally counts as much as the horizontal one, and K-Means recovers the correct top/bottom split.
Problem: Without standardizing, the axis with larger numbers dominates K-Means.
Solution: \(z_i = \dfrac{x_i - \bar{x}}{s_x}\) (mean \(=0\), std \(=1\))
Interpretation: After StandardScaler, the mean of each column is 0.00 and the standard deviation of each column is 1.00 (up to numerical precision). The customer who was 45 years old is now +1.28 standard deviations above the mean age. The customer who earned $40,000 is now −1.25 standard deviations below the mean income. Now when you compute Euclidean distance between two customers, each feature contributes proportionally to how unusual the difference is — not proportionally to the raw units. A 5-year age difference and a $12,000 income difference each contribute roughly the same amount to the total distance, which is a far more defensible definition of “similar.”
StandardScaler is the default choice for clustering. An alternative is MinMaxScaler, which maps each feature to the range \([0, 1]\) instead of standardizing to zero mean and unit variance. Min-max scaling is sensitive to outliers (one extreme value compresses everything else into a small range), so z-score standardization is generally preferred for clustering unless you have a specific reason to prefer bounded features — for example, when feeding data to a neural network with bounded activation functions.
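A minimal sketch of the z-score transform with scikit-learn's StandardScaler (the small DataFrame here is made up purely for illustration, not the mall dataset used later):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

customers = pd.DataFrame({
    "Age":    [25, 32, 45, 51, 38],
    "Income": [40_000, 62_000, 95_000, 120_000, 71_000],
})

scaler = StandardScaler()
Z = pd.DataFrame(scaler.fit_transform(customers), columns=customers.columns)

print(Z.round(2))
print(Z.mean().round(2))        # ~0 for every column
print(Z.std(ddof=0).round(2))   # ~1 for every column (StandardScaler uses the population std)
```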
Always standardize before clustering so every feature gets an equal vote.
K-Means Clustering
Background: The K-Means Algorithm
K-Means is deceptively simple. The entire algorithm reduces to two alternating steps repeated until nothing changes: assign each point to its nearest centroid, then update each centroid to the mean of its members. This assign-update cycle is guaranteed to terminate — the objective function (WCSS, or Within-Cluster Sum of Squares) strictly decreases at each step and is bounded below by zero, so the algorithm must eventually converge.
The convergence guarantee comes with an important caveat: K-Means converges to a local minimum, not necessarily the global one. The final clustering depends on which \(K\) points were chosen as initial centroids. Two runs of the algorithm on the same data with different random seeds can produce different final clusters, each locally optimal but potentially far from each other in quality. This is why scikit-learn’s default is n_init=10: run the algorithm 10 times with different random initializations and keep the best result (the one with the lowest WCSS). For important analyses, increase n_init to 20 or 50.
A more sophisticated initialization strategy called K-Means++ (Arthur and Vassilvitskii, 2007) chooses initial centroids that are spread far apart from each other, dramatically reducing the probability of landing in a poor local minimum. Scikit-learn uses K-Means++ initialization by default (init='k-means++'), which is why you rarely need to increase n_init beyond 10 in practice.
Why does K-Means always converge? Each iteration can only change the cluster assignment of a point if a different centroid is now closer. Every such reassignment reduces WCSS (or leaves it unchanged). Since there are only finitely many possible partitions of \(n\) points into \(K\) groups, the algorithm cannot cycle and must stop. The formal proof uses the fact that WCSS is a Lyapunov function for the algorithm — a non-negative quantity that never increases — which is a standard tool in optimization theory.
The connection to Gaussian Mixture Models: K-Means is actually a special (hard-assignment, equal-variance) case of the Expectation-Maximization (EM) algorithm for fitting Gaussian Mixture Models (GMMs). Where K-Means assigns each point to exactly one cluster (hard assignment), a GMM assigns each point a probability of belonging to each cluster (soft assignment). If you need soft boundaries between clusters — for example, customers who are ambiguously between two segments — GMMs are the natural extension.
For most business applications with \(n < 100{,}000\) observations, K-Means with n_init=10 and K-Means++ initialization (sklearn’s default) is robust and fast. For very large datasets (\(n > 1{,}000{,}000\)), Mini-Batch K-Means (sklearn.cluster.MiniBatchKMeans) processes random subsets at each iteration, achieving comparable quality at a fraction of the computational cost.
K-Means: The Algorithm
- Choose \(K\) initial centroids (randomly)
- Assign each point to the nearest centroid
- Recompute centroids as cluster means
- Reassign points to new nearest centroids
- Repeat steps 2–4 until no points change cluster
The algorithm alternates between assignment and update. It always converges, though possibly to a local optimum.
What Is a Centroid?
A centroid is simply the average position of all points in a cluster — the “center of gravity.”
Analogy: Imagine balancing a cardboard cutout on your fingertip. The balance point is the centroid.
Numerical example:
| | Age | Income |
|---|---|---|
| Customer 1 | 20 | 30 |
| Customer 2 | 30 | 50 |
| Customer 3 | 40 | 40 |
| Centroid | 30 | 40 |
\[\text{Centroid} = \left(\frac{20+30+40}{3},\; \frac{30+50+40}{3}\right) = (30, 40)\]
The centroid is the mean of all points in the cluster. K-Means works by repeatedly moving centroids to better represent their members.
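In code, a centroid is just a column-wise mean. A quick check with the three customers above:

```python
import numpy as np

cluster = np.array([[20, 30],   # Customer 1: (Age, Income)
                    [30, 50],   # Customer 2
                    [40, 40]])  # Customer 3

centroid = cluster.mean(axis=0)  # mean of each column
print(centroid)                  # [30. 40.]
```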
K-Means Step 1: Initialize Random Centroids
How it starts:
- We choose \(K\) (e.g., \(K=3\))
- The algorithm picks 3 random points as initial “centroids”
- All data points are currently unassigned
Different random starting positions can lead to different final clusters! That is why n_init=10 runs the algorithm 10 times and keeps the best result.
K-Means Step 2: Assign Each Point to Nearest Centroid
Assignment rule: For each data point, compute the Euclidean distance to every centroid. Assign the point to the closest one.
Example: Point \(P=(1.0, 4.0)\)
- \(d(P, \mu_1) = \sqrt{0 + 1} = 1.0\)
- \(d(P, \mu_2) = \sqrt{4 + 4} = 2.8\)
- \(d(P, \mu_3) = \sqrt{2.25 + 0.25} = 1.6\)
\(\Rightarrow\) Closest is \(\mu_1\), so \(P\) joins the red cluster.
After assigning all points, each one is coloured by its cluster.
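The assignment rule is one distance computation per centroid followed by an argmin. The sketch below reproduces the numbers above; the slide does not state the coordinates of \(\mu_2\) and \(\mu_3\), so the values used here are assumptions chosen to match the squared differences shown (\(\mu_1 = (1.0, 3.0)\) comes from the next step):

```python
import numpy as np

P = np.array([1.0, 4.0])
centroids = np.array([[1.0, 3.0],    # mu_1 (given on the next slide)
                      [3.0, 2.0],    # mu_2 (assumed: coordinate differences of 2 and 2)
                      [2.5, 3.5]])   # mu_3 (assumed: coordinate differences of 1.5 and 0.5)

dists = np.linalg.norm(centroids - P, axis=1)
print(dists.round(2))    # [1.   2.83 1.58]
print(dists.argmin())    # 0  -> P joins the cluster of mu_1
```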
K-Means Step 3: Recompute Centroids (Cluster Averages)
Update rule: compute the mean of all member points \(\rightarrow\) new centroid.
Red cluster: \((0.5,1.5),\;(1.0,2.0),\;(0.8,1.2),\;(1.5,1.8),\;(1.2,1.0)\)
\[\mu_1 = \left(\frac{0.5+1.0+0.8+1.5+1.2}{5},\;\frac{1.5+2.0+1.2+1.8+1.0}{5}\right) = (1.0,\,1.5)\]
Centroid moved from \((1.0, 3.0)\) to \((1.0, 1.5)\)!
⭐ Key: This is why it is called K-Means — centroids are the mean of their members.
K-Means Step 4: Repeat Until Convergence
Now repeat Steps 2–3:
- Re-assign each point to the nearest new centroid
- Recompute centroids again
- Repeat until no points change cluster
Typically converges in 10–20 iterations.
⭐ The full K-Means loop:
Assign \(\rightarrow\) Update \(\rightarrow\) Assign \(\rightarrow\) Update \(\rightarrow\) … until nothing changes.
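To see the loop mechanics end to end, here is a bare-bones NumPy implementation — a teaching sketch only, not a replacement for scikit-learn (plain random initialization, no K-Means++, no restarts, and it assumes no cluster ever ends up empty):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random points as initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]

    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Step 3: recompute each centroid as the mean of its members
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids

# Example usage: labels, centers = kmeans(X, K=3)
```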
K-Means in Python
Key parameters of scikit-learn's KMeans:
- n_clusters: number of clusters \(K\) (you choose)
- random_state: reproducibility seed
- n_init=10: run the algorithm 10 times, keep the best result
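A minimal sketch of the call, using three simulated groups of 50 points each so the output matches the interpretation below (the simulated data is an assumption to keep the example self-contained):

```python
import numpy as np
from sklearn.cluster import KMeans

# Simulate three well-separated groups of 50 points each
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

km = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = km.fit_predict(X)

print(np.bincount(labels))   # cluster sizes, roughly [50 50 50]
print(km.inertia_)           # WCSS for this solution
```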
Interpretation: The output shows cluster sizes (approximately 50 members each, since we simulated three equal groups) and a WCSS value. The cluster sizes tell you whether K-Means found balanced groups or highly uneven ones — very uneven sizes (e.g., one cluster with 140 members and two with 5 each) are a warning sign that \(K\) may be wrong, or that the data has outliers pulling centroids. The WCSS value on its own is not informative; it only becomes useful when compared across different values of \(K\) in the Elbow Plot.
Pitfall: comparing cluster labels across runs. The cluster numbers K-Means assigns (0, 1, 2, …) are arbitrary. If you run K-Means twice on the same data, “Cluster 0” in run 1 might correspond to “Cluster 2” in run 2. Never compare cluster labels across different runs without first aligning them. This also means you cannot interpret cluster numbers as an ordering (Cluster 1 is not “better” than Cluster 2 in any sense).
Pitfall: outliers dragging centroids. A single extreme customer with income $2,000,000 will pull the centroid of whichever cluster it joins far away from the other members, producing a large but sparse cluster. Always check for outliers before clustering and consider whether to remove or Winsorize them.
Choosing \(K\): The Elbow Plot
WCSS (Within-Cluster Sum of Squares): \[\text{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2\]
- WCSS always decreases as \(K\) increases
- Look for the “elbow” — the point where adding more clusters gives diminishing returns
Interpretation: The elbow plot shows WCSS on the vertical axis and \(K\) on the horizontal axis. As \(K\) increases from 1 to 10, WCSS always decreases — with more centroids, every point is closer to its nearest center. The useful information is in the rate of decrease. From \(K=1\) to \(K=2\), WCSS drops steeply because splitting one large cluster into two captures a real structure in the data. From \(K=2\) to \(K=3\), another big drop. From \(K=3\) to \(K=4\), the drop is much smaller. The “elbow” at \(K=3\) is the point of diminishing returns — adding a fourth cluster only marginally reduces within-cluster variance, which is not worth the added complexity.
When the elbow is ambiguous — the curve looks smooth with no sharp bend — consider using the silhouette score (higher is better; measures how much tighter a point is to its own cluster than to its nearest neighbor cluster) or the gap statistic (compares WCSS to a null reference distribution) as supplementary criteria. In practice, domain knowledge often resolves ambiguity: if you have a marketing budget for four campaigns, \(K=4\) is the natural choice even if the elbow is at \(K=3\).
Pitfall: picking \(K\) to match aesthetic preference. A common mistake is to pick \(K=5\) because “five segments feels like a round number” or \(K=3\) because “three is the classic Good/Better/Best framework.” Always ground your \(K\) choice in the elbow plot (or silhouette score) first, then verify the chosen \(K\) makes business sense. The algorithm and the business judgment should converge on the same answer.
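A sketch of how the elbow plot and silhouette scores are typically computed. The simulated matrix below is a stand-in for your standardized feature matrix; in practice you would use the output of StandardScaler:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stand-in for your standardized feature matrix (three simulated groups)
rng = np.random.default_rng(0)
X_scaled = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2))
                      for c in [(0, 0), (3, 3), (0, 3)]])

wcss, sil = [], {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_scaled)
    wcss.append(km.inertia_)                         # WCSS always decreases with K
    if k >= 2:                                       # silhouette needs at least 2 clusters
        sil[k] = silhouette_score(X_scaled, km.labels_)

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow Plot")
plt.show()

print(max(sil, key=sil.get))   # K with the highest silhouette score
```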
WCSS: What Does It Actually Measure?
WCSS stands for Within-Cluster Sum of Squares. In plain language:
- For each cluster, measure how far each point is from the cluster’s centroid
- Square those distances (so bigger gaps count more)
- Add them all up across every cluster
Lower WCSS = tighter clusters = better fit.
Think of it like this: WCSS measures how “spread out” the points are within their assigned groups. If every point sat exactly on its centroid, WCSS would be zero (perfect, but unrealistic).
Imagine a centroid star with dashed lines connecting it to each member point. WCSS is the sum of the squared lengths of all those dashed lines.
The Elbow Plot shows WCSS for different values of \(K\). As you add more clusters, WCSS always goes down (more centroids = shorter distances). The “elbow” is where the improvement slows dramatically — adding more clusters is not worth the extra complexity.
Try It! — K-Means on Customer Data
Challenge: You have a dataset with 200 mall customers (Age, Annual Income, Spending Score).
- Standardize the features using StandardScaler
- Create an Elbow Plot for \(K = 1, 2, \ldots, 10\)
- Apply K-Means with your chosen \(K\)
- Add the cluster labels to the DataFrame and create a scatter plot of Income vs. Spending Score, coloured by cluster
Hint: Look at the notebook Section 5 for the full walkthrough!
K-Means: Pitfalls and Limitations
1. Sensitive to initialization Different random starts → different clusters. Fix: n_init=10 (run 10 times, keep best).
2. Sensitive to outliers Extreme points pull centroids away from true centers.
3. Assumes spherical clusters Fails on elongated or crescent-shaped groups.
4. Must choose \(K\) in advance Elbow plot helps, but it is a judgment call.
K-Means struggles with: (i) two interlocking crescent-shaped clusters, (ii) an extreme outlier that drags the centroid toward itself.
⚠️ Always visualise your clusters and check if they make business sense!
Hierarchical Clustering
Background: Hierarchical Clustering
K-Means is a greedy partitioner: you tell it \(K\), it divides the data into \(K\) groups and stops. If you want to see what the data looks like with \(K=3\) and \(K=5\) and \(K=7\), you must run it three separate times. Hierarchical clustering takes a different philosophy: run the algorithm once and obtain a complete tree of all possible partitions, from \(N\) singletons (every point in its own cluster) to 1 giant group (everything in one cluster). After fitting, you choose \(K\) by cutting the tree at any height.
The agglomerative (bottom-up) approach starts with \(N\) clusters and merges them one pair at a time. At each step, the algorithm finds the two closest clusters and merges them. This merge is recorded as a node in the dendrogram at a height equal to the distance between the two clusters. After \(N-1\) merges, all points are in one cluster and the tree is complete.
The divisive (top-down) approach works in the opposite direction: start with all points in one cluster and recursively split. Divisive clustering is less commonly used in practice — it is computationally expensive and the splits are harder to define — so this chapter focuses on agglomerative methods.
Linkage methods define how you measure the distance between two clusters (not two points). This choice matters enormously:
- Single linkage uses the minimum distance between any two points across the two clusters. It is fast to compute but produces the “chaining” problem: a cluster can grow by absorbing one point at a time, creating long, stringy chains rather than compact groups.
- Complete linkage uses the maximum distance. It produces compact, roughly equal-sized clusters but is sensitive to outliers (one outlier in a cluster inflates the inter-cluster distance).
- Average linkage uses the mean of all pairwise distances — a reasonable compromise.
- Ward’s method (Joe Ward, 1963) asks: if we merge these two clusters, how much does total WCSS increase? It merges the pair that produces the smallest WCSS increase. Ward’s is the default for most business applications because it produces the most balanced, well-separated clusters and corresponds to the same objective function as K-Means.
The dendrogram is the visual output of hierarchical clustering. Each leaf at the bottom represents one data point. Moving up the tree, you see merges happening at increasing heights. The height of a merge reflects how dissimilar the two clusters were when they joined. Tall vertical lines before a merge mean the two groups were far apart — a clear separation. Short lines mean the groups were similar and the split at that level may not be meaningful.
Hierarchical clustering scales as \(O(n^2 \log n)\) in time and \(O(n^2)\) in memory (it must store the full distance matrix). For datasets with \(n > 10{,}000\) observations, K-Means is typically more practical. Hierarchical clustering is the preferred choice for \(n < 1{,}000\) when you want to explore the full cluster structure without committing to a specific \(K\) in advance — common in exploratory research, financial analysis, and genomics.
Hierarchical Clustering: Bottom Up
Picture three stages:
- Start: \(N\) clusters — each point A, B, C, D is its own cluster
- Merge closest — A and B join into one cluster
- Merge again — {A, B} joins with C into {A, B, C}; D remains separate
Agglomerative algorithm:
- Start with each data point as its own cluster (\(N\) clusters)
- Compute distance between all pairs of clusters
- Merge the two closest clusters
- Repeat steps 2–3 until only 1 cluster remains
Unlike K-Means, hierarchical clustering does not require you to pre-specify \(K\). You choose \(K\) afterwards, by cutting the dendrogram.
How to Measure Cluster Distance? — Linkage Methods
| Linkage | Distance between clusters | Characteristics |
|---|---|---|
| Single | Min distance between any two points | Can create long “chaining” clusters |
| Complete | Max distance between any two points | Tends to produce compact clusters |
| Average | Mean distance between all point pairs | Compromise between single & complete |
| Ward | Increase in total within-cluster variance | Most balanced, compact clusters |
Ward linkage tends to give the most balanced, well-separated clusters. It is the recommended default for most business applications.
Computing Average Linkage: A Worked Example
Suppose we want to merge cluster \(\{A, B\}\) with cluster \(\{C, D\}\) using Average Linkage.
Average linkage = mean of all pairwise distances between the two clusters.
| Pair | Distance |
|---|---|
| \(d(A, C)\) | 5 |
| \(d(A, D)\) | 9 |
| \(d(B, C)\) | 4 |
| \(d(B, D)\) | 7 |
| Average | \(\frac{5+9+4+7}{4} = \mathbf{6.25}\) |
There are \(2 \times 2 = 4\) pairs between the two clusters, so we average all 4 distances.
Compare with other linkages:
- Single = \(\min(5,9,4,7) = 4\)
- Complete = \(\max(5,9,4,7) = 9\)
- Average = \(6.25\) (in between!)
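The three linkage values can be checked in one line each; the four pairwise distances below come straight from the table above:

```python
import numpy as np

# All pairwise distances between {A, B} and {C, D}
pair_dists = np.array([5, 9, 4, 7])   # d(A,C), d(A,D), d(B,C), d(B,D)

print(pair_dists.min())    # 4     -> single linkage
print(pair_dists.max())    # 9     -> complete linkage
print(pair_dists.mean())   # 6.25  -> average linkage
```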
Hierarchical Clustering: Merge Example with 5 Points
Distance matrix (Euclidean):
| | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 0 | 2 | 6 | 10 | 9 |
| B | 2 | 0 | 5 | 9 | 8 |
| C | 6 | 5 | 0 | 4 | 5 |
| D | 10 | 9 | 4 | 0 | 3 |
| E | 9 | 8 | 5 | 3 | 0 |
Step-by-step merges (single linkage — use minimum distance):
- Merge A & B (distance = 2, the smallest)
- Merge D & E (distance = 3, next smallest)
- Merge C with {D,E} (distance = min(d(C,D), d(C,E)) = min(4, 5) = 4)
- Merge {A,B} with {C,D,E} (distance = min(6, 10, 9, 5, 9, 8) = 5)
The resulting mini-dendrogram has merges at heights 2, 3, 4, and 5.
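SciPy reproduces exactly these merges. linkage expects the distance matrix in condensed form (the upper triangle flattened into a vector), which squareform provides. A quick check:

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

D = np.array([[0,  2,  6, 10,  9],   # A
              [2,  0,  5,  9,  8],   # B
              [6,  5,  0,  4,  5],   # C
              [10, 9,  4,  0,  3],   # D
              [9,  8,  5,  3,  0]])  # E

Z = linkage(squareform(D), method="single")
print(Z[:, 2])   # merge heights: [2. 3. 4. 5.]
```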
The Dendrogram: Reading the Tree
A dendrogram tree:
- A and B merge at height 1
- D and E merge at height 1.5
- {A,B} joins C at height 3
- {A,B,C} joins {D,E} at height 5.5
- A red dashed cut line at height 2.5 gives K = 3
How to read:
- Leaves (bottom) = individual data points
- Height of a merge = distance at which two clusters joined
- Cut the tree horizontally to get \(K\) clusters
- Cutting at height 2.5 gives 3 clusters: {A,B}, {C}, {D,E}
Tall vertical lines before a merge mean those clusters are well separated. Short lines mean the groups are similar.
Dendrogram in Python
scipy gives you the dendrogram visualization; sklearn gives you the cluster labels directly.
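A sketch of both routes on a simulated dataset with three well-separated groups (the simulation stands in for your standardized features):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering

# Simulated data: three well-separated groups
rng = np.random.default_rng(1)
X_scaled = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
                      for c in [(0, 0), (4, 4), (0, 4)]])

# scipy: build the full merge tree and draw the dendrogram
Z = linkage(X_scaled, method="ward")
dendrogram(Z)
plt.axhline(y=5, color="red", linestyle="--")   # example cut line; adjust to your tree
plt.title("Ward Dendrogram")
plt.show()

# sklearn: get cluster labels directly for a chosen K
hc = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = hc.fit_predict(X_scaled)
print(np.bincount(labels))   # cluster sizes
```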
Interpretation: The dendrogram height where two branches merge tells you how different those groups were when they joined. In this simulated dataset (three well-separated groups), you should see two early merges at low heights — points within the same true cluster joining each other — followed by a large gap before the inter-group merges at much higher heights. The correct \(K\) is the number of branches below the largest gap: if a horizontal cut placed inside that gap crosses three vertical branches, then \(K=3\). The red cut line shows where you slice the tree horizontally to obtain your chosen number of clusters.
The AgglomerativeClustering output (cluster labels 0, 1, 2 …) is directly analogous to K-Means labels — you can add them to a DataFrame and use groupby to profile each cluster exactly as in Section 2.
Pitfall: choosing the wrong linkage for your data’s shape. Single linkage is prone to chaining — it will happily merge two compact groups through a thin bridge of outlier points. Ward linkage assumes roughly spherical, equal-variance clusters. For data with elongated clusters or very unequal cluster sizes, average linkage or complete linkage may be more appropriate. Visualize the dendrogram and the resulting scatter plot to check that the clusters look sensible.
Try It! — Hierarchical vs. K-Means
Challenge: Using the same customer dataset:
- Build a dendrogram using Ward linkage
- Cut it to get 5 clusters
- Compare the cluster assignments to those from K-Means (\(K=5\))
- Do the two methods agree? On how many customers do they disagree?
Hint: Use scipy.cluster.hierarchy.fcluster(Z, t=5, criterion='maxclust')
K-Means vs. Hierarchical: When to Use Which?
Decision flowchart:
- Know \(K\)?
- No → Hierarchical (explore tree)
- Yes → next question
- Large data (\(n > 1000\))?
- Yes → K-Means (faster)
- No → Try both!
Practical decision rules:
- K-Means: fast, simple, works well with round, evenly sized clusters
- Hierarchical: see the full merging structure; decide \(K\) after
- When in doubt: run both and compare — if they agree, more confidence in your clusters
⭐ Bottom line: No single “correct” method. Try both, compare, and let the data guide you.
Case Study: Forbes Financial Data
Background: Clustering for Industry Classification
One of the most compelling validations of clustering in finance is that financial ratios encode industry membership. Companies in the same industry face similar competitive dynamics, capital structures, growth opportunities, and accounting conventions — and these similarities show up systematically in the numbers. A chemical company and a pharmaceutical company both have high capital expenditures and research spending, but the chemical company has lower margins and a more cyclical revenue profile. A grocery retailer has thin margins, high asset turnover, and a very different payout ratio from a healthcare firm.
If clustering can recover industry groups from financial data alone — without ever seeing the industry labels — that validates two things simultaneously: (1) the clustering algorithm is working correctly, and (2) the financial features we chose actually carry industry-relevant information. This is a form of external validation: we compare the algorithm’s output to a ground-truth labeling we withheld.
The Forbes financial dataset is a classic teaching example for this reason. It contains 25 companies from three industries (Chemical, Healthcare, Grocery) described by seven financial metrics. The question is: given only the numbers, can Ward’s hierarchical clustering recover the three industries? The answer is: mostly yes, with the interesting exceptions being the “financially anomalous” companies that are outliers within their own industry — they look like a different industry on paper.
This exercise also introduces an important idea: clustering is not always about discovering unknown structure. Sometimes you already know the groups and you use clustering to ask, “are these groups financially distinguishable?” The cross-tabulation of cluster assignments vs. known industry labels is a rigorous answer to that question.
Financial institutions use clustering for industry classification as a complement to SIC (Standard Industrial Classification) and NAICS codes. Official industry codes are self-reported and sometimes outdated; financial-ratio clustering can reveal that a “conglomerate” has a financial profile closer to a utilities company, or that a “technology” company actually behaves like a services firm. This has implications for peer group selection in equity valuation, benchmark construction, and risk factor attribution.
Forbes Data: Overview
Dataset: 25 Forbes-listed companies
Variables (7 financial metrics):
- PE ratio, 5-year return on revenue
- Debt-to-equity ratio
- 5-year sales growth, 5-year EPS growth
- Net profit margin, payout ratio
Known industries: Chemical (14), Healthcare (5), Grocery (6)
Can the algorithm discover industry groups without knowing the labels?
```python
# Example: load a real CSV from the book's data folder
import pandas as pd
from sklearn.preprocessing import StandardScaler

Forbes = pd.read_csv("data/forbes_companies.csv")
X = Forbes.iloc[:, 3:]                 # numeric columns only
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
Forbes: Building the Dendrogram
- We use Ward linkage (minimises within-cluster variance) — the recommended default
- The red dashed line shows where we cut to get \(K=3\) clusters
- We then compare the algorithm’s clusters to the known industry labels
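A sketch of this comparison, continuing from the Forbes and X_scaled objects created above; the column names used for the dendrogram labels and the cross-tab are assumptions about how the CSV is laid out:

```python
import matplotlib.pyplot as plt
import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

Z = linkage(X_scaled, method="ward")                 # Ward linkage on the standardized metrics

dendrogram(Z, labels=Forbes.iloc[:, 0].values)       # assumes column 0 holds the company code
plt.axhline(y=8, color="red", linestyle="--")        # example cut height; adjust to your tree
plt.title("Forbes companies — Ward dendrogram")
plt.show()

clusters = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into K = 3 clusters
print(pd.crosstab(Forbes["Industry"], clusters))     # assumes an 'Industry' column exists
```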
Interpretation: The cross-tabulation (pd.crosstab) is the key output. A perfect recovery would show each cluster mapping exactly to one industry. In practice, you expect near-perfect recovery for the most financially distinctive industry (Healthcare, with its uniquely high PE ratios and margins) and some overlap between the two more similar industries (Chemical and Grocery). “Misclassified” companies are not errors — they are the most interesting data points. A chemical company that clusters with grocers has a financial profile more characteristic of a low-margin, high-turnover business. That is a finding worth investigating.
The dendrogram also reveals within-industry structure. Among the 14 chemical companies, some merge early (at low height) into tight sub-clusters of 2–3 firms, while others are more isolated. These sub-clusters may correspond to sub-sectors (e.g., specialty chemicals vs. commodity chemicals) that the algorithm discovered without any labels.
Pitfall: clustering on irrelevant features. The Forbes analysis uses seven financial metrics that are known to be informative about industry membership. If you added irrelevant features — say, the CEO’s age, or the stock price on a random day — you would introduce noise that distorts distances and degrades cluster quality. Feature selection matters as much in clustering as in supervised learning. Include only features that you believe carry information about the natural groupings you are trying to discover.
Pitfall: assuming clusters are “correct”. The K=3 clustering of Forbes companies is useful — it tells us Healthcare is financially distinct from Chemical and Grocery — but it is not “true” in any deep sense. A K=4 solution might split Chemical into specialty and commodity sub-groups, which could be equally valid. Clustering is a tool for description and exploration, not for establishing ground truth. The business analyst’s judgment about which solution is most actionable is part of the method.
Forbes: The Dendrogram
Reading the Forbes dendrogram (from the previous slide’s plot):
- Companies that merge at low height are very similar (e.g., chemical firms with comparable PE and payout ratios)
- The red dashed line cuts the tree into 3 clusters
- Read bottom to top: which companies group together?
Trace specific company codes (C1, C2, …, H1, H2, …, G1, G2, …) and note that the healthcare firms (H*) tend to form their own branch high up — they look very different from the chemicals and groceries on financial metrics alone.
Forbes: Interpretation and Cross-Tab
Cluster vs. Industry cross-tabulation:
| Industry | C1 | C2 | C3 |
|---|---|---|---|
| Chemical (14) | 0 | 11 | 3 |
| Grocery (6) | 0 | 4 | 2 |
| Healthcare (5) | 4 | 0 | 1 |
Reading the table:
- C1 = almost pure Healthcare (4 of 5)
- C2 = mostly Chemical + Grocery
- C3 = mixed (3 Chem + 2 Groc + 1 Heal)
Key insights:
- Healthcare is the most distinct — high PE, high growth, high margins
- Chemical and Grocery overlap — similar margins and payout ratios
- “Misclassified” companies have financial profiles resembling another industry
⭐ Takeaway: Clustering recovered meaningful industry groups from financial data alone — companies in the same industry share “financial DNA.”
Case Study: Customer Segmentation
Background: The Marketing Science of Segmentation
Customer segmentation is the oldest application of clustering in business. Long before the term “data science” existed, marketers used demographic surveys and purchase records to classify customers into groups and tailor campaigns accordingly. What has changed is scale and precision: modern companies have behavioral data on millions of customers, and clustering algorithms can extract segments that are far more nuanced than the traditional demographic buckets (age group, income bracket, gender).
The foundational insight is the 80/20 rule (Pareto principle): in most retail businesses, roughly 20% of customers generate 80% of revenue. Knowing which 20% — and more importantly, what they look like so you can find more like them — is enormously valuable. Clustering is one of the primary tools for answering this question.
But the value of segmentation extends far beyond identifying the top customers. Each segment has its own price sensitivity, preferred communication channel, motivational triggers, and lifetime value trajectory. A “Budget Shopper” who visits frequently but spends little per visit may have enormous lifetime value if you can convert them to a higher-spending segment with the right loyalty program. An “Impulsive Buyer” with a high spending score but modest income may be vulnerable to economic downturns — a credit risk for BNPL (Buy Now Pay Later) programs.
The three-feature mall customer dataset (Age, Annual Income, Spending Score) is deliberately simplified for teaching purposes. In practice, a retail bank’s customer segmentation model might include hundreds of behavioral features: transaction frequency, average transaction size, product categories purchased, days since last purchase, response rate to past campaigns, customer service contact history, digital engagement metrics. The mechanics are identical — standardize, cluster, profile — but the interpretation requires deeper domain expertise.
Telecom customer churn segmentation is a canonical industry application. A telecom company clusters customers by usage patterns, contract tenure, complaint history, and plan type. One segment — typically younger, month-to-month contract, heavy data user — has 3–4× the churn rate of the overall population. By identifying this segment proactively, the retention team can offer targeted upgrades or discounts before the customer decides to leave, converting a reactive “save” into a proactive retention. Studies suggest proactive retention of high-churn segments costs 5–10× less than acquiring an equivalent new customer.
Mall Customer Segmentation: Business Context
Scenario: A shopping mall wants to understand its customers to target marketing.
Variables:
- 💻 Age (18–70)
- 💻 Annual Income ($15k–$140k)
- 💻 Spending Score (1–100)
Goal: Segment 200 customers into meaningful groups for targeted marketing campaigns.
A 2D map of (Income, Spending) showing five blobs in different corners — high-income/high-spending VIPs, high-income/low-spending careful spenders, low-income/high-spending impulsive buyers, low-income/low-spending budget shoppers, and middle-of-the-pack average customers.
Mall Customers: Elbow Plot
The elbow at \(K=5\) suggests five natural customer segments.
Interpretation: The steep drop from \(K=1\) to \(K=5\) and then the much flatter curve beyond \(K=5\) confirms that five is the natural number of segments in this data. From \(K=5\) to \(K=6\), WCSS drops by only a small amount relative to earlier steps — the sixth cluster is subdividing an already well-defined group, not capturing a new natural grouping. In a real dataset the elbow might be less sharp. When the elbow plot is ambiguous between \(K=4\) and \(K=5\), a useful sanity check is to run both and ask: do the two extra segments in \(K=5\) have meaningfully different profiles and require different marketing strategies? If yes, use 5. If the two extra segments look essentially the same, use 4.
Mall Customers: K-Means with K=5
The plot shows five clearly separated groups in (Income, Spending) space — one in each “corner” plus one in the middle.
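A sketch of how that plot is produced; mall is assumed to be the customer DataFrame, X_scaled its standardized features, and the column names below are assumptions about how the dataset is labeled:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

km = KMeans(n_clusters=5, random_state=42, n_init=10)
mall["Cluster"] = km.fit_predict(X_scaled)

plt.scatter(mall["Annual Income"], mall["Spending Score"],
            c=mall["Cluster"], cmap="tab10")
plt.xlabel("Annual Income ($k)")
plt.ylabel("Spending Score")
plt.title("Mall customers — K-Means with K = 5")
plt.show()
```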
Interpretation: The scatter plot immediately reveals the business-relevant structure: the five segments occupy distinct regions of (Income, Spending Score) space. This geometric separation is what makes the clustering actionable — each segment occupies a different “quadrant” of the customer behavior space. Notice that the cluster in the upper-right corner (high income, high spending) is the smallest by count but highest in value per customer. The cluster in the lower-left (low income, low spending) is probably the largest by count but lowest in value. The “average” cluster in the middle has the most variance — it is the catch-all for customers who do not fit neatly into any corner.
Pitfall: forgetting that clustering is for description, not prediction. The cluster labels assigned by K-Means do not predict whether a new customer will be a “VIP” — they describe the current customer base. To predict segment membership for new customers, you need a supervised classifier (e.g., logistic regression or decision tree) trained on the cluster labels as \(Y\). Clustering produces the labels; supervised learning uses them.
Mall Customers: Naming the Segments
| Cluster | Name | Income | Score | Marketing Strategy |
|---|---|---|---|---|
| 1 | VIP Clients | High | High | Loyalty programs, exclusive events |
| 3 | Impulsive Buyers | Low | High | Affordable premium, BNPL |
| 0 | Average Customers | Mid | Mid | Newsletters, general promos |
| 2 | Careful Spenders | High | Low | Personalized outreach, rewards |
| 4 | Budget Shoppers | Low | Low | Discount bundles, value packs |
Clustering transforms raw data into actionable business segments. Each cluster gets a tailored strategy — that is the real value of unsupervised learning.
Profiling Clusters with groupby
How do we know what each cluster “looks like”? Use groupby to compute the mean of each feature per cluster — the same pandas skill from Topic 2!
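A sketch of the profiling step, assuming the cluster labels were added to the mall DataFrame as in the previous slide (the column names are assumptions about the dataset):

```python
# Mean of every feature within each cluster = the centroid profile table
profile = mall.groupby("Cluster")[["Age", "Annual Income", "Spending Score"]].mean().round(1)
print(profile)

# How many customers fall in each segment?
print(mall["Cluster"].value_counts().sort_index())
```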
Example output (numbers vary slightly by seed):
| Cluster | Avg Age | Avg Income ($k) | Avg Spending Score |
|---|---|---|---|
| 0 | 42.7 | 55.3 | 49.5 |
| 1 | 32.7 | 86.5 | 82.1 |
| 2 | 41.1 | 88.2 | 17.1 |
| 3 | 25.3 | 25.7 | 79.4 |
| 4 | 45.2 | 26.3 | 20.9 |
The cluster profile table is how you name your segments. Cluster 1 has high income and high spending \(\Rightarrow\) “VIP.” Cluster 4 has low income and low spending \(\Rightarrow\) “Budget.” The numbers tell the story!
Interpretation: The profile table is the bridge from algorithm output to business action. Reading across each row, you reconstruct a “persona” for each cluster. Cluster 1 (high income \(\approx \$87k\), high spending score \(\approx 82\), relatively young \(\approx 33\)) is the VIP segment: affluent, engaged, and likely already loyal. The marketing strategy is retention: premium loyalty rewards, early access to new products, exclusive events. Cluster 4 (low income \(\approx \$26k\), low spending score \(\approx 21\), older \(\approx 45\)) is the Budget segment: value-sensitive, infrequent high-value purchases. The strategy is conversion: discount bundles, value-pack promotions, savings messaging.
The centroid coordinates (the cluster means) are the mathematical definition of the “average member” of each segment. They are the most useful single summary of a cluster for business communication: instead of describing a distribution, you describe a representative persona.
Naming segments is as much art as science. Common industry naming conventions: Transactors / Revolvers / Dormant (banking); Champions / Loyal / At-Risk / Lost (RFM marketing); Prime / Standard / Subprime (credit); Early Adopters / Mainstream / Laggards (technology adoption). The names should be memorable, non-pejorative (avoid calling any segment “bad”), and actionable — the name itself should suggest the strategy.
Extension: Try mall.groupby('Cluster').describe() to see not just the mean, but also min, max, and standard deviation for each cluster. Which cluster has the most variation?
Hierarchical vs. K-Means agreement. In the Try It! challenge from Section 3, you compared hierarchical clustering (Ward, K=5) to K-Means (K=5) on the same customer data. When both methods agree on the assignment of most customers, you have strong evidence that the clusters are real structure in the data, not artifacts of one algorithm. When they disagree on many customers, the cluster boundary in that region is ambiguous — worth investigating whether those customers belong to a distinct segment or simply live in an overlapping zone between two segments.
Summary
Chapter Summary: What You Have Learned
This chapter introduced the two foundational algorithms of unsupervised learning — K-Means and hierarchical clustering — and demonstrated their application to two real-world business problems. Here is what you should take away.
The unsupervised learning landscape. Clustering is the most widely deployed unsupervised method, but it sits within a broader family:
- K-Means and K-Medoids: partition-based; fast; require \(K\) in advance
- Hierarchical (agglomerative): tree-based; no \(K\) required upfront; produces dendrogram
- DBSCAN (Density-Based Spatial Clustering): finds clusters of arbitrary shape; handles noise/outliers explicitly; particularly useful for geographic data
- Gaussian Mixture Models (GMM): probabilistic, soft-assignment version of K-Means; useful when cluster boundaries are fuzzy
- Spectral Clustering: uses the eigenstructure of the similarity matrix; handles non-convex clusters; computationally expensive for large \(n\)
For most business applications, K-Means or Ward hierarchical clustering will get you 90% of the way there. Start simple.
The 5-step clustering workflow:
- Collect — gather relevant features that you believe encode group membership
- Standardize — apply StandardScaler so every feature has equal weight in distance calculations
- Choose \(K\) — use the Elbow Plot for K-Means; use the dendrogram height gap for hierarchical
- Cluster — fit the algorithm; for K-Means, use n_init=10 and a fixed random_state
- Interpret — use groupby to profile clusters; name segments; connect to business actions
Step 5 is where most of the value is created — and where most of the effort should go. The algorithm does the math; you provide the judgment.
Decision flowchart: which method?
- \(n > 10{,}000\) and speed matters → K-Means
- \(n < 1{,}000\) and you want to explore the full structure → Hierarchical (Ward)
- You do not know \(K\) and want to discover it → Hierarchical (cut the dendrogram)
- You want soft cluster memberships → Gaussian Mixture Model
- Clusters may be non-spherical or of very different densities → DBSCAN
- In doubt → run both K-Means and hierarchical; compare results
What’s next. Clustering operates in the original feature space. When you have many features (say, 50 financial ratios), the distance calculations become noisy — in high dimensions, all points tend to be roughly equidistant from each other (the “curse of dimensionality”). The solution is dimensionality reduction: project the data into a lower-dimensional space that preserves the essential variation. Principal Component Analysis (PCA) is the linear version — it finds the directions of maximum variance and projects the data onto the first \(k\) principal components. t-SNE and UMAP are nonlinear alternatives that preserve local neighborhood structure, commonly used for visualizing high-dimensional clusters. These topics are the natural next step after mastering the algorithms in this chapter.
Further reading. The canonical references for the material in this chapter are:
- The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman, 2009) — Chapter 14 covers unsupervised methods rigorously, including K-Means, hierarchical clustering, and principal components. Freely available at https://web.stanford.edu/~hastie/ElemStatLearn/
- An Introduction to Statistical Learning (James, Witten, Hastie, and Tibshirani, 2021) — Chapter 12 covers the same material at a more accessible level, with R code examples. Freely available at https://www.statlearning.com/
Final reflection. Every algorithm in this chapter is a mathematical object: it optimizes a criterion, converges to a fixed point, produces a partition. But the hard work — deciding which features to cluster on, choosing \(K\), naming the segments, and designing the marketing strategy for each — requires human judgment, domain knowledge, and creativity that no algorithm possesses. The machine finds the pattern; you decide what it means and what to do about it. That is the fundamental reason why data science skills amplify rather than replace business judgment.
Clustering is where mathematics meets business judgment. The algorithm finds the groups; you name them and act on them. Neither step works without the other.
Supervised vs. Unsupervised Learning
| | Supervised (Topics 1–3) | Unsupervised (Topic 4) |
|---|---|---|
| Goal | Predict \(Y\) | Find groups |
| Labels? | Yes (need \(Y\)) | No (no target) |
| Methods | Regression, classification | Clustering, PCA |
| Evaluation | \(R^2\), MSE, accuracy | Inertia, silhouette, domain expertise |
| Example | Predict house price | Segment customers |
K-Means vs. Hierarchical Clustering
| | K-Means | Hierarchical |
|---|---|---|
| Choose \(K\)? | Before running | After (cut dendrogram) |
| Speed | Fast (large datasets) | Slower (\(O(n^2)\) memory) |
| Visualization | Scatter plot | Dendrogram |
| Shape | Spherical clusters | Any shape |
| Reproducibility | Depends on init | Deterministic |
| Best for | Large \(n\), clear groups | Small \(n\), exploring structure |
Use K-Means for speed and simplicity. Use Hierarchical for richer exploration and when you are unsure about \(K\). In practice, try both and compare!