Hierarchical Clustering builds a hierarchy of clusters by either a bottom-up (agglomerative) or top-down (divisive) approach. This visualization focuses on the more common agglomerative method.
Agglomerative Clustering Algorithm:
- Start: Each data point begins as its own cluster
- Merge: At each step, the two most similar clusters are merged into a single cluster
- Update: Recalculate distances between the new cluster and all other clusters
- Repeat: Continue merging clusters until only one cluster remains (or until desired cluster count is reached)
Linkage Methods:
- Single Linkage: Distance between the closest points of two clusters (minimum distance)
- Complete Linkage: Distance between the farthest points of two clusters (maximum distance)
- Average Linkage: Average distance between all pairs of points in the two clusters
- Ward's Method: Minimizes the increase in variance for the clusters being merged
Dendrogram:
- A tree diagram that shows the hierarchical relationship between clusters
- Horizontal position of merge points represents the distance at which clusters were merged
- Cutting the dendrogram at a particular height produces a specific number of clusters
Advantages:
- No need to specify the number of clusters in advance
- Produces a dendrogram that provides insights into data structure
- Can identify hierarchical relationships between clusters
Limitations:
- Computationally intensive for large datasets (O(n²) space complexity, O(n³) time complexity in worst case)
- Sensitive to noise and outliers
- Different linkage criteria can yield very different results