t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique that visualizes high-dimensional data by giving each datapoint a location in a 2D or 3D map. It excels at preserving local structure, making it ideal for revealing clusters and patterns.
The t-SNE Algorithm:
- Compute Pairwise Affinities: Convert distances between points in the high-dimensional space into conditional probabilities using a Gaussian kernel centered on each point, so closer points receive higher probabilities
- Initialize Low-Dimensional Embedding: Start with a random arrangement of points in the lower-dimensional space
- Compute Low-Dimensional Affinities: Calculate analogous pairwise probabilities in the low-dimensional space using a Student's t-distribution with one degree of freedom, whose heavier tails (compared to a Gaussian) let dissimilar points sit far apart in the map
- Minimize KL Divergence: Use gradient descent to minimize the difference between the high and low-dimensional probability distributions
- Early Exaggeration: Initially multiply the high-dimensional probabilities to create wider gaps between natural clusters
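The first and third steps above can be sketched in NumPy. This is a simplified illustration, not a full implementation: it uses a fixed Gaussian bandwidth `sigma` for every point, whereas real t-SNE tunes a per-point bandwidth to match a target perplexity.

```python
import numpy as np

def hd_affinities(X, sigma=1.0):
    """Symmetrized high-dimensional affinities p_ij from Gaussian kernels
    (fixed bandwidth here; real t-SNE searches a per-point sigma)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                    # p_{i|i} = 0
    P /= P.sum(axis=1, keepdims=True)           # normalize each row
    return (P + P.T) / (2.0 * len(X))           # symmetrize: sums to 1

def ld_affinities(Y):
    """Low-dimensional similarities q_ij from a Student-t kernel with one
    degree of freedom, giving the heavy tails t-SNE is named for."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + sq)
    np.fill_diagonal(num, 0.0)
    return num / num.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # toy high-dimensional data
Y = rng.normal(size=(20, 2)) * 1e-4   # small random low-dimensional init
P, Q = hd_affinities(X), ld_affinities(Y)

# The quantity gradient descent minimizes: KL(P || Q).
kl = np.sum(P * np.log(np.maximum(P, 1e-12) / np.maximum(Q, 1e-12)))
```

Minimizing `kl` with respect to `Y` (step 4) moves similar points together and dissimilar points apart in the map.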
Key Parameters:
- Perplexity: Balance between local and global structure (typically 5-50). Roughly equivalent to the number of nearest neighbors that influence each point
- Learning Rate: Controls step size in gradient descent. Too low and points may compress into a dense cloud with few outliers; too high and the map can degenerate into a roughly uniform ball of equidistant points
- Early Exaggeration: Multiplier for initial iterations that helps separate clusters
- Number of Iterations: More iterations can improve the embedding quality but at computational cost
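These parameters map directly onto scikit-learn's `TSNE` estimator, used here as one common implementation; the parameter names below assume its API.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian blobs in 10 dimensions.
X = np.vstack([rng.normal(0.0, 1.0, size=(25, 10)),
               rng.normal(8.0, 1.0, size=(25, 10))])

tsne = TSNE(
    n_components=2,           # dimensionality of the map
    perplexity=5.0,           # effective neighbor count; must be < n_samples
    learning_rate=200.0,      # gradient-descent step size
    early_exaggeration=12.0,  # inflates p_ij early on to separate clusters
    init="pca",               # PCA init is more reproducible than random
    random_state=0,
)
embedding = tsne.fit_transform(X)
print(embedding.shape)        # (50, 2)
```

Note that perplexity must be smaller than the number of samples, which is why a small value is used for this 50-point toy set.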
Applications:
- Visualizing high-dimensional data in 2D or 3D
- Exploring clustering structure in unlabeled data
- Revealing patterns in image, text, and bioinformatics data
- Feature extraction for downstream machine learning tasks
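The first two applications can be illustrated with the handwritten-digits dataset bundled with scikit-learn: embed the 64-dimensional images into 2D and color each point by its class to see whether the digits form visible clusters. The filename is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")   # headless backend: render straight to a file
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
X, y = digits.data[:300], digits.target[:300]  # 300 samples, 64-D each

emb = TSNE(n_components=2, perplexity=30.0, init="pca",
           random_state=0).fit_transform(X)

fig, ax = plt.subplots(figsize=(6, 5))
sc = ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=12)
fig.colorbar(sc, ax=ax, label="digit class")
ax.set_title("t-SNE map of handwritten digits")
fig.savefig("tsne_digits.png", dpi=150)
```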
Differences from other techniques:
- Compared to PCA: t-SNE is non-linear and better preserves local structure, but PCA preserves global structure better
- Compared to UMAP: t-SNE focuses more on preserving local structure, while UMAP better balances local and global structure
- Compared to MDS: t-SNE uses probability distributions instead of directly preserving distances
- Advantages: Creates visually appealing plots that intuitively reveal clusters
- Limitations: Sensitive to hyperparameters, computationally intensive, and, because its objective is non-convex, can produce different results across runs
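Two of these contrasts can be demonstrated directly with scikit-learn: PCA is linear and deterministic, while t-SNE with random initialization generally produces a different (but equally valid) map for each seed.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))

# PCA: linear and deterministic; projects onto directions of maximal variance.
pca_emb = PCA(n_components=2).fit_transform(X)

# t-SNE with random init: different seeds give different embeddings.
emb_a = TSNE(n_components=2, perplexity=10.0, init="random",
             random_state=0).fit_transform(X)
emb_b = TSNE(n_components=2, perplexity=10.0, init="random",
             random_state=1).fit_transform(X)

print(pca_emb.shape, emb_a.shape)  # (60, 2) (60, 2)
```

PCA also provides a `transform()` method for projecting new data into an existing embedding; scikit-learn's t-SNE does not, which is one reason PCA remains the more common choice for feature extraction pipelines.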