Uniform Manifold Approximation and Projection (UMAP) is a nonlinear dimensionality reduction technique that aims to preserve the local neighborhood structure of high-dimensional data while retaining more of its global structure than many comparable methods. It is widely used for visualization and as a preprocessing step for clustering.
The UMAP Algorithm:
- Construct Neighborhood Graph: Build a weighted k-nearest neighbor graph representing the high-dimensional manifold structure
- Compute Fuzzy Topological Representation: Convert each point's neighbor distances into membership strengths between 0 and 1, normalized per point so that local density differences are evened out, then merge the directed edges with a fuzzy union into a single weighted graph
- Optimize Low-Dimensional Layout: Starting from an initial layout, find low-dimensional coordinates whose fuzzy graph matches the high-dimensional one, using stochastic gradient descent
- Refine Embedding: Each update pulls connected points together and pushes sampled non-neighbors apart, reducing the cross-entropy between the high-dimensional and low-dimensional fuzzy topological representations
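The first two steps can be sketched in plain Python. This is a minimal illustration, not the reference implementation: it assumes Euclidean distance, a brute-force neighbor search (practical implementations use approximate nearest-neighbor search), and the normalization described in the UMAP paper, where each point's membership strengths are scaled to sum to log2(k). The function names are illustrative and do not come from any library.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_distances(points, k):
    """For each point, return its k nearest neighbors as sorted (distance, index) pairs."""
    result = []
    for i, p in enumerate(points):
        dists = sorted((euclidean(p, q), j) for j, q in enumerate(points) if j != i)
        result.append(dists[:k])
    return result


def find_sigma(dists, rho, target, n_iter=64):
    """Bisection search for sigma so that sum(exp(-max(0, d - rho) / sigma)) is close to target."""
    def total(sigma):
        return sum(math.exp(-max(0.0, d - rho) / sigma) for d in dists)

    lo, hi = 1e-12, 1.0
    while total(hi) < target:  # grow the bracket until it contains the target
        hi *= 2.0
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        if total(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def fuzzy_graph(points, k):
    """Build a symmetric fuzzy graph: {(i, j): weight} with i < j and 0 <= weight <= 1."""
    target = math.log2(k)
    directed = {}
    for i, neighbors in enumerate(knn_distances(points, k)):
        dists = [d for d, _ in neighbors]
        positive = [d for d in dists if d > 0]
        rho = min(positive) if positive else 0.0  # distance to the nearest distinct neighbor
        sigma = find_sigma(dists, rho, target)
        for d, j in neighbors:
            directed[(i, j)] = math.exp(-max(0.0, d - rho) / sigma)

    # Fuzzy union of the two directed edges: a + b - a * b
    graph = {}
    for (i, j), a in directed.items():
        b = directed.get((j, i), 0.0)
        graph[(min(i, j), max(i, j))] = a + b - a * b
    return graph


# Two small groups of points, far apart from each other
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
graph = fuzzy_graph(points, k=3)
```

Because each point's nearest neighbor always receives a directed strength of 1, every point stays connected to at least one other point, which is the local connectivity property the algorithm relies on. Edges that bridge the two groups receive noticeably lower weights than edges within a group.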
Key Parameters:
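- n_neighbors: The size of the local neighborhood used to build the graph, which controls how UMAP balances local versus global structure. Lower values (2-15) emphasize fine local detail; higher values (50-200) preserve more global structure at the cost of local detail
- min_dist: The minimum distance allowed between points in the embedding, which controls how tightly UMAP packs points together. Smaller values create tighter, more clustered embeddings; larger values spread points out more evenly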
- metric: The distance function used to measure similarity between points
- n_components: The dimensionality of the output embedding
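As a hedged illustration of how these parameters are passed in practice, the sketch below assumes the umap-learn Python package, whose `umap.UMAP` class accepts them as keyword arguments. The preset names `LOCAL_FOCUS` and `GLOBAL_FOCUS` and the `embed` helper are hypothetical, and the specific values are examples drawn from the ranges above rather than recommendations.

```python
# Illustrative presets; the names and values are assumptions for this example,
# not settings defined by any library.
LOCAL_FOCUS = {
    "n_neighbors": 10,      # small neighborhood: emphasizes local structure
    "min_dist": 0.05,       # tight packing: compact clusters
    "metric": "euclidean",
    "n_components": 2,      # 2-D output for plotting
}

GLOBAL_FOCUS = {
    "n_neighbors": 100,     # large neighborhood: more global structure
    "min_dist": 0.5,        # looser packing: points spread out
    "metric": "euclidean",
    "n_components": 2,
}


def embed(data, params):
    """Embed `data` (array-like of shape (n_samples, n_features)) with umap-learn.

    Requires the umap-learn package; the import is deferred so the presets
    above can be inspected without it.
    """
    import umap

    reducer = umap.UMAP(**params)
    return reducer.fit_transform(data)
```

Calling `embed(data, LOCAL_FOCUS)` and `embed(data, GLOBAL_FOCUS)` on the same dataset and plotting the results side by side is a quick way to see the trade-off. Note that `n_neighbors` should be smaller than the number of samples.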
Applications:
- Visualization of high-dimensional data
- Pre-processing for machine learning algorithms
- Feature extraction (the embedding coordinates serve as new derived features; UMAP does not select among the original features)
- Discovering clusters and patterns in complex datasets
- Single-cell RNA sequencing analysis
Advantages over other techniques:
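- Often retains more global structure than t-SNE while preserving local neighborhoods comparably well
- Captures nonlinear manifold structure that a linear method such as PCA cannot, although distances in a UMAP embedding should not be read as exact reflections of the original distances
- Typically faster than t-SNE on large datasets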
- Theoretical foundation in Riemannian geometry and algebraic topology
- Works with many distance metrics, so it applies across various domains and data types