Return to Algorithms

Uniform Manifold Approximation and Projection (UMAP)

Original High-Dimensional Data

Partial view of high-dimensional data (only 3 dimensions shown)

UMAP Projection

Low-dimensional UMAP embedding of the data

Controls

Data Generation

0.1

UMAP Parameters

15
0.10

Visualization Options

4.0

UMAP Stats

Current Step: Not Started
Iteration: -
Status: Not Started

Nearest Neighbor Graph

k-Nearest Neighbor graph used by UMAP

How UMAP Works

Uniform Manifold Approximation and Projection (UMAP) is a dimensionality reduction technique that preserves both local and global structure of high-dimensional data. It's particularly effective for visualization and clustering tasks.

The UMAP Algorithm:

  1. Construct Neighborhood Graph: Build a weighted k-nearest neighbor graph representing the high-dimensional manifold structure
  2. Compute Fuzzy Topological Representation: Transform the neighborhood graph into a fuzzy topological representation using local connectivity patterns
  3. Optimize Low-Dimensional Layout: Find a low-dimensional embedding that preserves the topological structure using stochastic gradient descent
  4. Refine Embedding: Iteratively adjust the low-dimensional coordinates to minimize the difference between high and low-dimensional fuzzy topological representations

Key Parameters:

  • n_neighbors: Controls how UMAP balances local versus global structure. Lower values (2-15) emphasize local structure, higher values (50-200) preserve more global structure
  • min_dist: Controls how tightly UMAP packs points together. Smaller values create tighter, more clustered embeddings
  • metric: The distance function used to measure similarity between points
  • n_components: The dimensionality of the output embedding

Applications:

  • Visualization of high-dimensional data
  • Pre-processing for machine learning algorithms
  • Feature extraction and selection
  • Discovering clusters and patterns in complex datasets
  • Single-cell RNA sequencing analysis

Advantages over other techniques:

  • Better preservation of both local and global structure than t-SNE
  • More faithful representation of distances than PCA
  • Faster computation than t-SNE for large datasets
  • Theoretical foundation in Riemannian geometry and algebraic topology
  • Works well across various domains and data types