Q-Learning is a model-free reinforcement learning algorithm that learns the value of actions in states by iteratively updating a Q-table. It's designed to find the optimal action-selection policy for a given finite Markov Decision Process (MDP).
The Q-Learning Algorithm (a runnable sketch follows the steps):
- Initialize the Q-table: set all Q(s,a) values to zero (or small random values)
- For each episode:
  - Initialize state s
  - For each step in the episode:
    - Choose action a from s using a policy derived from Q (e.g., ε-greedy)
    - Take action a; observe reward r and next state s'
    - Update Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)]
    - s ← s'
    - If s is terminal, end the episode
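To ground these steps, here is a minimal tabular implementation in Python. It is a sketch, not a definitive implementation: the environment interface (reset() returning a state, step(a) returning (next_state, reward, done)) and all hyperparameter defaults are assumptions, roughly matching Gym-style discrete environments.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning; returns the learned Q-table."""
    Q = np.zeros((n_states, n_actions))          # initialize Q(s,a) to zero
    for _ in range(episodes):
        s = env.reset()                          # initialize state s
        done = False
        while not done:
            # epsilon-greedy: explore with probability epsilon, else act greedily
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)        # take action, observe r and s'
            # TD target: max over next-state actions; no bootstrap if terminal
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])  # the Q-Learning update
            s = s_next
    return Q
```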
Key Parameters:
- Learning Rate (α): Controls how much new information overrides old information. Higher values make the agent learn faster but can destabilize learning (the worked example after this list shows the update numerically).
- Discount Factor (γ): Determines the importance of future rewards. A value close to 0 makes the agent "myopic" (focused on immediate rewards), while values close to 1 make it strive for long-term rewards.
- Exploration Rate (ε): Controls the exploration-exploitation trade-off. Higher values encourage more random exploration.
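To make the roles of α and γ concrete, here is one hand-worked update step in Python; every number (the current estimate, the reward, the next-state maximum) is invented purely for illustration.

```python
# One hand-worked Q-Learning update.
alpha, gamma = 0.1, 0.9
q_sa = 2.0          # current estimate Q(s,a)
r = 1.0             # observed reward
max_q_next = 3.0    # max_a' Q(s',a')

td_target = r + gamma * max_q_next   # 1.0 + 0.9 * 3.0 = 3.7
td_error = td_target - q_sa          # 3.7 - 2.0 = 1.7
q_sa_new = q_sa + alpha * td_error   # 2.0 + 0.1 * 1.7 = 2.17
print(q_sa_new)                      # 2.17: a small, stable step toward the target
```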
Benefits of Q-Learning:
- Model-free: Learns directly from interaction with the environment
- Off-policy: Learns the optimal policy independently of the policy the agent actually follows while collecting experience (e.g., an exploratory ε-greedy policy)
- Can handle stochastic transitions and rewards
- Proven to converge to the optimal Q-values in finite (tabular) MDPs, provided every state-action pair is visited infinitely often and the learning rate decays appropriately
Limitations:
- Suffers from the curse of dimensionality - Q-table grows exponentially with state/action space
- Struggles with continuous state or action spaces without discretization (see the binning sketch after this list)
- Slow convergence in complex environments
- Can overestimate Q-values in stochastic environments, because the max in the update both selects and evaluates actions using the same noisy estimates
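One common workaround for the continuous-state limitation noted above is to bin each state dimension so a table can index it. A minimal sketch, assuming known per-dimension ranges and a fixed bin count (both illustrative):

```python
import numpy as np

def make_discretizer(lows, highs, n_bins):
    """Map continuous observations to tuples of bin indices (Q-table keys)."""
    # Interior bin edges per dimension; outermost edges are implied.
    edges = [np.linspace(lo, hi, n_bins + 1)[1:-1]
             for lo, hi in zip(lows, highs)]
    def discretize(obs):
        return tuple(int(np.digitize(x, e)) for x, e in zip(obs, edges))
    return discretize

# Usage: a 2-D state in [-1, 1] x [0, 5], 10 bins per dimension (illustrative)
discretize = make_discretizer([-1.0, 0.0], [1.0, 5.0], 10)
print(discretize([0.3, 4.2]))  # -> (6, 8), a discrete index into a Q-table
```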
Extensions and Variations:
- Deep Q-Network (DQN): Uses neural networks to approximate the Q-function for large state spaces
- Double Q-Learning: Reduces overestimation bias by decoupling action selection from action evaluation (sketched after this list)
- SARSA: On-policy variant that updates based on the actual next action taken
- Expected SARSA: Updates using the expected value of next state-action pairs
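As a concrete instance of the Double Q-Learning update mentioned above (van Hasselt, 2010), here is a tabular sketch; the function signature and hyperparameter values are illustrative. One table selects the greedy next action while the other evaluates it, which damps the overestimation a single max produces:

```python
import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One Double Q-Learning step on two Q-tables (NumPy arrays)."""
    # Flip a fair coin to decide which table is updated this step.
    if np.random.rand() < 0.5:
        QA, QB = QB, QA                      # swap roles (arrays shared, not copied)
    a_star = int(np.argmax(QA[s_next]))      # selection: greedy action under one table
    # Evaluation: value of that action under the *other* table; zero if terminal.
    target = r if done else r + gamma * QB[s_next, a_star]
    QA[s, a] += alpha * (target - QA[s, a])  # standard TD step on the chosen table
```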