Q-Learning Reinforcement Learning

Grid World Environment: an agent learning to navigate through the environment to reach a goal. Legend: Agent, Goal, Trap, Wall, Q-Value Intensity.

Learning Progress: performance metrics over training episodes.

How Q-Learning Works

Q-Learning is a model-free reinforcement learning algorithm that learns the value of taking each action in each state by iteratively updating a Q-table. Given enough exploration, it finds the optimal action-selection policy for a finite Markov Decision Process (MDP).
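
As a concrete sketch (the 5x5 grid size and four-action set are assumptions for illustration, not read from the demo above), the Q-table can be stored as a 2-D NumPy array indexed by state and action:

  import numpy as np

  # Hypothetical 5x5 grid world: 25 discrete states and 4 actions
  # (up, right, down, left). Each (state, action) pair gets one table
  # entry holding the current estimate of its long-term return.
  n_states, n_actions = 5 * 5, 4
  Q = np.zeros((n_states, n_actions))
  greedy_action = int(np.argmax(Q[0]))  # best known action in state 0 (ties -> 0)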

The Q-Learning Algorithm:

  1. Initialize Q-table: Set all Q(s,a) values to zero (or small random values)
  2. For each episode:
    • Initialize state s
    • For each step in the episode:
      • Choose action a from s using policy derived from Q (e.g., ε-greedy)
      • Take action a, observe reward r and next state s'
      • Update Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') - Q(s,a)]
      • s ← s'
      • If s is terminal, end episode
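
A minimal Python sketch of this loop, assuming a tabular environment object whose reset() returns a state index and whose step(action) returns (next_state, reward, done) in the familiar Gym-style convention; the default parameter values are illustrative:

  import numpy as np

  def train_q_learning(env, n_states, n_actions,
                       episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
      """Tabular Q-learning sketch. env is assumed to expose
      reset() -> state and step(action) -> (next_state, reward, done)."""
      Q = np.zeros((n_states, n_actions))        # 1. initialize Q-table to zeros
      rng = np.random.default_rng(0)
      for _ in range(episodes):                  # 2. for each episode
          s = env.reset()                        #    initialize state s
          done = False
          while not done:                        #    for each step in the episode
              # choose a from s with an epsilon-greedy policy derived from Q
              if rng.random() < epsilon:
                  a = int(rng.integers(n_actions))
              else:
                  a = int(np.argmax(Q[s]))
              s_next, r, done = env.step(a)      # take a, observe r and s'
              # Q(s,a) <- Q(s,a) + alpha*[r + gamma*max_a' Q(s',a') - Q(s,a)]
              target = r + gamma * np.max(Q[s_next]) * (not done)
              Q[s, a] += alpha * (target - Q[s, a])
              s = s_next                         # s <- s'
      return Q

Multiplying the bootstrap term by (not done) zeroes out future value at terminal states (goal or trap cells), so the target there is just the immediate reward.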

Key Parameters:

  • Learning Rate (α): Controls how much new information overrides old information. Higher values make the agent learn faster but might lead to instability.
  • Discount Factor (γ): Determines the importance of future rewards. A value close to 0 makes the agent "myopic" (focused on immediate rewards), while values close to 1 make it strive for long-term rewards.
  • Exploration Rate (ε): Controls the exploration-exploitation trade-off. Higher values encourage more random exploration.
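
A common practical refinement, not shown in the pseudocode above, is to anneal ε over episodes so the agent explores heavily at first and exploits its learned Q-values later. A sketch of an exponential decay schedule (constants chosen for illustration):

  def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.995):
      """Exponential epsilon schedule: mostly random actions early on,
      mostly greedy actions once the Q-values have stabilized."""
      return max(eps_min, eps_start * decay ** episode)

  # decayed_epsilon(0) == 1.0; decayed_epsilon(500) is roughly 0.08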

Benefits of Q-Learning:

  • Model-free: Learns directly from interaction with the environment
  • Off-policy: Can learn the optimal (greedy) policy while following a different behavior policy, such as ε-greedy exploration
  • Can handle stochastic transitions and rewards
  • Guaranteed to converge to the optimal Q-values in the tabular setting, given sufficient exploration and an appropriately decaying learning rate

Limitations:

  • Suffers from the curse of dimensionality: the Q-table needs one entry per state-action pair, and the number of states grows exponentially with the number of state variables
  • Struggles with continuous state or action spaces without discretization
  • Slow convergence in complex environments
  • Can overestimate Q-values in stochastic environments

Extensions and Variations:

  • Deep Q-Network (DQN): Uses neural networks to approximate the Q-function for large state spaces
  • Double Q-Learning: Reduces overestimation bias by decoupling action selection from evaluation
  • SARSA: On-policy variant that updates based on the actual next action taken
  • Expected SARSA: Updates using the expected value of next state-action pairs
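
These variants differ mainly in the bootstrap target plugged into the update rule above. A side-by-side sketch of the tabular targets (function names are illustrative; Q, Q1, and Q2 are NumPy Q-tables as before):

  import numpy as np

  def q_learning_target(Q, r, s_next, gamma):
      # Off-policy: bootstrap from the greedy action in s'.
      return r + gamma * np.max(Q[s_next])

  def sarsa_target(Q, r, s_next, a_next, gamma):
      # On-policy: bootstrap from the action a' actually taken in s'.
      return r + gamma * Q[s_next, a_next]

  def expected_sarsa_target(Q, r, s_next, gamma, epsilon):
      # Expectation of Q(s', .) under the epsilon-greedy behavior policy.
      n_actions = Q.shape[1]
      probs = np.full(n_actions, epsilon / n_actions)
      probs[int(np.argmax(Q[s_next]))] += 1.0 - epsilon
      return r + gamma * float(np.dot(probs, Q[s_next]))

  def double_q_target(Q1, Q2, r, s_next, gamma):
      # Double Q-learning: select the action with Q1, evaluate it with Q2,
      # which reduces the overestimation bias of the plain max.
      a_star = int(np.argmax(Q1[s_next]))
      return r + gamma * Q2[s_next, a_star]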