Q-Learning is a model-free reinforcement learning algorithm that learns the value of actions in states by iteratively updating a Q-table. It's designed to find the optimal action-selection policy for a given finite Markov Decision Process (MDP).
The Q-Learning Algorithm (a runnable sketch follows the steps):
- Initialize the Q-table: set all Q(s,a) values to zero (or small random values)
- For each episode:
  - Initialize state s
  - For each step in the episode:
    - Choose action a from s using a policy derived from Q (e.g., ε-greedy)
    - Take action a; observe reward r and next state s'
    - Update Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)]
    - s ← s'
    - If s is terminal, end the episode
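To ground these steps, here is a minimal tabular implementation in Python. It is a sketch, not a definitive implementation: the environment interface (reset() returning a state, step(a) returning (next_state, reward, done)) and all hyperparameter defaults are assumptions, roughly matching Gym-style discrete environments.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning; returns the learned Q-table."""
    Q = np.zeros((n_states, n_actions))          # initialize Q(s,a) to zero
    for _ in range(episodes):
        s = env.reset()                          # initialize state s
        done = False
        while not done:
            # epsilon-greedy: explore with probability epsilon, else act greedily
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)        # take action, observe r and s'
            # TD target: max over next-state actions; no bootstrap if terminal
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])  # the Q-Learning update
            s = s_next
    return Q
```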
Key Parameters:
- Learning Rate (α): Controls how much new information overrides old information. Higher values make the agent learn faster but can destabilize learning (the worked example after this list shows the update numerically).
- Discount Factor (γ): Determines the importance of future rewards. A value close to 0 makes the agent "myopic" (focused on immediate rewards), while values close to 1 make it strive for long-term rewards.
- Exploration Rate (ε): Controls the exploration-exploitation trade-off. Higher values encourage more random exploration.
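To make the roles of α and γ concrete, here is one hand-worked update step in Python; every number (the current estimate, the reward, the next-state maximum) is invented purely for illustration.

```python
# One hand-worked Q-Learning update.
alpha, gamma = 0.1, 0.9
q_sa = 2.0          # current estimate Q(s,a)
r = 1.0             # observed reward
max_q_next = 3.0    # max_a' Q(s',a')

td_target = r + gamma * max_q_next   # 1.0 + 0.9 * 3.0 = 3.7
td_error = td_target - q_sa          # 3.7 - 2.0 = 1.7
q_sa_new = q_sa + alpha * td_error   # 2.0 + 0.1 * 1.7 = 2.17
print(q_sa_new)                      # 2.17: a small, stable step toward the target
```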
Benefits of Q-Learning:
- Model-free: Learns directly from interaction with the environment
- Off-policy: Learns the optimal policy independently of the policy the agent actually follows while collecting experience (e.g., an exploratory ε-greedy policy)
- Can handle stochastic transitions and rewards
- Proven to converge to the optimal Q-values in finite (tabular) MDPs, provided every state-action pair is visited infinitely often and the learning rate decays appropriately
Limitations:
- Suffers from the curse of dimensionality - Q-table grows exponentially with state/action space
- Struggles with continuous state or action spaces without discretization (see the binning sketch after this list)
- Slow convergence in complex environments
- Can overestimate Q-values in stochastic environments, because the max in the update both selects and evaluates actions using the same noisy estimates
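One common workaround for the continuous-state limitation noted above is to bin each state dimension so a table can index it. A minimal sketch, assuming known per-dimension ranges and a fixed bin count (both illustrative):

```python
import numpy as np

def make_discretizer(lows, highs, n_bins):
    """Map continuous observations to tuples of bin indices (Q-table keys)."""
    # Interior bin edges per dimension; outermost edges are implied.
    edges = [np.linspace(lo, hi, n_bins + 1)[1:-1]
             for lo, hi in zip(lows, highs)]
    def discretize(obs):
        return tuple(int(np.digitize(x, e)) for x, e in zip(obs, edges))
    return discretize

# Usage: a 2-D state in [-1, 1] x [0, 5], 10 bins per dimension (illustrative)
discretize = make_discretizer([-1.0, 0.0], [1.0, 5.0], 10)
print(discretize([0.3, 4.2]))  # -> (6, 8), a discrete index into a Q-table
```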
Extensions and Variations:
- Deep Q-Network (DQN): Uses neural networks to approximate the Q-function for large state spaces
- Double Q-Learning: Reduces overestimation bias by decoupling action selection from action evaluation (sketched after this list)
- SARSA: On-policy variant that updates based on the actual next action taken
- Expected SARSA: Updates using the expected value of next state-action pairs
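As a concrete instance of the Double Q-Learning update mentioned above (van Hasselt, 2010), here is a tabular sketch; the function signature and hyperparameter values are illustrative. One table selects the greedy next action while the other evaluates it, which damps the overestimation a single max produces:

```python
import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One Double Q-Learning step on two Q-tables (NumPy arrays)."""
    # Flip a fair coin to decide which table is updated this step.
    if np.random.rand() < 0.5:
        QA, QB = QB, QA                      # swap roles (arrays shared, not copied)
    a_star = int(np.argmax(QA[s_next]))      # selection: greedy action under one table
    # Evaluation: value of that action under the *other* table; zero if terminal.
    target = r if done else r + gamma * QB[s_next, a_star]
    QA[s, a] += alpha * (target - QA[s, a])  # standard TD step on the chosen table
```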