Deep Q-Network (DQN) is a reinforcement learning algorithm that combines Q-learning with deep neural networks to handle high-dimensional state spaces. Introduced by DeepMind in 2013 and published in Nature in 2015, it marked a breakthrough in reinforcement learning, enabling agents to learn directly from raw sensory inputs such as game pixels.
Key Components of DQN:
- Neural Network as Function Approximator: Instead of maintaining a Q-table, DQN uses a neural network to approximate the Q-function, allowing it to handle much larger state spaces.
- Experience Replay: DQN stores experiences (state, action, reward, next state) in a replay buffer and samples random batches for training, which breaks correlations between consecutive samples and improves data efficiency.
- Target Network: A separate "target" network is used for generating the targets in the Q-learning update, which is periodically updated with the weights of the main network to improve stability.
- Epsilon-Greedy Exploration: DQN starts with a high exploration rate (epsilon) that gradually decreases over time, balancing exploration and exploitation.
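A minimal sketch of these components, assuming PyTorch, a flat (vector) state, and a discrete action space. The names QNetwork, ReplayBuffer, and epsilon_greedy, as well as the layer sizes, are illustrative choices rather than part of any standard API.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Neural network approximating Q(s, a) for every action in one forward pass."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


class ReplayBuffer:
    """Fixed-capacity buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the correlation between consecutive steps.
        states, actions, rewards, next_states, dones = zip(*random.sample(self.buffer, batch_size))
        return (torch.as_tensor(np.array(states), dtype=torch.float32),
                torch.as_tensor(actions, dtype=torch.int64),
                torch.as_tensor(rewards, dtype=torch.float32),
                torch.as_tensor(np.array(next_states), dtype=torch.float32),
                torch.as_tensor(dones, dtype=torch.float32))

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_net: QNetwork, state: torch.Tensor, epsilon: float, n_actions: int) -> int:
    """With probability epsilon take a random action; otherwise take the greedy action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())
```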
DQN Algorithm:
- Initialize replay memory D to capacity N
- Initialize action-value function Q with random weights θ
- Initialize target action-value function Q̂ with weights θ⁻ = θ
- For each episode:
  - Initialize state s₁
  - For each step of the episode:
    - With probability ε select a random action aₜ; otherwise select aₜ = argmax_a Q(sₜ, a; θ)
    - Execute action aₜ in the environment and observe reward rₜ and next state sₜ₊₁
    - Store transition (sₜ, aₜ, rₜ, sₜ₊₁) in replay memory D
    - Sample a random mini-batch of transitions from D
    - Set yⱼ = rⱼ if the episode terminates at step j+1; otherwise yⱼ = rⱼ + γ max_a′ Q̂(sⱼ₊₁, a′; θ⁻)
    - Perform a gradient descent step on (yⱼ − Q(sⱼ, aⱼ; θ))² with respect to θ
    - Every C steps, update the target network parameters: θ⁻ = θ
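The loop below is a sketch of how these steps might be wired together, reusing the QNetwork, ReplayBuffer, and epsilon_greedy helpers from the earlier snippet. It assumes a Gymnasium-style environment (reset() returning (state, info), step() returning (state, reward, terminated, truncated, info)), and the hyperparameter values are illustrative defaults, not the ones from the original paper.

```python
import torch
import torch.nn as nn


def train_dqn(env, state_dim, n_actions, episodes=500, capacity=50_000,
              batch_size=64, gamma=0.99, lr=1e-3, target_update_every=1_000,
              eps_start=1.0, eps_end=0.05, eps_decay=0.995):
    q_net = QNetwork(state_dim, n_actions)          # online network, weights θ
    target_net = QNetwork(state_dim, n_actions)     # target network, weights θ⁻
    target_net.load_state_dict(q_net.state_dict())  # θ⁻ = θ
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    buffer = ReplayBuffer(capacity)
    epsilon, step_count = eps_start, 0

    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # ε-greedy action selection.
            action = epsilon_greedy(
                q_net, torch.as_tensor(state, dtype=torch.float32), epsilon, n_actions)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            buffer.push(state, action, reward, next_state, float(terminated))
            state = next_state
            step_count += 1

            if len(buffer) >= batch_size:
                s, a, r, s2, d = buffer.sample(batch_size)
                # yⱼ = rⱼ on terminal steps, else rⱼ + γ max_a' Q̂(sⱼ₊₁, a'; θ⁻).
                with torch.no_grad():
                    y = r + gamma * (1.0 - d) * target_net(s2).max(dim=1).values
                # Gradient descent step on (yⱼ − Q(sⱼ, aⱼ; θ))².
                q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
                loss = nn.functional.mse_loss(q_sa, y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            # Every C steps, copy the online weights into the target network (θ⁻ = θ).
            if step_count % target_update_every == 0:
                target_net.load_state_dict(q_net.state_dict())

        epsilon = max(eps_end, epsilon * eps_decay)  # decay exploration over time
    return q_net
```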
Improvements and Variations:
- Double DQN: Addresses the overestimation bias in Q-learning by using the online network to select actions and the target network to evaluate them (see the sketch after this list).
- Dueling DQN: Separates the value and advantage functions, allowing the network to learn which states are valuable without having to learn the effect of each action.
- Prioritized Experience Replay: Samples transitions with higher expected learning progress more frequently.
- Noisy DQN: Uses noisy linear layers for directed exploration instead of epsilon-greedy.
- Rainbow DQN: Combines multiple improvements for state-of-the-art performance.
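To make the first item concrete, the sketch below shows how the Double DQN target replaces the standard DQN target: the online network picks the greedy next action and the target network evaluates it. The q_net and target_net arguments are assumed to be the same (hypothetical) networks used in the training loop above.

```python
import torch


def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: y = r + γ · Q̂(s', argmax_a' Q(s', a'; θ); θ⁻)."""
    with torch.no_grad():
        # The online network selects the greedy next action ...
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # ... and the target network evaluates it, which reduces overestimation bias.
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q
```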
Advantages over traditional, tabular Q-learning:
- Can handle high-dimensional state spaces (e.g., pixels from a game screen)
- Better generalization to unseen states through function approximation
- Experience replay improves data efficiency and breaks correlations
- More stable learning through target networks
- Ability to learn complex strategies in challenging environments