Policy Gradient Methods are a class of reinforcement learning algorithms that directly optimize the policy function by performing gradient ascent on the expected return. Unlike value-based methods (like Q-learning), policy gradient methods directly parameterize and learn the policy, making them well-suited for continuous action spaces and stochastic policies.
Key Algorithms:
- REINFORCE (Monte Carlo Policy Gradient): The most basic policy gradient algorithm, which uses complete episode returns to update the policy. The core update is based on the likelihood-ratio policy gradient:
∇θJ(θ) = Eτ~p(τ|θ)[∑t ∇θ log πθ(at|st) · Rt]
where τ is a trajectory, θ are the policy parameters, πθ is the policy, and Rt is the discounted return from time step t onward.
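As a minimal sketch of a single REINFORCE update, assuming PyTorch as the framework (the network sizes, discount factor, and random stand-in trajectory below are illustrative, not prescribed by the method):

```python
import torch
import torch.nn as nn

# Hypothetical discrete-action policy: 4-dimensional states, 2 actions
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

# Random tensors stand in for one collected episode of states, actions, rewards
states = torch.randn(20, 4)
actions = torch.randint(0, 2, (20,))
rewards = torch.randn(20)

# Discounted return from each time step onward: R_t = sum_{k>=t} gamma^(k-t) * r_k
returns = torch.zeros(20)
running = 0.0
for t in reversed(range(20)):
    running = rewards[t].item() + gamma * running
    returns[t] = running

# Likelihood-ratio gradient: ascend on sum_t log pi_theta(a_t|s_t) * R_t
log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
loss = -(log_probs * returns).mean()  # negated because the optimizer minimizes

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice the returns are often standardized to zero mean and unit variance before the update, which reduces the variance of the gradient estimate.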
- Actor-Critic: Combines the policy gradient with value function approximation to reduce variance. The critic (value network) estimates the value function, while the actor (policy network) learns the policy using advantage estimates:
∇θJ(θ) = Eτ~p(τ|θ)[∑t ∇θ log πθ(at|st) · At]
where At is the advantage estimate, commonly the one-step TD error rt + γV(st+1) - V(st).
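A one-step actor-critic update might look like the following sketch, again assuming PyTorch; the separate actor and critic networks, their sizes, and the single random transition are illustrative assumptions:

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # policy logits
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # V(s)
params = list(actor.parameters()) + list(critic.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
gamma = 0.99

# One (s, a, r, s') transition; random data stands in for environment interaction
s, s_next = torch.randn(1, 4), torch.randn(1, 4)
a, r = torch.tensor([0]), torch.tensor([1.0])

v = critic(s).squeeze(-1)
v_next = critic(s_next).squeeze(-1).detach()  # no gradient through the bootstrap target

# Advantage estimate A_t = r_t + gamma * V(s_{t+1}) - V(s_t), i.e. the TD error
advantage = r + gamma * v_next - v

log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
actor_loss = -(log_prob * advantage.detach()).mean()  # actor treats the advantage as a constant
critic_loss = advantage.pow(2).mean()                 # regresses V(s) toward the TD target

optimizer.zero_grad()
(actor_loss + critic_loss).backward()
optimizer.step()
```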
- Proximal Policy Optimization (PPO): A family of policy gradient methods that use a clipped objective function to ensure policy updates aren't too large, improving stability:
LCLIP(θ) = Et[min(rt(θ)At, clip(rt(θ), 1-ε, 1+ε)At)]
where rt(θ) = πθ(at|st) / πθold(at|st) is the probability ratio between the new and old policies, and ε is the clipping threshold.
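The clipped loss itself is short; the sketch below assumes PyTorch, and the log-probabilities and advantages would come from the current policy, the data-collecting (old) policy, and an advantage estimator, with random tensors used here as placeholders:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Take the pessimistic (minimum) term; negate because optimizers minimize
    return -torch.min(unclipped, clipped).mean()

# Illustrative call on a random batch of 64 samples
loss = ppo_clip_loss(torch.randn(64), torch.randn(64), torch.randn(64))
```

Because the ratio is clipped to [1-ε, 1+ε], the objective offers no incentive to push the policy beyond that range within one round of updates, which is what keeps the updates conservative.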
Key Components:
- Policy Network (Actor): Neural network that maps states to action probabilities, defining the agent's behavior
- Value Network (Critic): In actor-critic methods, estimates the value function to reduce variance in policy updates
- Trajectory Collection: Sample actions from the policy and collect state-action-reward sequences
- Return Calculation: Compute discounted returns for each step in the trajectory (see the sketch after this list)
- Policy Gradient Update: Update policy parameters to increase probability of actions that led to high returns
- Entropy Regularization: Encourage exploration by adding an entropy bonus to the objective (also sketched after this list)
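The return calculation and entropy bonus above can be made concrete with a short sketch (PyTorch assumed; the discount factor of 0.99 and the entropy coefficient of 0.01 are illustrative defaults, not values prescribed by the text):

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each step t."""
    returns = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t].item() + gamma * running
        returns[t] = running
    return returns

def entropy_bonus(logits, coef=0.01):
    """Mean policy entropy, scaled; subtract it from the loss to encourage exploration."""
    return coef * torch.distributions.Categorical(logits=logits).entropy().mean()

returns = discounted_returns(torch.tensor([1.0, 0.0, 0.0, 1.0]))  # -> [1.9703, 0.9801, 0.99, 1.0]
bonus = entropy_bonus(torch.randn(32, 2))  # logits for a batch of 32 states, 2 actions
```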
Advantages of Policy Gradient Methods:
- Naturally handle continuous action spaces
- Can learn stochastic policies, which are important for exploration and games with hidden information
- Small parameter changes produce small policy changes, which tends to make updates more stable than in value-based methods, where a slight shift in value estimates can flip the greedy action
- Can directly optimize for the objective of interest
- Gradient ascent on the expected return converges to a local optimum under standard assumptions, so convergence properties are comparatively well understood theoretically
Challenges and Solutions:
- High Variance: REINFORCE suffers from high variance in gradient estimates. Actor-critic methods reduce variance by using a learned value function as a baseline and by bootstrapping (a small numerical illustration follows this list).
- Sample Efficiency: Policy gradient methods can be sample-inefficient. Solutions include importance sampling and off-policy learning.
- Step Size Selection: Policy updates can be sensitive to step size. Trust-region-style methods, and the PPO clipped objective above, constrain how much the policy can change in a single update.
- Credit Assignment: It can be difficult to attribute rewards to specific actions. Techniques like reward shaping and advantage functions help.
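To make the first challenge concrete, the small synthetic example below (PyTorch assumed) compares per-sample gradient terms with and without a baseline subtracted from the return. In real training the score terms and returns are correlated and the baseline must not depend on the action for the estimate to remain unbiased; this sketch only isolates the effect of removing a large shared offset from the returns:

```python
import torch

torch.manual_seed(0)
n = 10_000
score = torch.randn(n)           # stand-in for the zero-mean per-sample score terms (grad log-prob)
returns = 5.0 + torch.randn(n)   # returns sharing a large common offset
baseline = returns.mean()        # crude baseline; an actor-critic's V(s) plays this role

plain = score * returns                    # REINFORCE-style per-sample terms
baselined = score * (returns - baseline)   # baseline-subtracted terms

print("means (both near zero here):", plain.mean().item(), baselined.mean().item())
print("variance without baseline:", plain.var().item())      # about 26 with these numbers
print("variance with baseline:   ", baselined.var().item())  # about 1
```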