Policy Gradient Reinforcement Learning

Grid World Environment

Interactive view of an agent learning to navigate the grid to reach the goal (legend: Agent, Goal, Trap, Wall, Action Probability).

Neural Network Models

Live views of the Policy Network (Actor), the Value Network (Critic), and the current Action Probabilities.

Training Metrics

Plots of Episode Rewards and of Losses & Entropy over training.

Controls

Panels for Environment Setup, Algorithm selection, Policy Network Parameters, Value Network Parameters, Training Parameters, and Visualization Options.
Status

Readouts of the current episode, total episodes, total steps, policy loss, value loss, entropy, episode reward, and training status, plus Trajectory Information for the current episode.

How Policy Gradient Methods Work

Policy gradient methods are a class of reinforcement learning algorithms that optimize the policy directly by performing gradient ascent on the expected return. Unlike value-based methods such as Q-learning, they parameterize the policy itself and learn its parameters, which makes them well-suited to continuous action spaces and stochastic policies.

Key Algorithms:

  • REINFORCE (Monte Carlo Policy Gradient): The most basic policy gradient algorithm; it uses complete episode returns to update the policy. The core update is the likelihood-ratio policy gradient (a minimal code sketch follows this list):
    ∇_θ J(θ) = E_{τ~p(τ|θ)}[ ∑_t ∇_θ log π_θ(a_t|s_t) · R_t ]
    where τ is a trajectory, θ are the policy parameters, π_θ is the policy, and R_t is the discounted return from time t onward.
  • Actor-Critic: Combines the policy gradient with value-function approximation to reduce variance. The critic (value network) estimates the value function, while the actor (policy network) learns the policy using advantage estimates (see the training-loop sketch after the Key Components list):
    ∇_θ J(θ) = E_{τ~p(τ|θ)}[ ∑_t ∇_θ log π_θ(a_t|s_t) · A_t ]
    where A_t is the advantage estimate, typically computed as r_t + γV(s_{t+1}) - V(s_t).
  • Proximal Policy Optimization (PPO): A family of policy gradient methods that use a clipped objective to keep policy updates from becoming too large, improving stability (see the clipped-objective sketch at the end of this section):
    L^CLIP(θ) = E_t[ min( r_t(θ) A_t, clip(r_t(θ), 1-ε, 1+ε) A_t ) ]
    where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) is the probability ratio between the new and old policies.
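
A minimal sketch of the REINFORCE update above, assuming PyTorch. Here log_probs would be the list of log π_θ(a_t|s_t) values recorded while sampling one episode and rewards the per-step rewards; the function names are illustrative, not part of the demo's code.

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    # R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..., computed backwards in one pass.
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    return torch.tensor(returns)

def reinforce_loss(log_probs, rewards, gamma=0.99):
    # Negative of the REINFORCE objective: minimizing this ascends E[sum_t log pi_theta(a_t|s_t) * R_t].
    returns = discounted_returns(rewards, gamma)
    # Normalizing the returns is a common (optional) variance-reduction trick.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return -(torch.stack(log_probs) * returns).sum()
```

Calling backward() on this loss and stepping an optimizer over the policy parameters performs one Monte Carlo policy-gradient update.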

Key Components:

  1. Policy Network (Actor): Neural network that maps states to action probabilities, defining the agent's behavior
  2. Value Network (Critic): In actor-critic methods, estimates the value function to reduce variance in policy updates
  3. Trajectory Collection: Sample actions from the policy and collect state-action-reward sequences
  4. Return Calculation: Compute discounted returns for each step in the trajectory
  5. Policy Gradient Update: Update policy parameters to increase probability of actions that led to high returns
  6. Entropy Regularization: Encourage exploration by adding an entropy bonus to the objective
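
The six components above map directly onto a short actor-critic training loop. The sketch below assumes PyTorch and a hypothetical grid-world environment whose reset() and step(action) return a state feature vector (e.g., a one-hot cell encoding), a reward, and a done flag; network sizes and hyperparameter values are illustrative, not the demo's actual configuration.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

n_states, n_actions = 25, 4            # assumed 5x5 grid with 4 moves
gamma, entropy_coef = 0.99, 0.01       # illustrative values

policy_net = nn.Sequential(nn.Linear(n_states, 64), nn.Tanh(), nn.Linear(64, n_actions))  # actor
value_net = nn.Sequential(nn.Linear(n_states, 64), nn.Tanh(), nn.Linear(64, 1))           # critic
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=5e-3)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def train_episode(env):
    # 1-3. Trajectory collection: sample actions from the policy, record log-probs, values, rewards.
    log_probs, values, rewards, entropies = [], [], [], []
    state, done = env.reset(), False
    while not done:
        s = torch.as_tensor(state, dtype=torch.float32)
        dist = Categorical(logits=policy_net(s))
        action = dist.sample()
        state, reward, done = env.step(action.item())   # assumed environment interface
        log_probs.append(dist.log_prob(action))
        entropies.append(dist.entropy())
        values.append(value_net(s).squeeze())
        rewards.append(reward)

    # 4. Return calculation: discounted returns R_t for each step.
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.tensor(returns)

    # 5-6. Advantage-weighted policy gradient step with an entropy bonus; MSE step for the critic.
    values = torch.stack(values)
    advantages = returns - values.detach()
    policy_loss = -(torch.stack(log_probs) * advantages).sum() - entropy_coef * torch.stack(entropies).sum()
    value_loss = (returns - values).pow(2).mean()

    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
```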

Advantages of Policy Gradient Methods:

  • Naturally handle continuous action spaces
  • Can learn stochastic policies, which are important for exploration and games with hidden information
  • Policy updates change action probabilities incrementally, so training tends to be stable when the step size is controlled (e.g., with trust regions or clipping)
  • Can directly optimize for the objective of interest
  • Convergence to a local optimum of the expected return is comparatively well understood theoretically

Challenges and Solutions:

  • High Variance: REINFORCE suffers from high variance in gradient estimates. Actor-critic methods reduce variance through value function bootstrapping.
  • Sample Efficiency: Policy gradient methods can be sample-inefficient. Solutions include importance sampling and off-policy learning.
  • Step Size Selection: The policy update can be sensitive to step size. Trust region methods like PPO constrain update sizes.
  • Credit Assignment: It can be difficult to attribute rewards to specific actions. Techniques like reward shaping and advantage functions help.
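
To make the step-size constraint concrete, here is a minimal sketch of the PPO clipped surrogate from the Key Algorithms list, assuming PyTorch; the tensors of new and old action log-probabilities and the advantages would come from a collected batch, and all names are illustrative.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    # r_t(theta) = pi_new(a_t|s_t) / pi_old(a_t|s_t), computed from log-probabilities.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Taking the elementwise minimum removes any incentive to push the ratio outside [1-eps, 1+eps].
    return -torch.min(unclipped, clipped).mean()
```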