Policy Gradient - REINFORCE

In this article, we will talk about the foundation Policy Gradient algorithm: REINFORCE.

The underlying idea here is to reinforce actions which result in higher cumulative rewards. In simple words, the algorithm increases the probability of action if it results in more rewards.

As in below figure,

  • \(a_0\) with probability(\(p_0\)): 0.5 results in cumulative reward of 100.
  • \(a_1\) with probability(\(p_1\)): 0.5 results in cumulative reward of 50.

Hence, increasing \(p_0\) will result in higher rewards on average.

Reinforce

REINFORCE works on this simple idea. Ofcourse, the internal mathematics and technicalities are more complex than above diagram.

Digging Deeper - Mathematically

We would like to learn a policy that can take best actions without consulting the value functions. It becomes a conventional optimization problem as below:

Objective: \({max} {J}(\theta)\), where \({J} {, } \theta\) refer to performance measure and policy parameters respectively.

Using stochastic gradient ascent, we update \(\theta\) in the direction of \(\nabla_\theta{J(\theta)}\).
\(\theta_{t+1} = \theta_{t} + \alpha.\mathbb{E}[\nabla_\theta{J(\theta)}]\)

It turns out that computing gradient of performance is further simplified to:

\(\nabla_\theta{J}(\theta) \propto \mathbb{E_\pi}[{G_t}.\nabla_\theta{log}\pi_\theta(s/a)]\)
where \(G_t\) is the discounted return

Hence, \(\theta\) udpate is re-written as : \(\theta_{t+1} = \theta_{t} + \alpha.G_t.\nabla_\theta{log}\pi_\theta(s/a)\)

Key takeaways from above equation:

  1. \(\nabla_\theta{J}\) is independent of states distribution.
  2. \(\theta\) is updated in proportion to return. If higher return is achieved by taking a particular action, then the probability of taking that action rises in proportion.

Quick peek into the Algo

Before we delve into the detailed algorithm, here are few tacts related to REINFORCE:

Here is a snippet of psuedo code taken from Sutton and Barto’s RL Book

Reinforce Algo

A rough REINFORCE flow looks like:

Reinforce Algo

Implementation

There are many implementations available on the web, and I will not write another one here. The idea wanted to highlight is certain nuances while implementing this algorithm. So, sip your coffee slowly now!

  • Policy Network:
    • Input – state vector
    • Method 1: Update the target values in policy network
      • Predicted Value – Network’s predicted log probabilities (since softmax is the output layer)
      • Target Value –
        • Since, the probability of a promising action should be strengthened, we can use an updated target value
        • \(\pi(s/a) : \pi(s/a) + \beta.\nabla_\theta{J}\), which further reduces to
        • \(\pi(s/a) : \pi(s/a) + \beta.G.\nabla_\theta{log}\pi_\theta(s/a)\).
      • Loss function – Cross entropy.
        • Cross Entropy Loss : \(L = \sum(y_i.{log p_i})\)
        • Our Policy Gradient also seeks for log of probability

            # p_i --> action probability
            # y_i --> 1 for chosen action and 0 for not chosen action
            # beta --> learning rate for policy network
            # G --> discounted Rewards
          
            gradient = -(p_i - y_i)           # this is the gradient of cross entropy loss
          
            # update the target value by gradient
            y_target = p_i +  beta * G * gradient        
          
    • Method 2: Update policy network parameter \(\theta\) directly using tensorflow eager execution
      • Custom Loss Function –
          losses = []         # accumulate losses 
        
          # start eager execution
          with tf.GradientTape() as tape:
              for index, sample in enumerate(self.memory):
                  # sample is a tuple of current state, actions, rewards, next states
        
                  # get current state in the memory variable
                  state = tf.Variable([sample[0]], trainable=True, dtype=tf.float64)
        
                  # given current state, get prob of actions using policy network
                  prob = self.PolicyNetwork.model(state, training = True)
        
                  # actual action taken, and taking the log of probability of actual action taken
                  action = sample[1]              
                  actionProb = prob[0, action]
                  logProb = tf.math.log(actionProb)
        
                  # loss of this sample is defined as G*log(p) --> this is as per policy gradient
                  sampleloss = logProb * discounted_rewards[index][0]
        
                  losses.append(-sampleloss)      # this is negative, since we are interested in gradient ascent
        
                    
              networkLoss = sum(losses)
        
              # Back propagation
              # compute the gradient of network loss wrt network parameters, and optimize the network
              grads = tape.gradient(networkLoss, self.PolicyNetwork.model.trainable_variables)
              self.PolicyNetwork.optimizer.apply_gradients(zip(grads, self.PolicyNetwork.model.trainable_variables))
        
  • Training the agent:
    • Since this is Monte carlo method, agent can only learn once the entire episode has ended

NOTE: Please reach out to me directly for detailed code

Challenges & Potential Solutions

Being a basic algorithm, REINFORCE is bound to have certain challenges of its own.

High Variance and slow learning

Problem

  • With cumulative rewards varying a lot every episode, the overall lerning becomes a bit noisy with high variance.
  • A better way to reduce the noise is to normalize the rewards used. This is what we have now
    \(\theta_{t+1} = \theta_{t} + \alpha.G_t.\nabla_\theta{log}\pi_\theta(s/a)\)

Solution

  • Reinforce with Baseline – > If we amend the reward by a function which is independent of action taken, the overall learning becomes faster.
    \(\theta_{t+1} = \theta_{t} + \alpha.(G_t - b(s_t)).\nabla_\theta{log}\pi_\theta(s/a)\)
  • How do we choose baseline:
    1. Normalize Cumulative rewards: \(G_t - b(s_t) = (G_t - \mu(G_t))/\sigma(G_t))\)
    2. Use Value function as baseline. \(G_t - b(s_t) = G_t - V(s_t)\)

Baseline comparison

Episodic Learning

Problem

  • Being a Monte carlo method, knowledge of entire episode is required.
  • Becomes impractical in more real-life scenarios when you dont know what may happen in the environment

Solution

  • We need to move towards Temporal Difference (TD) methods; which we will learn more about in next articles

Conclusion

Here, we learnt about the most basic Policy Gradient alogrithm. Ofcourse, this is just to do some warm up while we work out further deep into PG techniques.


Appendix

References

  1. Reinforcement Learning, An Introduction (Second Edition). Sutton and Barto
  2. Karpathy Github page
  3. Lilianweng Github Page

© 2025. All rights reserved.