Policy Gradient - REINFORCE
In this article, we will talk about the foundational Policy Gradient algorithm: REINFORCE.
The underlying idea is to reinforce actions that result in higher cumulative rewards. In simple words, the algorithm increases the probability of an action if taking it leads to higher returns.
As in the figure below:
- \(a_0\) with probability \(p_0 = 0.5\) results in a cumulative reward of 100.
- \(a_1\) with probability \(p_1 = 0.5\) results in a cumulative reward of 50.
Hence, increasing \(p_0\) will result in higher rewards on average.
REINFORCE works on this simple idea. Of course, the internal mathematics and technicalities are more involved than the diagram above suggests.
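To make the intuition concrete, here is a quick back-of-the-envelope check using the numbers from the figure; the shift to a 0.7/0.3 split is purely illustrative, not something the algorithm prescribes:

```python
# Cumulative returns of the two actions in the figure above.
returns = {"a0": 100.0, "a1": 50.0}

# Initial policy: both actions equally likely.
p0, p1 = 0.5, 0.5
print(p0 * returns["a0"] + p1 * returns["a1"])  # expected return = 75.0

# After shifting probability mass towards the better action a0:
p0, p1 = 0.7, 0.3
print(p0 * returns["a0"] + p1 * returns["a1"])  # expected return = 85.0
```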
Digging Deeper - Mathematically
We would like to learn a policy that can take the best actions without consulting value functions. This becomes a conventional optimization problem:
Objective: \(\max_\theta J(\theta)\), where \(J\) and \(\theta\) refer to the performance measure and the policy parameters respectively.
Using stochastic gradient ascent, we update \(\theta\) in the direction of \(\nabla_\theta{J(\theta)}\).
\(\theta_{t+1} = \theta_{t} + \alpha \, \widehat{\nabla_\theta J(\theta_t)}\), where \(\widehat{\nabla_\theta J(\theta_t)}\) is a sample estimate whose expectation approximates the true gradient.
It turns out that the gradient of the performance measure simplifies to:
\(\nabla_\theta J(\theta) \propto \mathbb{E}_\pi\!\left[ G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]\)
where \(G_t\) is the discounted return.
Hence, the \(\theta\) update is re-written as: \(\theta_{t+1} = \theta_{t} + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\)
Key takeaways from above equation:
- \(\nabla_\theta J\) does not require the gradient of the state distribution; this is what the policy gradient theorem buys us.
- \(\theta\) is updated in proportion to the return: if a higher return is achieved by taking a particular action, then the probability of taking that action rises in proportion.
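To see what the update \(\theta_{t+1} = \theta_{t} + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\) looks like in code, here is a minimal NumPy sketch for a linear-softmax policy over a small discrete action space; the parameterisation and names (`theta`, `grad_log_pi`) are my own illustrative choices, not part of the original article:

```python
import numpy as np

n_actions, n_features = 3, 4
rng = np.random.default_rng(0)
theta = rng.normal(size=(n_actions, n_features))  # policy parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_pi(theta, phi, action):
    """Gradient of log pi(a|s) for a linear-softmax policy with state features phi."""
    probs = softmax(theta @ phi)      # pi(. | s)
    grad = -np.outer(probs, phi)      # -pi(a'|s) * phi(s) for every action a'
    grad[action] += phi               # +phi(s) for the action actually taken
    return grad

# One REINFORCE update for a single (state, action, return) sample.
alpha, G_t = 0.01, 42.0
phi, action = rng.normal(size=n_features), 1
theta += alpha * G_t * grad_log_pi(theta, phi, action)
```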
Quick peek into the Algo
Before we delve into the detailed algorithm, here are a few key facts about REINFORCE:
- A Monte Carlo based method, i.e. it needs the full episode before it can learn.
- An on-policy algorithm, i.e. it learns about the same policy that it uses to select actions.
The detailed pseudo code can be found in Sutton and Barto's RL book. A rough REINFORCE flow looks like this:
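Since the pseudo code figure is not reproduced here, the Python-style sketch below paraphrases that flow; `env`, `policy`, `generate_episode`, `alpha`, and `gamma` are placeholders, and the \(\gamma^t\) factor in the book's version is omitted to match the update used in this article:

```python
# Rough REINFORCE flow (placeholder names, not runnable against a real environment).
for episode in range(num_episodes):
    # 1. Generate one full episode by following the current policy pi_theta.
    states, actions, rewards = generate_episode(env, policy)

    # 2. Compute the discounted return G_t for every time step t.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)

    # 3. Update theta in the direction of G_t * grad log pi(a_t | s_t).
    for s, a, G_t in zip(states, actions, returns):
        policy.theta += alpha * G_t * policy.grad_log_prob(s, a)
```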
Implementation
There are many implementations available on the web, and I will not write another one here. What I want to highlight are certain nuances in implementing this algorithm. So, sip your coffee slowly now!
- Policy Network:
- Input – state vector
- Method 1: Update the target values in policy network
- Predicted Value – the network's predicted action probabilities (the output layer is a softmax)
- Target Value –
- Since the probability of a promising action should be strengthened, we can use an updated target value:
- \(\pi(a \mid s) \leftarrow \pi(a \mid s) + \beta \, \nabla_\theta J\), which further reduces to
- \(\pi(a \mid s) \leftarrow \pi(a \mid s) + \beta \, G \, \nabla_\theta \log \pi_\theta(a \mid s)\).
- Loss function – Cross entropy.
- Cross Entropy Loss: \(L = -\sum_i y_i \log p_i\)
Our Policy Gradient update also involves the log of the action probability, which makes cross entropy a natural fit.
```python
# p_i  --> action probability
# y_i  --> 1 for chosen action and 0 for not chosen action
# beta --> learning rate for policy network
# G    --> discounted rewards

gradient = -(p_i - y_i)  # this is the gradient of the cross entropy loss

# update the target value by the gradient
y_target = p_i + beta * G * gradient
```
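Putting Method 1 together, a minimal Keras-style sketch might look as follows; the architecture, dimensions, and names (`policy_model`, `train_on_episode`, `beta`) are illustrative assumptions rather than the article's exact code:

```python
import numpy as np
import tensorflow as tf

state_dim, n_actions = 4, 2  # illustrative dimensions, e.g. a CartPole-like task

# Hypothetical policy network: state vector in, softmax over actions out.
policy_model = tf.keras.Sequential([
    tf.keras.Input(shape=(state_dim,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(n_actions, activation="softmax"),
])
policy_model.compile(optimizer="adam", loss="categorical_crossentropy")

def train_on_episode(states, actions, discounted_rewards, beta=0.01):
    """Method 1: build shifted targets and fit them with cross entropy."""
    states = np.array(states)
    probs = policy_model.predict(states, verbose=0)            # p_i
    onehot = np.eye(n_actions)[actions]                        # y_i
    gradient = -(probs - onehot)                               # gradient of cross entropy loss
    y_target = probs + beta * np.reshape(discounted_rewards, (-1, 1)) * gradient
    policy_model.fit(states, y_target, verbose=0)
```

One caveat of this approach is that entries of the shifted target can drift outside \([0, 1]\) when \(\beta \cdot G\) is large, which is one reason the custom-loss approach of Method 2 is often easier to reason about.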
- Method 2: Update the policy network parameters \(\theta\) directly using TensorFlow eager execution
- Custom Loss Function –
```python
losses = []  # accumulate losses

# start eager execution
with tf.GradientTape() as tape:
    for index, sample in enumerate(self.memory):
        # sample is a tuple of (current state, action, reward, next state)

        # get the current state from the memory variable
        state = tf.Variable([sample[0]], trainable=True, dtype=tf.float64)

        # given the current state, get the action probabilities from the policy network
        prob = self.PolicyNetwork.model(state, training=True)

        # take the log of the probability of the action actually taken
        action = sample[1]
        actionProb = prob[0, action]
        logProb = tf.math.log(actionProb)

        # the loss for this sample is G * log(p), as per the policy gradient
        sampleloss = logProb * discounted_rewards[index][0]
        losses.append(-sampleloss)  # negated, since we are interested in gradient ascent

    networkLoss = sum(losses)

# Back propagation:
# compute the gradient of the network loss w.r.t. the network parameters and optimize
grads = tape.gradient(networkLoss, self.PolicyNetwork.model.trainable_variables)
self.PolicyNetwork.optimizer.apply_gradients(
    zip(grads, self.PolicyNetwork.model.trainable_variables)
)
```
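The snippet above relies on a `discounted_rewards` array that is not shown. Assuming `self.memory` stores one reward per step, a common way to build it is something like this sketch:

```python
import numpy as np

def compute_discounted_rewards(rewards, gamma=0.99):
    """Array of shape (T, 1) with G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    discounted = np.zeros((len(rewards), 1))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        discounted[t][0] = running
    return discounted
```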
- Training the agent:
- Since this is a Monte Carlo method, the agent can only learn once the entire episode has ended
NOTE: Please reach out to me directly for detailed code
Challenges & Potential Solutions
Being a basic algorithm, REINFORCE is bound to have certain challenges of its own.
High Variance and slow learning
Problem
- With cumulative rewards varying a lot from episode to episode, the overall learning becomes noisy, with high variance.
- One way to reduce the noise is to adjust the return used in the update. Recall the current update:
\(\theta_{t+1} = \theta_{t} + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\)
Solution
- REINFORCE with Baseline – if we adjust the return by subtracting a function that is independent of the action taken, the variance drops and the overall learning becomes faster.
\(\theta_{t+1} = \theta_{t} + \alpha \, (G_t - b(s_t)) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\)
- How do we choose the baseline?
- Normalize the cumulative rewards: \(G_t - b(s_t) = (G_t - \mu(G))/\sigma(G)\)
- Use the value function as the baseline: \(G_t - b(s_t) = G_t - V(s_t)\)
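As a rough sketch of these two baseline choices (the `value_model` in the second option is a hypothetical regression network trained towards \(G_t\), not something defined in this article):

```python
import numpy as np

# Option 1: normalise (whiten) the returns within an episode.
def normalized_returns(returns):
    returns = np.asarray(returns, dtype=np.float64)
    return (returns - returns.mean()) / (returns.std() + 1e-8)

# Option 2: subtract a learned state-value estimate V(s_t) from each return.
def advantages(returns, states, value_model):
    baselines = value_model.predict(np.array(states), verbose=0).squeeze()
    return np.asarray(returns) - baselines  # G_t - V(s_t)
```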
Episodic Learning
Problem
- Being a Monte Carlo method, knowledge of the entire episode is required before any update can be made.
- This becomes impractical in many real-life scenarios where you don't know when, or whether, the episode will end.
Solution
- We need to move towards Temporal Difference (TD) methods, which we will learn more about in the next articles.
Conclusion
Here, we learnt about the most basic Policy Gradient algorithm. Of course, this is just a warm-up before we dig deeper into PG techniques.
Appendix
References
- Reinforcement Learning: An Introduction (Second Edition), Sutton and Barto
- Andrej Karpathy's GitHub page
- Lilian Weng's GitHub page