Policy Iteration

Dynamic programming algorithms that are designed to solve a Markov decision process are iterative algorithms, which consist of two basic steps: policy evaluation and policy improvement. The purpose of policy evaluation is to measure the performance of a given policy $\pi$ by estimating the corresponding value function $v_\pi$. Policy improvement, on the other hand, generates a new policy that is better (or at least not worse) than the previous one. The output of policy evaluation is used as an input into policy improvement and vice versa. The iterative process of evaluation and improvement produces value and policy functions that converge towards the optimal policy $\pi_*$ and the optimal value function $v_*$ over time. The policy iteration algorithm that is covered in this section is one such iterative algorithm.

Policy Evaluation

The goal of policy evaluation is to find the true value function $v_\pi$ of the policy $\pi$.

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \bigl[ r(s, a, s') + \gamma v_\pi(s') \bigr]$$

Often it is more convenient to use the joint probability of simultaneously getting the reward $r$ and the next state $s'$, given the current state $s$ and the action $a$. This joint probability function is written as $p(s', r \mid s, a)$. The notation is more compact and is likely to make the transition from theory to practice easier.

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_\pi(s') \bigr]$$
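To make this notation concrete, the snippet below shows one possible way to store the joint model $p(s', r \mid s, a)$ in code: for every state-action pair we keep a list of (probability, next state, reward) triples. The state names, actions and numbers here are purely illustrative.

```python
# One possible representation of the joint model p(s', r | s, a):
# for every state-action pair, a list of (probability, next_state, reward)
# triples. The states, actions and numbers are purely illustrative.
P = {
    "s0": {
        "left":  [(1.0, "s0", 0.0)],                      # bump into a wall, stay in place
        "right": [(0.8, "s1", -1.0), (0.2, "s0", -1.0)],  # stochastic transition
    },
}

# Expected immediate reward of taking "right" in "s0": sum over s', r of p(s', r | s0, right) * r
expected_reward = sum(p * r for p, s_next, r in P["s0"]["right"])
print(expected_reward)  # -1.0
```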

If we look closely at the Bellman equation, we can observe that it basically consists of two sides: the left side and the right side.

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_\pi(s') \bigr]$$

The left side is the function that returns the value of a state $s$; it is the mapping from states to values. The left side is essentially the function we are trying to find. The right side is the definition of the value function, which is based on the expectation of the returns and is expressed using the Bellman equation.

When we initialize the policy evaluation algorithm, the first step is to generate a value function that serves as a starting estimate to be improved in the iterative process. The initial values are set either randomly or to zero. When we start to use the above equation, we will unsurprisingly discover that the random/zero value (the left side of the equation) and the expected value of the reward plus the value of the next state (the right side of the equation) diverge quite a lot. The goal of the policy evaluation algorithm is to make the left and the right side of the equation exactly equal. That is done in an iterative process where at each step the difference between both sides is reduced. In practice we do not expect the difference between the two to go all the way down to zero. Instead we define a threshold value. For example, a threshold value of 0.0001 indicates that we can interrupt the iterative process as soon as the difference between the left and the right side of the equation is below the threshold for all states $s$.
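As a small illustration of this stopping criterion, the helper below compares two successive value estimates and reports whether they are close enough; the function name and the default threshold are assumptions made for this sketch.

```python
def close_enough(v_old, v_new, theta=1e-4):
    """Return True once the left and the right side of the Bellman equation
    (i.e. two successive value estimates) differ by less than theta in every state."""
    return max(abs(v_new[s] - v_old[s]) for s in v_old) < theta
```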

The policy evaluation algorithm is relatively straightforward. All we need to do is turn the definition of the Bellman equation into an update rule.

$$V_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma V_k(s') \bigr]$$

Above we use $V_k$ instead of $v_\pi$. This notational difference is to show that $v_\pi$ is the true value function of the policy $\pi$, while $V_k$ is its estimate. At each iteration step $k$ the left side of the equation (the estimate of the value function) is replaced by the right side of the equation. At this point it should become apparent why the Bellman equation is useful. Only the reward from the next time step is required to improve the approximation, because all subsequent rewards are already condensed into the value function of the next time step. That allows the algorithm to use the model to look only one step into the future for the reward and use the approximated value function for the next time step. By repeating the update rule over and over again, the rewards get embedded into the value function and the approximation gets better and better.

The process of using past estimates to improve current estimates is called bootstrapping. Bootstrapping is used heavily throughout reinforcement learning and can generally be used without full knowledge of the model of the environment.

Below you can find a Python implementation of the policy evaluation algorithm.
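The function below is a minimal sketch of iterative policy evaluation, assuming the model is given as `P[s][a] = [(probability, next_state, reward), ...]`, the policy as `policy[s] = {action: probability}`, and terminal states as absorbing states with zero reward. The function name, the discount factor and the threshold are illustrative choices.

```python
def policy_evaluation(P, policy, states, gamma=0.9, theta=1e-4):
    """Iterative policy evaluation: sweep over all states and replace the left
    side of the Bellman equation by its right side until both sides (almost) agree."""
    V = {s: 0.0 for s in states}  # arbitrary initialization (here: zeros)
    while True:
        delta = 0.0
        for s in states:
            # right side of the Bellman equation under the current estimate V
            new_v = sum(
                pi_a * sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                for a, pi_a in policy[s].items()
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v  # in-place update: bootstrap from the current estimate
        if delta < theta:  # largest per-state change is below the threshold
            return V
```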

Once again below we deal with a simple grid world where the task is to arrive at the bottom left corner starting from the top left corner. The environment transitions with 50% probability into the desired direction (unless there is some barrier) and with 50% chance the environment takes a random action. The playground below allows you to calculate the value function for a randomly initialized deterministic policy, using the policy evaluation algorithm. You can switch between the display of the policy and the value function. Start by taking one step of the algorithm at a time and observe how the value function propagates and the difference between steps keeps decreasing. Finally you can run the full policy evaluation algorithm, where the iterative process keeps going until the difference between the left and the right side of the Bellman equation is less than 0.00001.
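For readers without access to the interactive playground, the sketch below builds a comparable grid world as a transition table. The grid size, the step reward of -1 and the absence of interior barriers are assumptions made purely for illustration.

```python
def build_gridworld(rows=4, cols=4):
    """A grid world in the spirit of the playground: start at the top left,
    reach the bottom left; each action goes in the desired direction with
    probability 0.5 and in a uniformly random direction otherwise."""
    UP, RIGHT, DOWN, LEFT = range(4)
    moves = {UP: (-1, 0), RIGHT: (0, 1), DOWN: (1, 0), LEFT: (0, -1)}
    goal = (rows - 1, 0)  # bottom left corner

    def step(state, direction):
        r, c = state
        dr, dc = moves[direction]
        nr, nc = r + dr, c + dc
        # moves that would leave the grid keep the agent in place
        return (nr, nc) if 0 <= nr < rows and 0 <= nc < cols else state

    states = [(r, c) for r in range(rows) for c in range(cols)]
    P = {}
    for s in states:
        P[s] = {}
        for a in moves:
            if s == goal:  # absorbing terminal state with zero reward
                P[s][a] = [(1.0, s, 0.0)]
                continue
            outcomes = {}
            outcomes[step(s, a)] = outcomes.get(step(s, a), 0.0) + 0.5  # intended move
            for other in moves:  # 50% chance of a uniformly random move
                s_next = step(s, other)
                outcomes[s_next] = outcomes.get(s_next, 0.0) + 0.5 / len(moves)
            P[s][a] = [(p, s_next, -1.0) for s_next, p in outcomes.items()]
    return states, P, goal


# Evaluate a uniformly random policy on this grid world
states, P, goal = build_gridworld()
uniform = {s: {a: 0.25 for a in range(4)} for s in states}
V = policy_evaluation(P, uniform, states)
```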

Policy Improvement

The goal of policy improvement is to create a new and improved policy using the value function $v_\pi$ from the previous policy evaluation step.

Let us assume for simplicity that the agent follows a deterministic policy $\pi$, but in the current state $s$ the agent contemplates picking an action $a$ that contradicts the policy, so that $a \neq \pi(s)$. After that action the agent will stick to the old policy $\pi$ and follow it until the terminal state. We can measure the value of taking the action $a$ in the state $s$ and then following the policy $\pi$ using the action-value function.

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_\pi(s') \bigr]$$
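As a sketch, this action value can be computed directly from the model and an estimated value function; the representation of `P` follows the earlier snippets and the helper name is illustrative.

```python
def q_from_v(P, V, s, a, gamma=0.9):
    """q_pi(s, a): expected immediate reward plus the discounted value of the
    successor state, using the model P[s][a] = [(prob, next_state, reward), ...]."""
    return sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
```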

What if the agent compares the estimates $q_\pi(s, a)$ and $v_\pi(s)$ and determines that taking some action $a$ and then following $\pi$ is of higher value than strictly following $\pi$, i.e. $q_\pi(s, a) > v_\pi(s)$? Does that imply that the agent should change the policy and always take the action $a$ when facing the state $s$? Does the short-term gain from the new action $a$ justify changing the policy? It turns out that this is exactly the case.

In the policy improvement step we create a new policy $\pi'$ where the agent chooses the greedy action in each state $s$.

$$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_\pi(s') \bigr]$$

Below you can find a Python example of the policy improvement step.
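The function below is a minimal sketch of the greedy policy improvement step, under the same assumptions about `P` and the policy representation as in the earlier snippets.

```python
def policy_improvement(P, V, states, actions, gamma=0.9):
    """Build a new policy that is greedy with respect to the value estimate V."""
    new_policy = {}
    for s in states:
        # q_pi(s, a) for every available action, computed from the model and V
        q = {a: sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
             for a in actions}
        best = max(q, key=q.get)  # greedy action
        # deterministic policy: all probability mass on the greedy action
        new_policy[s] = {a: (1.0 if a == best else 0.0) for a in actions}
    return new_policy
```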

The Policy Iteration Algorithm

The idea of policy iteration is to alternate between policy evaluation and policy improvement until the optimal policy has been reached. Once the new policy and the old policy are exactly the same, we have found the optimal policy.
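Putting the two steps together might look like the following sketch, which reuses the `policy_evaluation` and `policy_improvement` functions from the earlier snippets; the initial policy and the hyperparameters are again illustrative.

```python
def policy_iteration(P, states, actions, gamma=0.9, theta=1e-4):
    """Alternate policy evaluation and policy improvement until the policy is stable."""
    # start from an arbitrary deterministic policy: always pick the first action
    policy = {s: {a: (1.0 if a == actions[0] else 0.0) for a in actions} for s in states}
    while True:
        V = policy_evaluation(P, policy, states, gamma, theta)
        new_policy = policy_improvement(P, V, states, actions, gamma)
        if new_policy == policy:  # old and new policy agree -> optimal policy
            return policy, V
        policy = new_policy


# Run policy iteration on the grid world from above
states, P, goal = build_gridworld()
optimal_policy, optimal_V = policy_iteration(P, states, actions=list(range(4)))
```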

Below is a playground with the same gridworld that demonstrates the policy iteration algorithm. Once you click on the "policy iteration" button, the algorithm finds the optimal policy and the corresponding optimal value function.