Value Iteration

Recall that policy iteration consists of two distinct steps: policy evaluation and policy improvement. Policy improvement is a single step, in which the new policy is derived by acting greedily with respect to the current value function. Policy evaluation, on the other hand, is a longer iterative process. It turns out that it is not necessary to wait for policy evaluation to fully converge to the true value function. In fact, the value iteration algorithm works with just a single policy evaluation step.

The main goal of value iteration is to find the optimal value function $v_*$, which can then be used to derive the optimal policy. The optimal value function can be expressed as a Bellman equation that looks as follows.

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$$

Value iteration is essentially the Bellman optimality equation transformed into an iterative update rule.

$$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right]$$

Although the update looks like a single operation at first glance, it actually combines truncated (one step) policy evaluation and policy improvement into a single step.

$$q_{k+1}(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right]$$

$$v_{k+1}(s) = \max_a q_{k+1}(s, a)$$

In the first step, the action-value function is calculated from the old state-value function and the model of the Markov decision process. In the second step, the maximum over the action-value function is taken to generate the new state-value function. This implicitly produces a new policy, since a value function is always calculated with respect to a particular policy.
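As a minimal sketch of this two-step view (the model format used here, a dictionary mapping (state, action) pairs to lists of (probability, next state, reward) tuples, is an assumption and not the representation used in the implementation further below), a single backup for one state could look like this:

```python
def one_step_backup(state, actions, v, model, gamma=0.9):
    """Single value-iteration backup for one state.

    `model[(state, action)]` is assumed to be a list of
    (probability, next_state, reward) tuples -- an illustrative
    format, not taken from the article's implementation.
    """
    # Step 1: truncated policy evaluation -- compute q(s, a) for every
    # action from the old state-value function and the MDP model.
    q = {
        a: sum(p * (r + gamma * v[s_next]) for p, s_next, r in model[(state, a)])
        for a in actions
    }
    # Step 2: policy improvement -- take the max over the action values
    # to obtain the new state value (implicitly acting greedily).
    return max(q.values())


# Tiny hypothetical two-state example, for illustration only.
model = {
    (0, "left"):  [(1.0, 0, 0.0)],
    (0, "right"): [(1.0, 1, 1.0)],
}
v = {0: 0.0, 1: 0.0}
print(one_step_backup(0, ["left", "right"], v, model))  # 1.0
```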

The combination of both steps is the value iteration algorithm. The iterative process continues until the difference between the old and the new state-value function is smaller than some small threshold $\theta$. As a final step, the optimal policy can be deduced by always selecting the greedy action with respect to the converged value function.
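Written out, a common form of the stopping criterion (using the largest change over all states, which is one standard choice rather than something prescribed above) and of the final greedy policy extraction is:

$$\max_s \left| v_{k+1}(s) - v_k(s) \right| < \theta$$

$$\pi_*(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$$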

Below is the Python implementation of the value iteration algorithm. Compared to policy iteration, the implementation is more compact, because policy evaluation and policy improvement can be implemented in a single function.