Solution to a Markov Decision Process

The reward hypothesis states that the goal of the agent is encoded in the rewards it receives from the environment. If you want the agent to learn how to play chess, for example, you give out a reward of +1 when the agent wins, a reward of -1 when the agent loses and a reward of 0 whenever there is a draw. The goal of the agent is therefore to collect as many rewards as possible.

We can express the same idea mathematically by defining the goal $G_t$, also called the return, as the sum of rewards from the timestep $t$ to either the terminal time step $T$ if we are dealing with episodic tasks or to infinity if we are dealing with continuing tasks.

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$$
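To make the definition concrete, here is a minimal sketch that sums a recorded list of rewards into the return of an episodic task; the function name and the sample reward sequence are made up for illustration.

```python
def undiscounted_return(rewards):
    """Return G_t as the plain sum of the rewards R_{t+1}, ..., R_T of one episode."""
    return sum(rewards)

# hypothetical reward sequence collected during one episode
episode_rewards = [-1, -1, 5, -1, 5]
print(undiscounted_return(episode_rewards))  # 7
```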

To demonstrate this idea with a simple example, let's once again take the MDP from the previous section. We are dealing with two states and an agent that takes random actions. Whenever the agent lands in the state 1 it receives a reward of 5, otherwise it receives a negative reward of -1.

[Interactive diagram: the two-state MDP with states 0 and 1, actions A and B, and rewards -1 and 5, simulated under the random policy.]
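As a rough sketch of what the interactive simulation does, the snippet below runs the two-state MDP under a random policy and accumulates the rewards. The transition probabilities and the helper names are placeholders chosen for illustration, since the exact values live in the interactive figure; only the reward scheme (5 for landing in state 1, -1 for landing in state 0) follows the text.

```python
import random

# hypothetical transition probabilities P(next state = 1 | state, action);
# the actual values are set in the interactive figure above
P_NEXT_IS_1 = {(0, "A"): 0.2, (0, "B"): 0.8,
               (1, "A"): 0.2, (1, "B"): 0.8}

def step(state, action):
    """Sample the next state and the reward of the transition."""
    next_state = 1 if random.random() < P_NEXT_IS_1[(state, action)] else 0
    reward = 5 if next_state == 1 else -1
    return next_state, reward

# random policy: pick A or B with equal probability
state, G = 0, 0
for _ in range(10):
    action = random.choice(["A", "B"])
    state, reward = step(state, action)
    G += reward
print("return after 10 steps:", G)
```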

In the example below we calculate the return from the perspective of the initial state $S_0$. Each time the agent receives a reward we add its value to the return.

[Interactive example: running counters for the timestep $t$ and the return $G_t$, both starting at 0.]

In order to reduce the value of future rewards we discount each reward by applying a discount factor $\gamma$, a value between 0 and 1. The farther away the reward is from the current time step $t$, the lower its contribution to the return $G_t$.

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
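A minimal sketch of the discounted return, reusing the hypothetical reward sequence from above; the function name and the value of $\gamma$ are arbitrary choices for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum over k of gamma^k * R_{t+k+1}, computed from a recorded reward list."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

episode_rewards = [-1, -1, 5, -1, 5]
print(discounted_return(episode_rewards, gamma=0.9))
```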

The easiest way to illustrate this concept is by thinking about the time value of money. If we have to choose between getting $1000 now and getting $1000 in 10 years, we should definitely choose the $1000 in the present. The money could be invested for a ten-year period, such that in 10 years the investor gets back an amount that is larger than $1000. A reward now is always more valuable than a future reward.

Discounting is especially important for continuing tasks. If the episode has no natural ending, the return $G_t$ could theoretically become infinite. Discounted rewards, on the other hand, approach 0 the farther they lie in the future, which keeps the return finite as long as $\gamma$ is below 1 and the rewards are bounded.
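The convergence argument is a geometric series: if every reward is bounded by some constant $R_{\max}$ (a symbol introduced here just for this argument), the discounted return cannot exceed a finite value.

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \leq \sum_{k=0}^{\infty} \gamma^k R_{\max} = \frac{R_{\max}}{1 - \gamma}, \qquad 0 \leq \gamma < 1$$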

The interactive example below shows how fast the discount weight $\gamma^k$ decays depending on the value of $\gamma$. The lower the gamma, the faster the progression towards 0.

[Interactive plot: the discount weight over the number of steps (0 to 100), shown for a discount factor of 0.95.]
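A few lines along these lines reproduce the idea behind the plot; the chosen gammas and step counts are arbitrary.

```python
# print how quickly gamma^k shrinks for a few values of gamma
for gamma in (0.5, 0.9, 0.95):
    print(gamma, [round(gamma ** k, 4) for k in (1, 10, 50, 100)])
```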

The return is very tightly connected to the strategy that the agent is following. This strategy is called the policy. A better policy will obviously generate better returns and vice versa.

Info

A policy is a mapping from a state to a probability of an action.

A policy $\pi$ is a mapping from a particular state $s$ to the probability of an action $a$: $\pi(a \mid s)$. To take an action in a state the agent simply draws an action from the policy distribution: $A_t \sim \pi(\cdot \mid S_t)$.

In the above example we are dealing with 2 possible states and 2 possible actions. We can therefore simply store the policy in a mapping table. In future chapters we will use neural networks for more complex MDPs.

State Action Probability
0 A 0.5
0 B 0.5
1 A 0.5
1 B 0.5
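A minimal sketch of such a tabular policy in code, assuming the two states are labeled 0 and 1 and the actions A and B as in the table above; the function name is made up for illustration.

```python
import random

# tabular policy: state -> {action: probability}
policy = {
    0: {"A": 0.5, "B": 0.5},
    1: {"A": 0.5, "B": 0.5},
}

def sample_action(policy, state):
    """Draw an action from the policy distribution pi(. | state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs)[0]

print(sample_action(policy, state=0))  # prints "A" or "B", each with 50% probability
```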

The goal of the agent is to find a policy that maximizes returns, but simply maximizing the return is tricky, because in reinforcement learning we often deal with stochastic policies and environments. That stochasticity produces different trajectories and therefore different returns $G_t$. But how can we measure how good it is for the agent to use a certain policy $\pi$ if the generated returns are not consistent? By utilizing value functions.

Info

A value function measures the expected value of returns.

The state-value function $v_\pi(s)$ calculates the expected return when the agent is in the state $s$ and follows the policy $\pi$.

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

The action-value function $q_\pi(s, a)$ calculates the expected return when the agent is in the state $s$, takes the action $a$ first and then keeps following the policy $\pi$.

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
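Even without the tools introduced later, both definitions can be approximated by brute force: run many trajectories and average the (truncated) discounted returns. The sketch below does this for the two-state MDP, reusing the hypothetical dynamics and policy table from the earlier snippets; the transition probabilities, horizon and $\gamma$ are all illustrative assumptions.

```python
import random

# placeholder dynamics: P(next state = 1 | state, action)
P_NEXT_IS_1 = {(0, "A"): 0.2, (0, "B"): 0.8,
               (1, "A"): 0.2, (1, "B"): 0.8}

def step(state, action):
    next_state = 1 if random.random() < P_NEXT_IS_1[(state, action)] else 0
    return next_state, (5 if next_state == 1 else -1)

def sample_action(policy, state):
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs)[0]

def estimate_v(policy, s, gamma=0.9, horizon=50, episodes=5000):
    """Monte Carlo estimate of v_pi(s): average truncated discounted return."""
    total = 0.0
    for _ in range(episodes):
        state, g, discount = s, 0.0, 1.0
        for _ in range(horizon):
            state, reward = step(state, sample_action(policy, state))
            g += discount * reward
            discount *= gamma
        total += g
    return total / episodes

random_policy = {0: {"A": 0.5, "B": 0.5}, 1: {"A": 0.5, "B": 0.5}}
print("v(0) under the random policy ~", estimate_v(random_policy, s=0))
```

An analogous estimate of $q_\pi(s, a)$ would fix the first action to $a$ and then follow the policy, matching the definition above.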

For the time being we do not have the tools to exactly calculate the value function for each state, but we can make an educated guess. Let's take the same Markov decision process we have been dealing with so far, but only focus on the state 0.

The random policy that we have been using so far is clearly not optimal.

[Interactive diagram: the two-state MDP (states 0 and 1, actions A and B, rewards -1 and 5) under the random policy.]

The agent below implements a policy that tends to take the action B more often. This behaviour generates a higher expected return.

[Interactive diagram: the same MDP under a policy that prefers action B.]

A state-value function allows the agent to assign a goodness value to each of the states for a given policy: change the policy and you can observe whether the values improve. But the state-value function alone does not tell us which change to make. That is where the action-value function comes into play. The action-value function allows us to reason about changing a single action: what would happen if we generally kept the same policy and only changed the next action? Would that new policy be more beneficial to us? If yes, maybe we should change our policy. And the goal of the agent is to find the optimal policy.

Info

For the agent to solve the Markov decision process means to find the optimal policy.

Optimality implies that there is a way to compare different policies and to determine which of the policies is better. In a Markov decision process value functions are used as the metric of the goodness of a policy. The policy $\pi$ is said to be better than the policy $\pi'$ if and only if the value function of $\pi$ is greater than or equal to the value function of $\pi'$ for all states in the state set $\mathcal{S}$.

$$\pi \geq \pi' \iff v_\pi(s) \geq v_{\pi'}(s) \quad \text{for all } s \in \mathcal{S}$$

The optimal policy $\pi_*$ is therefore the policy that is better than (or at least not worse than) any other policy.

$$v_{\pi_*}(s) \geq v_\pi(s) \quad \text{for all } \pi \text{ and all } s \in \mathcal{S}$$
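Expressed in code, the comparison is a simple element-wise check over the state set; the state-value dictionaries used here are made-up stand-ins for the value functions of two policies.

```python
def is_better_or_equal(v_pi, v_pi_prime):
    """True if v_pi(s) >= v_pi_prime(s) for every state s."""
    return all(v_pi[s] >= v_pi_prime[s] for s in v_pi)

# hypothetical state values for two policies on the two-state MDP
v_random = {0: 1.8, 1: 2.1}
v_prefers_b = {0: 3.4, 1: 3.6}
print(is_better_or_equal(v_prefers_b, v_random))  # True
```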

It is not hard to imagine the optimal strategy for the two-state MDP. The agent needs to always take the action B. This policy increases the probability of landing in the state 1 and thus increases the probability of getting a reward of 5. In other words, always taking the action B maximizes the value function.
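In the tabular form used above, this optimal policy is simply a deterministic table that puts all probability mass on B.

```python
# deterministic optimal policy for the two-state MDP: always take B
optimal_policy = {
    0: {"A": 0.0, "B": 1.0},
    1: {"A": 0.0, "B": 1.0},
}
```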

[Interactive diagram: the two-state MDP under the deterministic policy that always selects action B.]

How we can numerically find the optimal policy $\pi_*$ and the corresponding value functions $v_*$ and $q_*$ is going to be the topic of the rest of the reinforcement learning block. The essence of reinforcement learning is to find the optimal policy, given that the environment contains a Markov decision process under the hood.