Definition Of Reinforcement Learning
There are probably dozens of formal definitions of reinforcement learning. These definitions do not necessarily contradict each other; when we look a little closer at what they are trying to convey, they describe something very similar. In this section we are going to look at one definition that captures the essence of reinforcement learning in a particularly clear way.
Info
Reinforcement Learning is characterized by learning through trial and error and delayed rewards[1] .
The definition consists of three distinct parts: Learning, Trial and Error, and Delayed Rewards. In order to understand the complete definition we will deconstruct the sentence and look at each part individually.
Learning
Learning is probably the most obvious part of the definition. When the agent starts to interact with the environment, it knows nothing about that environment, yet the environment contains some goal that the agent has to achieve.
In the example above the agent is expected to move the circle from the starting cell (top left corner) to the goal cell (bottom left corner).
Info
Learning means that the agent gets better at achieving the goal of the environment over time.
When we talk about learning we imply that the agent gets better at achieving that particular goal over time. The agent would probably move randomly at first, but over time learn the best possible (meaning the shortest) route.
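To make this interaction loop a bit more concrete, here is a minimal sketch of such a grid world in Python. Everything in it (the grid size, the cell coordinates, the function names) is an illustrative assumption made for this sketch, not part of any particular library:

```python
import random

# A minimal, illustrative 5x5 grid world: the agent starts in the top left
# corner and has to reach the goal cell in the bottom left corner.
ROWS, COLS = 5, 5
START = (0, 0)            # top left corner
GOAL = (ROWS - 1, 0)      # bottom left corner
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply an action; moves that would leave the grid keep the agent in place."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    new_row = min(max(row + d_row, 0), ROWS - 1)
    new_col = min(max(col + d_col, 0), COLS - 1)
    return (new_row, new_col)

# An agent that has not learned anything yet simply moves at random
# until it stumbles upon the goal.
state, num_steps = START, 0
while state != GOAL:
    state = step(state, random.choice(list(ACTIONS)))
    num_steps += 1
print(f"random agent reached the goal after {num_steps} steps")
```

A learning agent would need fewer and fewer steps over time, while the purely random agent above shows what the very first attempts might look like.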
Rewards
The question still remains: how exactly does the agent figure out what the goal of the environment actually is? The environment the agent interacts with gives feedback about the agent's behaviour by handing out a reward after every single step that the agent takes.
Info
In reinforcement learning the agent learns to maximize rewards. The goal of the environment has to be implicitly contained in the rewards.
If the goal of the grid world environment is to move the circle to the cell with the triangle as fast as possible, the environment could, for example, give a positive reward when the agent reaches the goal cell and punish the agent in every other case.
The above animation represents that idea by color-coding rewards. The red grid cells give a reward of -1. The blue grid cell gives a reward of +1. If the agent takes a random route to the triangle, the sum of rewards is going to be very negative. If, on the other hand, the agent takes the direct route to the triangle, as in the animation above, the sum of rewards is going to be larger (but still negative). The agent learns through the reward feedback that some sequences of actions are better than others. Generally speaking, the agent needs to find the routes that produce a high sum of rewards.
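This reward scheme can be written down directly. The following sketch reuses the hypothetical grid from the previous snippet and simply assigns -1 to every ordinary (red) cell and +1 to the goal (blue) cell, then sums the rewards collected along a route:

```python
def reward(state):
    """-1 for every ordinary (red) cell, +1 for the goal (blue) cell."""
    return 1 if state == GOAL else -1

def sum_of_rewards(route):
    """Sum the rewards collected along a sequence of visited cells."""
    return sum(reward(state) for state in route)

# The direct route from the start at (0, 0) straight down to the goal at (4, 0).
direct_route = [(1, 0), (2, 0), (3, 0), (4, 0)]
# A detour that wanders off before reaching the goal.
detour_route = [(0, 1), (1, 1), (1, 0), (2, 0), (3, 0), (4, 0)]

print(sum_of_rewards(direct_route))   # -2: larger, but still negative
print(sum_of_rewards(detour_route))   # -4: the detour is punished
```

Note that the goal is never stated explicitly anywhere; it is only implied by the fact that shorter routes collect a higher sum of rewards.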
Trial and Error
The problem with rewards is that it is not clear from the very beginning which path produces the highest possible sum of rewards. It is therefore not clear which sequence of actions the agent needs to take. In reinforcement learning the only feedback the agent receives is the reward signal, and even if the agent collects a positive sum of rewards, it never knows whether it could have done better. Unlike in supervised learning, there is no teacher (a.k.a. supervisor) to tell the agent what the best behaviour is. So how can the agent figure out which sequence of actions produces the highest sum of rewards? The only way it can: by trial and error.
The agent has to try out different behaviours, producing different sequences of rewards, to figure out which sequence of actions is the optimal one. How long it takes the agent to find a good sequence of decisions depends on the complexity of the environment and the employed learning algorithm.
Trial Nr. 1
Trial Nr. 2
Trial Nr. 3
The above figures show what the sequences of actions might look like in the grid world environment after three trials. In the second trial the agent takes the shortest route and therefore achieves the highest sum of rewards. It might therefore be a good idea to follow that second sequence of actions more often than the sequences taken in the first and third trial.
Info
In the context of reinforcement learning, trial and error means trying out different sequences of actions and comparing the resulting sums of rewards in order to learn optimal behaviour.
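One way to picture this, building on the illustrative grid world and reward function sketched above, is to let the agent run a handful of purely random trials, record the sum of rewards of each, and remember the best sequence of actions found so far. This is not a full learning algorithm, just a sketch of the trial-and-error idea:

```python
def run_trial(max_steps=50):
    """Follow a random sequence of actions and record it with its sum of rewards."""
    state, actions, total = START, [], 0
    for _ in range(max_steps):
        action = random.choice(list(ACTIONS))
        state = step(state, action)
        actions.append(action)
        total += reward(state)
        if state == GOAL:
            break
    return actions, total

# Compare a few trials and remember the best sequence of actions seen so far.
best_actions, best_total = None, float("-inf")
for trial in range(1, 4):
    actions, total = run_trial()
    print(f"Trial Nr. {trial}: {len(actions)} steps, sum of rewards = {total}")
    if total > best_total:
        best_actions, best_total = actions, total
```

Real reinforcement learning algorithms are much smarter about how they balance trying new behaviour against repeating behaviour that worked well, but the basic loop of acting, observing rewards, and comparing outcomes is the same.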
Delayed
In reinforcement learning the agent often needs to take dozens or even thousands of steps before a reward is received. By that point there has been a long succession of steps, and the agent has to figure out which steps, and to what degree, are responsible for the reward, so that it can select the decisions that lead to a good sequence of rewards more often.
Info
In reinforcement learning rewards for an action are often delayed, which leads to the credit assignment problem.
Which of the steps is responsible for a particular reward? Is it the action just prior to the reward? Or the one before that? Or the one before that? Reinforcement learning has no easy answer to the question of which decision gets the credit for the reward. This problem is called the credit assignment problem.
Let's assume that in the grid world example the agent took 10 steps to reach the goal. The first reward can only be assigned to the first action. The second reward can be assigned to the first and the second action. And so on. The last (and positive) reward can theoretically be assigned to any of the actions taken prior to the reward.
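A tiny sketch makes this counting explicit. For the hypothetical 10-step episode above, the reward observed after step t could in principle be credited to any of the first t actions:

```python
num_steps = 10  # the hypothetical episode from the example above

# The reward received after step t could, in principle, be credited to any of
# the actions 1..t. The last (positive) reward has the most candidate actions.
for t in range(1, num_steps + 1):
    candidates = list(range(1, t + 1))
    print(f"reward after step {t} could be assigned to actions {candidates}")
```

Deciding how to spread credit over these candidate actions is exactly what reinforcement learning algorithms have to solve.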