Exploration-Exploitation Dilemma

At each timestep the agent has to decide whether to explore the environment or to exploit its current knowledge about the environment. The problem the agent faces when choosing between the two options is the so-called exploration-exploitation dilemma.

Info

The agent can either explore the environment or exploit the already accumulated knowledge. The exploration-exploitation dilemma describes the fact that the agent cannot do both at the same time.

On the one hand the agent aims to collect the highest sum of rewards achievable with its current knowledge: it wants to exploit. On the other hand, in order to find a sequence of actions that leads to an even higher sum of rewards, the agent needs to explore the environment. The dilemma is that the agent cannot do both at the same time: at each single step it either explores or exploits.
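A common way to make this per-step choice concrete is an epsilon-greedy rule: with a small probability epsilon the agent explores by picking a random action, otherwise it exploits its current value estimates. The sketch below is only an illustration; the names q_values and epsilon are assumptions and not taken from the text.

```python
import random

def epsilon_greedy_action(q_values, epsilon=0.1):
    """Choose an action: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        # Explore: ignore the current knowledge and pick a random action.
        return random.randrange(len(q_values))
    # Exploit: act greedily with respect to the current value estimates.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```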

Exploration in Deterministic Environments

Info

Deterministic Environment: Given the same state of the environment and the same action by the agent, the next state and the corresponding reward are always the same.

The grid environment we covered so far was deterministic. We assumed that there is no uncertainty and that given the same circumstances the outcome would be the same. For example, whenever the agent chose the action to go right in the first state, the environment transitioned in such a way that the circle actually moved right. Each and every single time. Yet even in a deterministic environment the agent has to explore in order to find the optimal sequence of actions.

The grid world below shows an agent that has discovered the shortest route from the starting position to the goal position (triangle). At each timestep the agent earns a negative reward of -1. Once the agent reaches the goal, the environment gives a positive reward of +1 and the game restarts. The agent might keep taking the same path to reach the triangle, but if it kept exploring the environment it could discover that there is actually a big reward of +10 in the bottom right corner. The agent could pick up that reward first and then keep moving towards the goal. The high reward would make up for the few additional steps. In this deterministic example exploration would enable the agent to learn a strategy with a higher sum of rewards.

[Grid world rewards, read row by row]
-1 -1 -1 -1 -1
-1 -1 -1 -1 -1
 0  0  0 -1 -1
-1 -1 -1 -1 -1
 1 -1 -1 -1 10
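To make the idea of a deterministic transition concrete, here is a minimal sketch of a grid-world step function: the same state and action always produce the same next state and reward. The 5 x 5 size, the coordinate convention and the exact cells of the goal and the bonus are assumptions for illustration; the rewards of -1 per step, +1 at the goal and +10 for the bonus follow the description above.

```python
GRID_SIZE = 5                            # assumed size of the grid world
GOAL, BONUS = (4, 0), (4, 4)             # assumed cells of the +1 goal and the +10 reward
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def deterministic_step(state, action):
    """The same (state, action) pair always yields the same next state and reward."""
    row, col = state
    d_row, d_col = MOVES[action]
    # Clamp to the grid so the agent cannot step off the board.
    next_state = (min(max(row + d_row, 0), GRID_SIZE - 1),
                  min(max(col + d_col, 0), GRID_SIZE - 1))
    if next_state == GOAL:
        return next_state, 1             # reaching the goal pays +1
    if next_state == BONUS:
        return next_state, 10            # the hidden bonus pays +10
    return next_state, -1                # every other step costs -1
```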

Exploration in Stochastic Environments

Info

Stochastic Environment: Given the same state of the environment and the same action by the agent, the next state and the corresponding reward are calculated using a probability distribution.

Most environments (and the real world, for that matter) are not deterministic but stochastic. That means the next state and reward are drawn from a probability distribution: given the same state and action, the next state and reward will not always be the same.

As in the previous examples, the agent above is already trained and tries to follow the shortest route. This time, however, the grid world represents a stochastic environment. With a probability of 50% the environment transitions into the state the agent desires; with a probability of 50% the direction is chosen randomly (which might again be the desired direction). For an untrained agent this makes finding the shortest route a lot more complex. The agent does not know exactly what the distribution of the environment looks like. Therefore it has to explore in order to determine the path that leads to the highest sum of rewards in expectation.
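As a rough sketch of such a transition rule, the function below executes the chosen action with probability 0.5 and otherwise draws a direction uniformly at random (which may again be the chosen one). It reuses MOVES and deterministic_step from the deterministic sketch above; the 50/50 split follows the description, everything else is illustrative.

```python
import random

def stochastic_step(state, action):
    """Execute the chosen action with probability 0.5, otherwise pick a
    direction uniformly at random (which may coincide with the chosen one)."""
    if random.random() < 0.5:
        executed = action
    else:
        executed = random.choice(list(MOVES))   # random direction, possibly the desired one
    # The underlying grid dynamics are still the deterministic step from above.
    return deterministic_step(state, executed)
```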

Info

The goal of the agent is to maximize the expected sum of rewards.

In stochastic environments the agent has to maximize the expected sum of rewards. Intuitively speaking, that means the agent has to choose the strategy that would yield the largest average sum of rewards if it played an infinite number of games.
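One way to read "expected sum of rewards" operationally: fix a strategy, play many games, and average the returns; the average approaches the expectation as the number of games grows. The sketch below assumes some policy(state) function and reuses stochastic_step and GOAL from the sketches above; the names and the episode limits are illustrative.

```python
def estimate_expected_return(policy, start_state, n_episodes=10_000, max_steps=100):
    """Monte Carlo estimate of the expected sum of rewards under a fixed policy."""
    total = 0.0
    for _ in range(n_episodes):
        state, episode_return = start_state, 0.0
        for _ in range(max_steps):
            state, reward = stochastic_step(state, policy(state))
            episode_return += reward
            if state == GOAL:                # the game restarts once the goal is reached
                break
        total += episode_return
    return total / n_episodes                # sample mean approximates the expectation
```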