States, Actions, Rewards

In reinforcement learning the agent and the environment interact with each other. In this context interaction means that signals flow sequentially between the two. The agent and the environment interact continuously, each reacting to the data sent by the other.

Info

In reinforcement learning the sequential information flow between the agent and the environment is called interaction.

It is important to understand that this stream of data is exchanged in a strictly sequential way. When the environment sends a signal, for example, it has to wait until it receives the response signal from the agent. Only then can the environment generate new data. Reinforcement learning works in discrete timesteps: each iteration in which the environment and the agent exchange their data constitutes a timestep.

Info

In reinforcement learning there are just three types of data that need to be sent between the agent and the environment: states, actions and rewards.

Figure: the interaction loop between the agent and the environment, with the state, action and reward signals passing between them.

The interaction cycle starts with the agent receiving the initial state s_0 from the environment. Based on that state the agent generates the action a_0 it would like to take, which is transmitted to the environment. The environment transitions into the new state s_1 and calculates the reward r_1. The new state and the reward are finally transmitted to the agent. The agent can use the reward as feedback to learn, while the new state is used to generate the next action a_1, and the cycle keeps repeating, potentially forever.
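
To make this cycle concrete, here is a minimal sketch of the interaction loop in Python. GridworldEnv and RandomAgent are hypothetical stubs, assuming a reset/step interface in the style of common RL libraries; the only point is to show how states, actions and rewards flow back and forth at each timestep.

```python
import random


class GridworldEnv:
    """Hypothetical environment stub; only the reset/step interface matters here."""

    def reset(self):
        # The environment sends the initial state s_0 to the agent.
        return (0, 0)

    def step(self, action):
        # Transition into a new state and compute the reward (dummy values here).
        new_state = (0, 1)
        reward = -1.0
        done = False
        return new_state, reward, done


class RandomAgent:
    """Hypothetical agent stub that picks one of the four actions at random."""

    def act(self, state):
        return random.randint(0, 3)  # 0=north, 1=east, 2=south, 3=west


env = GridworldEnv()
agent = RandomAgent()

state = env.reset()                          # environment sends the initial state s_0
for t in range(100):                         # each loop iteration is one timestep t
    action = agent.act(state)                # agent responds with action a_t
    state, reward, done = env.step(action)   # environment returns s_{t+1} and r_{t+1}
    if done:                                 # stop once the episode has ended
        break
```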

State

Info

The state is the representation of the current condition of the environment.

The state describes what the environment actually looks like. It is the condition that the agent is facing and the one piece of information that the agent bases its decisions on.

In our simple gridworld example all the agent needs to know to make its decisions is the location of the circle in the environment. In the starting position the state would be row=0 and column=0, i.e. (0, 0). The state to the right of the starting position would be row=0 and column=1, i.e. (0, 1). Based on that position the agent can choose the path towards the triangle.

Figure: the gridworld, with each position indexed by its row and column.

This is not the only way to represent the state of the environment. The state can be represented by a scalar, a vector, a matrix or a tensor and can be either discrete or continuous. In future chapters we will see more complex environments and learn how to deal with those. For now it is sufficient to know what role the state plays in the agent-environment interaction.
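
As a rough illustration, the snippet below encodes the same gridworld position in a few of these forms; the grid size and the variable names are assumptions made purely for the example.

```python
import numpy as np

row, col = 0, 1          # the circle sits in row 0, column 1
n_rows, n_cols = 5, 5    # grid size is an assumption, purely for illustration

# Scalar: a single discrete index for the cell.
state_scalar = row * n_cols + col             # -> 1

# Vector: the (row, column) coordinates.
state_vector = np.array([row, col])           # -> [0, 1]

# Matrix: a one-hot grid that marks the circle's cell.
state_matrix = np.zeros((n_rows, n_cols))
state_matrix[row, col] = 1.0
```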

Action

Info

The action is the representation of the decision of the agent.

The action is the behaviour the agent chooses based on the state of the environment. Like the state the action can be a scalar, a vector, a matrix or a tensor of discrete or continuous values.

In the gridworld example the agent can move north, east, south and west. Each action is encoded by a discrete scalar value, where north equals 0, east equals 1, south equals 2 and west equals 3.
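
A minimal sketch of such an encoding in Python; how each action shifts the row and column is an assumption about the environment's layout, not something fixed by the text.

```python
# Discrete scalar encoding of the four actions.
ACTIONS = {0: "north", 1: "east", 2: "south", 3: "west"}

# Assumed effect of each action on the (row, column) position:
# north decreases the row, east increases the column, and so on.
ACTION_OFFSETS = {
    0: (-1, 0),   # north
    1: (0, 1),    # east
    2: (1, 0),    # south
    3: (0, -1),   # west
}


def apply_action(state, action):
    """Return the next (row, column) position, ignoring grid borders for simplicity."""
    row, col = state
    d_row, d_col = ACTION_OFFSETS[action]
    return (row + d_row, col + d_col)
```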


Reward

Info

The reward is a scalar signal used to reinforce certain behaviour of the agent.

The reward is what the agent receives from the environment in response to an action. It is the value that the environment uses to reinforce a behaviour and it is the value that the agent uses to improve its behaviour.

Unlike the action or the state, the reward has to be a scalar, a single number; it is not possible for the reward to be a vector, a matrix or a tensor. As expected, larger numbers represent better rewards, so a reward of +1 is higher than a reward of -1.

In the gridworld example the agent receives a reward of -1 for each step taken, with the exception of the step that reaches the triangle, for which the agent receives a reward of +1.
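
As a minimal sketch, a reward function for this gridworld could look like the following; the position of the triangle is a hypothetical placeholder, since the text does not fix it.

```python
TRIANGLE = (4, 4)  # hypothetical position of the triangle; not specified in the text


def reward(next_state):
    """Return -1 for an ordinary step and +1 when the triangle is reached."""
    return 1.0 if next_state == TRIANGLE else -1.0
```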
