There are fundamental principles that define reinforcement learning problems:
The world is broken into states, each offering a different set of available actions.
When an action is executed, evaluative feedback (a reward) is received from the environment and the next state is observed.
Some algorithms rely on a model of the environment: the probability of reaching each successor state given an action, and the expected reward for that transition.
The accumulation of reward over time, discounted so that immediate rewards weigh more than distant ones, is known as the return.
The policy expresses which action to take in each state, either as a probability distribution over actions (stochastic) or as a single action (deterministic).
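The discounted return can be sketched in a few lines. The reward sequence and discount factor below are made-up values for illustration, not taken from the chapter:

```python
# Hypothetical reward sequence observed over one episode.
rewards = [1.0, 0.0, 0.0, 5.0]
gamma = 0.9  # discount factor: how much future rewards are worth now

# Return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
G = sum(gamma**t * r for t, r in enumerate(rewards))
# With these numbers: 1.0 + 0 + 0 + 0.9**3 * 5.0 = 4.645
```

Lower values of gamma make the agent short-sighted; gamma close to 1 makes distant rewards almost as important as immediate ones.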
Using the concepts of state and action values, algorithms can estimate how beneficial each situation is. There are three major RL approaches:
Dynamic programming relies on full mathematical knowledge of the environment's model, updating all the estimates together as a set of simultaneous equations.
Monte Carlo methods learn from many randomized episodes, using statistics to extract the underlying trends. A full-depth backup propagates the reward from the end of each episode back through every state visited.
Temporal-difference learning techniques update the estimate of each state incrementally, based on the values of neighboring states (bootstrapping).
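The dynamic-programming approach can be sketched as value iteration on a toy model. The three-state chain, its transitions, and its rewards below are invented for illustration; the key point is that the full model is known and every estimate is swept at once:

```python
# Dynamic programming (value iteration) on a made-up 3-state chain MDP.
# States 0..2; state 2 is terminal, reached with reward 1.0.
gamma = 0.9
# model[s][a] = (next_state, reward); deterministic for simplicity.
model = {
    0: {"left": (0, 0.0), "right": (1, 0.0)},
    1: {"left": (0, 0.0), "right": (2, 1.0)},
}
V = {0: 0.0, 1: 0.0, 2: 0.0}  # terminal state's value stays 0

for sweep in range(100):
    # Update every state's estimate from the model, all in one sweep.
    V_new = dict(V)
    for s, actions in model.items():
        V_new[s] = max(r + gamma * V[s2] for s2, r in actions.values())
    converged = max(abs(V_new[s] - V[s]) for s in V) < 1e-6
    V = V_new
    if converged:
        break
# V[1] converges to 1.0 and V[0] to 0.9 (one step further from the goal).
```

Because the transition model is known exactly, no interaction with the environment is needed at all.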
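Monte Carlo estimation can be sketched on a made-up random walk: play whole episodes, then back the final reward up through every state visited and average the observed returns. The five-state layout, start state, and rewards are illustrative assumptions, not the chapter's example:

```python
import random

# Monte Carlo value estimation on a made-up 5-state random walk.
# States 0..4; episodes start at 2; stepping off either end terminates,
# with reward 1.0 only for exiting on the right.
random.seed(1)
gamma = 1.0
returns = {s: [] for s in range(5)}  # observed returns per state

for episode in range(3000):
    s, visited = 2, []
    while 0 <= s <= 4:
        visited.append(s)
        s += random.choice([-1, 1])
    final_r = 1.0 if s > 4 else 0.0
    # Full-depth backup: propagate the end-of-episode reward back
    # through every visited state.
    G = final_r
    for s in reversed(visited):
        returns[s].append(G)
        G = gamma * G  # would shrink the return if gamma < 1

V = {s: sum(rs) / len(rs) for s, rs in returns.items() if rs}
# Estimates rise from left to right; the middle state sits near 0.5.
```

Note that no estimate is updated until an episode finishes, which is exactly what temporal-difference methods avoid.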
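The bootstrapping idea behind temporal-difference learning can be shown with tabular TD(0) on the same kind of made-up random walk (again, the states, start position, and rewards are illustrative assumptions). Each step nudges a state's value toward the reward plus the neighboring state's current estimate, without waiting for the episode to end:

```python
import random

# Tabular TD(0) prediction on a made-up 5-state random walk.
# States 0..4; episodes start at 2; stepping off either end terminates,
# with reward 1.0 only for exiting on the right.
random.seed(0)
gamma, alpha = 1.0, 0.1  # alpha is the learning rate
V = [0.0] * 5

for episode in range(5000):
    s = 2
    while True:
        s2 = s + random.choice([-1, 1])
        if s2 < 0 or s2 > 4:
            # Terminal transition: no successor value to bootstrap from.
            r = 1.0 if s2 > 4 else 0.0
            V[s] += alpha * (r - V[s])
            break
        # Bootstrapped update from the neighboring state's estimate.
        V[s] += alpha * (0.0 + gamma * V[s2] - V[s])
        s = s2
# Values increase toward the right exit, with the middle state near 0.5.
```

Because each update uses an existing estimate rather than a completed episode, TD methods learn online, which suits the in-game adaptation pursued in the next chapter.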
Each of these techniques is applicable in a different context. The next chapter uses a temporal-difference learning approach to provide adaptive strategies for deathmatch games.
The animat demonstrating the theory in this chapter is known as Inforcer. It uses various forms of reinforcement learning in different parts of the AI architecture. Inforcer benefits from having a modular reward signal, which isolates the relevant feedback for each component. If learning happens uniformly, the early behaviors are not particularly realistic, but they do reach acceptable levels over time.