Reinforcement Learning Algorithms
There are three major families of algorithms, based respectively on mathematical models (dynamic programming), statistics (Monte Carlo methods), and incremental learning (temporal differences).
Dynamic Programming Techniques
Essentially, dynamic programming (DP) offers a family of brute-force approaches for computing policies and state values. Despite being exhaustive, they prove surprisingly efficient. DP works in two phases: the first estimates the value of each state under the current policy (policy evaluation), and the second improves the policy using those state values (policy improvement) [Sutton97].
Taking a popular example with a 2D maze, the policy evaluation computes the distance of each square from the exit, and the policy improvement tries to change individual steps to reach the exit quicker.
One thing to note is that DP solutions rely on knowledge of transition probabilities and expected rewards (from state s to s' given action a). Let's look at two ways to combine these separate phases into algorithms for computing optimal policies.
The initialization is done by assigning suitable estimates to the states (for instance, 0) and picking any policy at random. Then, phase 1 evaluates the policy by computing up-to-date state values (see Listing 46.2).
Listing 46.2 Knowing the World Model, Policy Iteration Can Be Used to Compute the Value of Each State (The loop terminates when the updates are small enough. The state value estimation function from Listing 46.1 is used.)
repeat
    max = 0
    for each state s
        old = V(s)
        # compute the new estimate for this state
        V(s) = state_value( s )
        # check if it was improved
        max = MAX(max, ABS(old - V(s)))
    end for
until max < threshold
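To make the loop concrete, here is a minimal Python sketch of iterative policy evaluation. The corridor world, its transitions, rewards, and policy are invented for illustration, and the one-step lookahead plays the role of the state_value() function from Listing 46.1.

```python
def policy_evaluation(policy, transitions, rewards, gamma=0.9,
                      threshold=1e-6):
    """Iteratively estimate V(s) for a fixed deterministic policy."""
    V = {s: 0.0 for s in transitions}
    while True:
        max_delta = 0.0
        for s in transitions:
            old = V[s]
            a = policy[s]
            # one-step lookahead under the policy (the state_value() role)
            V[s] = rewards[s][a] + gamma * V[transitions[s][a]]
            max_delta = max(max_delta, abs(old - V[s]))
        if max_delta < threshold:
            return V

# Toy corridor 0 -> 1 -> 2; reaching the exit (state 2) pays a reward of 1.
transitions = {0: {"right": 1}, 1: {"right": 2}, 2: {"right": 2}}
rewards = {0: {"right": 0.0}, 1: {"right": 1.0}, 2: {"right": 0.0}}
policy = {s: "right" for s in transitions}
V = policy_evaluation(policy, transitions, rewards)
print(round(V[0], 3), round(V[1], 3))  # 0.9 1.0
```

The square nearest the exit is worth the full reward; the one behind it is worth the same reward discounted once, exactly as in the maze analogy.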
Then, to improve the policy, we check all the actions to see whether there is a better one. The algorithms in Listings 46.2 and 46.3 sit within the same larger loop, interleaving computation until the policy is stable.
Listing 46.3 Policy Improvement Is Used to Scan Each State, Checking Whether There Is a Better Action to Be Taken
p is a deterministic policy suggesting one action per state
Q is the value for each state/action pair

stable = true
for each state s
    previous = p(s)
    best = -infinity
    # determine if any other action is better than the current
    for each action a
        value = Q(s,a)
        # if this action offers an improvement, store it
        if value > best then
            best = value
            p(s) = a
        end if
    end for
    # determine if this iteration was stable
    if previous != p(s) then stable = false
end for
if stable then halt
The policy iteration terminates when no changes are made to the policy. Note that throughout the iterations only one array is used; V is all that is necessary for storing the state value estimates, because updates are made in place. Some implementations store the next iteration in a separate array, but this requires more memory and takes longer to converge, so one array is often sufficient.
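The greedy sweep of Listing 46.3 translates directly to Python. This is a hedged sketch with a made-up Q table; the function mutates the policy in place and reports whether it was already stable, which is the termination test for policy iteration.

```python
def policy_improvement(policy, Q):
    """Make the policy greedy with respect to Q; return True when stable."""
    stable = True
    for s in Q:
        previous = policy[s]
        best = float("-inf")
        for a, value in Q[s].items():
            # if this action offers an improvement, store it
            if value > best:
                best = value
                policy[s] = a
        # determine if this iteration was stable
        if previous != policy[s]:
            stable = False
    return stable

# Invented Q values: in state 0 the policy should switch to "right".
Q = {0: {"left": 0.1, "right": 0.9}, 1: {"left": 0.2, "right": 1.0}}
policy = {0: "left", 1: "right"}
print(policy_improvement(policy, Q))  # False: state 0 changed
print(policy)                         # {0: 'right', 1: 'right'}
```

Calling the function a second time returns True, because the policy is already greedy, so the outer loop would halt.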
Value iteration is a simpler algorithm because it does not interleave phases. Both are handled together by evaluating and improving state value estimates at the same time. The policy is only computed after the value iteration has finished.
The pseudo-code in Listing 46.4 assumes that all the estimates of the state values in V are below the optimal values (that is, V(s) <= V*(s)). This is relatively easy to ensure, and simplifies the code somewhat. If this is not possible, we need to compare the action values among each other, and assign the best to the state value (instead of comparing the action values to the current state value). After the iteration has finished, a final policy-improvement sweep can generate the corresponding policy (which should be close to optimal).
repeat
    max = 0
    # process all the states to improve the policy
    for each state s
        old = V(s)
        # check all the actions for possible improvements
        for each action a
            value = Q( s, a )
            # update the estimate if necessary
            if value > V(s) then V(s) = value
        end for
        # remember the magnitude of the adjustment
        max = MAX(max, ABS(old - V(s)))
    end for
until max < threshold
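The same kind of toy corridor can illustrate value iteration. This Python sketch is an assumption-laden example (deterministic transitions, invented rewards); V is initialized to zero so the estimates start below the optimum, as the pseudo-code requires.

```python
def value_iteration(transitions, rewards, gamma=0.9, threshold=1e-6):
    """Evaluate and improve at once: V(s) absorbs the best action value."""
    V = {s: 0.0 for s in transitions}  # start below V*(s)
    while True:
        max_delta = 0.0
        for s in transitions:
            old = V[s]
            for a in transitions[s]:
                # Q(s, a) under a deterministic world model
                value = rewards[s][a] + gamma * V[transitions[s][a]]
                if value > V[s]:
                    V[s] = value
            max_delta = max(max_delta, abs(old - V[s]))
        if max_delta < threshold:
            return V

# Corridor 0 -> 1 -> 2, with a distracting "left" action that goes backward.
transitions = {0: {"left": 0, "right": 1},
               1: {"left": 0, "right": 2},
               2: {"left": 2, "right": 2}}
rewards = {0: {"left": 0.0, "right": 0.0},
           1: {"left": 0.0, "right": 1.0},
           2: {"left": 0.0, "right": 0.0}}
V = value_iteration(transitions, rewards)
print(round(V[0], 3), round(V[1], 3))  # 0.9 1.0
```

No explicit policy appears anywhere in the loop; a greedy sweep over the finished V would recover it, as the text describes.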
Monte Carlo Approach
Monte Carlo (MC) methods attempt to learn the optimal policy using experience, unlike the DP approaches that rely on a model of the problem. To do this, MC methods take an episodic approach, where multiple state/action sequences are processed.
The Monte Carlo technique relies on episodes to find the optimal state values. Because the outcome is known for each episode, the discounted return for each state can be computed exactly. However, we want to be able to take into account the probabilistic nature of the problem. For this we need to run through many episodes, and collect statistics for each of the runs.
Monte Carlo methods learn the expected return using these statistics. In fact, the average of the discounted returns from each episode is an accurate estimate of the state value. The more examples we have, the more accurate the final estimate. The optimal policy can then be derived from the estimated returns. Figure 46.6 shows another grid world example.
Figure 46.6. A training episode in a simple 2D world. The Monte Carlo technique backs up the expected return from the end of the episode, right to the start.
The pseudo-code in Listing 46.5 assumes we know about the transitions in the world model, so we only need to learn the state values. This is a simple case that is relatively common in games, and also quicker to learn. A more general variation could estimate the Q values for state/action pairs. This would just require replacing each occurrence of s with (s,a)—adapting the data structures accordingly.
The first stage of the algorithm is to generate an arbitrary episode given a policy. This returns a sequence of states, and the corresponding rewards encountered. The arrays containing the expected return are initialized to 0 (see Listing 46.5).
Listing 46.5 A Single Pass of the Monte Carlo Approach (The episode has already been generated using a variation of the policy p.)
h is the length of the episode
state is the array of states representing the episode
reward is the reward collected at each time step
expected is an array containing the sum of estimated returns
total is an array of integers counting the number of estimates

return = 0
# start at the end of the episode and traverse backwards
for t from h down to 1
    # determine the current state
    s = state[t]
    # accumulate the return and add the reward
    return = reward[t] + return * discount
    # collect statistics for the return of s
    total[s]++
    expected[s] += return
    # and store the current estimate
    V(s) = expected[s] / total[s]
end for
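As a Python sketch (every-visit flavor, with two hand-written episodes standing in for a real simulation), the backward pass looks like this; `ret` plays the role of the pseudo-code's return variable, since `return` is a keyword in Python.

```python
from collections import defaultdict

def mc_update(episode, rewards, V, expected, total, discount=0.9):
    """Back up one episode; episode and rewards are parallel lists."""
    ret = 0.0
    # start at the end of the episode and traverse backwards
    for t in range(len(episode) - 1, -1, -1):
        s = episode[t]
        # accumulate the return and add the reward
        ret = rewards[t] + ret * discount
        # collect statistics for the return of s
        total[s] += 1
        expected[s] += ret
        # and store the current estimate
        V[s] = expected[s] / total[s]

# Two invented episodes through states A then B; only the first pays off.
V, expected, total = {}, defaultdict(float), defaultdict(int)
mc_update(["A", "B"], [0.0, 1.0], V, expected, total)
mc_update(["A", "B"], [0.0, 0.0], V, expected, total)
print(round(V["B"], 3), round(V["A"], 3))  # 0.5 0.45
```

Each estimate is just the running average of the discounted returns observed so far, so it sharpens as more episodes are collected.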
One question arises: How are the episodes generated? The key concept is to make sure that all the state/action pairs are traversed. This is important to ensure optimality. One way to do this is known as exploring starts. The idea is that we can generate episodes according to any policy, as long as the first state/action pair is chosen randomly (again, with a good random-number generator).
This is proven to work, but it can be a bit slow. A good strategy is to start with random exploration at first, and then decrease the randomness as the results start to improve. This is similar to controlling the temperature with the Boltzmann equation in simulated annealing (discussed in Chapter 17).
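A sketch of what that temperature-controlled exploration could look like; the Q values are invented, and this softmax (Boltzmann) weighting is one common choice rather than the only one.

```python
import math

def boltzmann_probabilities(q_values, temperature):
    """P(a) proportional to exp(Q(a) / T): high T explores, low T exploits."""
    weights = [math.exp(q / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

# Early on, a high temperature keeps the choice nearly uniform...
print([round(p, 2) for p in boltzmann_probabilities([1.0, 2.0], 100.0)])
# ...and as the temperature cools, the best-valued action dominates.
print([round(p, 2) for p in boltzmann_probabilities([1.0, 2.0], 0.1)])
```

Decaying the temperature over episodes gives exactly the random-then-greedy schedule described above.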
In terms of RL algorithms, the Monte Carlo method can be characterized as follows.
Monte Carlo is well suited to problems where no world model is available. The simulation is sufficient to gather statistics. It can take some time to go through enough episodes. Also, some additional memory is needed to store an episode while it is being processed. This makes the approach well suited as an offline process. It is also simple to use on local areas of the state space, so it can be combined elegantly with other approaches.
Temporal-Difference Learning
The notion of temporal difference is at the heart of this solution. Instead of waiting until the end of an episode to adjust the estimate, we use the estimate of the next state. From one point in time to the next, there may be a difference in the expected return. We can use this temporal difference (Dt) to update the estimate:

Dt = rt+1 + gV(st+1) - V(st)
V(st) = V(st) + aDt
The second equation expresses that the current estimate V(st) can be adjusted by the step Dt scaled by the constant a (alpha), the learning rate. The step Dt itself is the adjustment toward the target estimate: the reward rt+1 collected during the change of state, plus the discounted return estimate gV(st+1) of the next state, minus the current estimate.
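As a one-step sketch in Python (the states, reward, and constants here are invented; alpha and gamma stand for a and g above):

```python
def td_update(V, s, r_next, s_next, alpha=0.1, gamma=0.9):
    """V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    # the temporal difference: target estimate minus current estimate
    delta = r_next + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

V = {"A": 0.0, "B": 1.0}
delta = td_update(V, "A", 0.5, "B")
print(round(delta, 3))   # 0.5 + 0.9 * 1.0 - 0.0 = 1.4
print(round(V["A"], 3))  # moved 10% toward the target: 0.14
```

Note that the update happens as soon as the next state is known, with no need to wait for the episode to end.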
The advantage of this approach is that we can learn the policy online, not using an exhaustive dynamic programming approach. This also implies we don't need a world model. The expected rewards are integrated in the learning, and the transition probabilities can be optionally learned (but can be assumed, too).
There are two simple algorithms using temporal-difference learning. Sarsa is on-policy because the algorithm relies on the same policy for learning and exploration. On the other hand, Q-learning is off-policy because the chosen actions do not affect the learning. (Behavior and control policy can be considered separate.)
There's very little difference between the two algorithms. Both learn the action values for each state (that is, the Q values), and both use a backup one step deep. However, Sarsa uses a sample backup (only one Q value from the next state is used to update the estimate), whereas Q-learning uses a full backup (all the Q values from the next state are used to update the current estimate). We'll study the Q-learning algorithm because it is the most commonly used, offers better-quality learning, and does not rely on a specific exploration policy (see Listing 46.6).
Listing 46.6 The Q-Learning Algorithm, Which Improves the Estimates of the State/Action Values Using a Full Backup
alpha is a small constant used as the learning rate
g is the discount factor

while learning
    # choose and execute an action a in the current state s
    execute action a
    collect reward r
    determine new state s'
    best = -infinity
    # scan each of the actions from the new state
    for each a' in s'
        # remember the best only
        if Q(s',a') > best then best = Q(s',a')
    end for
    # update the value of the previous state/action pair
    Q(s,a) += alpha * (r + g * best - Q(s,a))
    # process the next state
    s = s'
end while
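Putting the loop into runnable form: a hedged Python sketch on an invented two-state corridor, using epsilon-greedy exploration as the behavior policy (any exploration scheme would do, since Q-learning is off-policy).

```python
import random

def q_learning(transitions, rewards, terminal, episodes=200,
               alpha=0.5, gamma=0.9, epsilon=0.1, seed=1):
    """Learn Q(s, a) from simulated experience; no world model needed."""
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}
    for _ in range(episodes):
        s = 0  # every episode starts at the entrance
        while s not in terminal:
            # epsilon-greedy behavior policy
            if rng.random() < epsilon:
                a = rng.choice(sorted(Q[s]))
            else:
                a = max(Q[s], key=Q[s].get)
            s_next, r = transitions[s][a], rewards[s][a]
            # full backup: the best Q value reachable from the next state
            best = max(Q[s_next].values()) if s_next in Q else 0.0
            Q[s][a] += alpha * (r + gamma * best - Q[s][a])
            s = s_next
    return Q

# Corridor 0 -> 1 -> exit (state 2); only the final step pays a reward.
transitions = {0: {"right": 1, "left": 0}, 1: {"right": 2, "left": 0}}
rewards = {0: {"right": 0.0, "left": 0.0}, 1: {"right": 1.0, "left": 0.0}}
Q = q_learning(transitions, rewards, terminal={2})
print(max(Q[1], key=Q[1].get))  # the greedy policy heads for the exit
```

After a couple of hundred episodes, the greedy action in both states is "right", even though the learner was never given the transition probabilities or expected rewards directly.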
It's possible to estimate the state values in a very similar fashion, although the algorithm would then rely on the transition probabilities from the world model. The learning may be quicker, because the Q values collapse into a single state value, but it is less precise.