Reinforcement Learning Algorithms

There are three major families of reinforcement learning algorithms, based respectively on mathematical models (dynamic programming), statistics (Monte Carlo), and incremental learning (temporal differences).

Dynamic Programming Techniques

Essentially, there are many brute-force approaches, based on dynamic programming (DP), for computing policies and state values. Despite being exhaustive, they prove surprisingly efficient. DP works in two phases: the first estimates the value of each state under the policy (policy evaluation), and the second improves the policy using those state values (policy improvement) [Sutton97]:

  • Policy evaluation essentially uses iteration to solve a large system of equations. These equations express each of the state values in terms of each other, using the recursive definition. The algorithm itself goes through all the states, and computes the equation from Listing 46.1 exactly as described.

  • Policy improvement uses the state values to find better actions. Intuitively, this is done by checking for actions that lead to an improved return. If the estimated action value is higher than the current state value, that action improves the policy, so we use it instead.

Taking a popular example with a 2D maze, the policy evaluation computes the distance of each square from the exit, and the policy improvement tries to change individual steps to reach the exit more quickly.

One thing to note is that DP solutions rely on knowledge of the transition probabilities P(s,a,s') and expected rewards R(s,a,s') (from state s to s' given action a). Let's look at two ways to combine these separate phases into algorithms for computing ideal policies.

Policy Iteration

This method works by alternating phases of policy evaluation and policy improvement. This is an iterative process—as the name suggests—which continues until a convergence criterion is met.

The initialization is done by assigning suitable estimates to states (for instance, 0) and randomly picking any policy. Then, phase 1 evaluates the policy by computing the most up-to-date state value (see Listing 46.2).

Listing 46.2 Knowing the World Model, Policy Iteration Can Be Used to Compute the Value of Each State (The loop terminates when the updates are small enough. The state value estimation function from Listing 46.1 is used.)
repeat
     max = 0
     for each state s
          old = V(s)
          # compute the new estimate for this state
          V(s) = state_value( s )
          # remember the magnitude of the change
          max = MAX(max, ABS(old - V(s)))
     end for
until max < threshold
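As a concrete sketch of this evaluation loop in Python, suppose the world model is a dict P mapping each state and action to a list of (probability, next state, reward) tuples. That representation, and the function names, are illustrative choices, not part of the original listing:

```python
def state_value(s, policy, V, P, discount=0.9):
    """Estimate V(s) under the policy, as in Listing 46.1."""
    a = policy[s]
    return sum(p * (r + discount * V[s2]) for p, s2, r in P[s][a])

def evaluate_policy(policy, V, P, discount=0.9, threshold=1e-6):
    """Sweep all states repeatedly until the largest update is small enough."""
    while True:
        max_delta = 0.0
        for s in P:
            old = V[s]
            # compute the new estimate for this state
            V[s] = state_value(s, policy, V, P, discount)
            max_delta = max(max_delta, abs(old - V[s]))
        if max_delta < threshold:
            return V
```

For example, in a two-state world where action "go" moves from A to an absorbing state B with reward 1, evaluation converges to V(A) = 1 and V(B) = 0.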

Then, to improve the policy, we check all the actions to see whether there is a better one. The algorithms in Listings 46.2 and 46.3 run within the same larger loop, interleaving computation until the policy is stable.

Listing 46.3 Policy Improvement Is Used to Scan Each State, Checking Whether There Is a Better Action to Be Taken
p is a deterministic policy suggesting one action per state
Q is the value for each state/action pair

stable = true
for each state s
     previous = p(s)
     best = -infinity
     # determine if any other action is better than the current
     for each action a
          value = Q(s,a)
          # if this action offers an improvement, store it
          if value > best then
               best = value
               p(s) = a
          end if
     end for
     # determine if this iteration was stable
     if previous != p(s) then stable = false
end for
if stable then halt

The policy iteration terminates when no changes are made to the policy. Note that throughout the iterations, only one array V is needed to store the state value estimates. Some implementations store the next iteration's estimates in a separate array, but this requires more memory and takes longer to converge, so updating a single array in place is often preferable.
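The improvement phase can be sketched in Python as follows, assuming a world model P given as a dict mapping each state and action to (probability, next state, reward) tuples (a hypothetical representation chosen for illustration):

```python
def action_value(s, a, V, P, discount=0.9):
    """Q(s,a): the expected return for taking action a in state s."""
    return sum(p * (r + discount * V[s2]) for p, s2, r in P[s][a])

def improve_policy(policy, V, P, discount=0.9):
    """Make the policy greedy with respect to V; report whether it was stable."""
    stable = True
    for s in P:
        previous = policy[s]
        best = float("-inf")
        # determine if any other action is better than the current
        for a in P[s]:
            value = action_value(s, a, V, P, discount)
            # if this action offers an improvement, store it
            if value > best:
                best = value
                policy[s] = a
        # determine if this iteration was stable
        if previous != policy[s]:
            stable = False
    return stable
```

Alternating calls to policy evaluation and this function until it returns True implements the full policy iteration loop.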

Value Iteration

Value iteration is a simpler algorithm because it does not interleave the two phases. Evaluation and improvement of the state value estimates are handled together at the same time. The policy is only computed after the value iteration has finished.

The pseudo-code in Listing 46.4 assumes that all the estimates of the state values in V are worse than the optimal value. (That is, V(s) <= V*(s).) This is relatively easy to ensure, and simplifies the code somewhat. If this is not possible, we need to compare the action values among each other, and assign the best to the state value (instead of comparing the action values to the current state value). After the iteration has finished, a policy improvement pass can generate the corresponding policy (which should be close to optimal).

Listing 46.4 Value Iteration Algorithms Improve and Estimate the State Values at the Same Time
repeat
     max = 0
     # process all the states to improve the estimates
     for each state s
          old = V(s)
          # check all the actions for possible improvements
          for each action a
               value = Q( s, a )
               # update the estimate if necessary
               if value > V(s) then V(s) = value
          end for
          # remember the magnitude of the adjustment
          max = MAX(max, ABS(old - V(s)))
     end for
until max < threshold

Intuitively, this corresponds to updating the value of squares in a 2D grid one by one, checking the neighbors for better paths.
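This sweep can be sketched in Python. The version below compares the action values among each other (the variant mentioned earlier), so it makes no assumption about the initial estimates; the dict-based world model P of (probability, next state, reward) tuples is again an illustrative choice:

```python
def value_iteration(V, P, discount=0.9, threshold=1e-6):
    """Evaluate and improve the state values at the same time."""
    while True:
        max_delta = 0.0
        for s in P:
            old = V[s]
            # back up the best action value directly into the state value
            V[s] = max(sum(p * (r + discount * V[s2]) for p, s2, r in P[s][a])
                       for a in P[s])
            max_delta = max(max_delta, abs(old - V[s]))
        if max_delta < threshold:
            return V
```

A greedy pass over the resulting values then yields the policy.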

Monte Carlo Approach

Monte Carlo (MC) methods attempt to learn the optimal policy using experience, unlike the DP approaches that rely on a model of the problem. To do this, MC methods take an episodic approach, where multiple state/action sequences are processed.


The Monte Carlo technique relies on episodes to find the optimal state values. Because the outcome is known for each episode, the discounted return for each state can be computed exactly. However, we want to be able to take into account the probabilistic nature of the problem. For this we need to run through many episodes, and collect statistics for each of the runs.

Monte Carlo methods learn the expected return using these statistics. In fact, the average of the discounted returns from each episode is an accurate estimate of the state value. The more examples we have, the more accurate the final estimate. The optimal policy can then be derived from the estimated returns. Figure 46.6 shows another grid world example.

Figure 46.6. A training episode in a simple 2D world. The Monte Carlo technique backs up the expected return from the end of the episode, right to the start.



The pseudo-code in Listing 46.5 assumes we know the transitions in the world model, so we only need to learn the state values. This is a simple case that is relatively common in games, and it is also quicker to learn. A more general variation could estimate the Q values for state/action pairs; this would just require replacing each occurrence of s with (s,a) and adapting the data structures accordingly.

The first stage of the algorithm is to generate an arbitrary episode given a policy. This returns a sequence of states, and the corresponding rewards encountered. The arrays containing the expected return are initialized to 0 (see Listing 46.5).

Listing 46.5 A Single Pass of the Monte Carlo Approach (The episode has already been generated using a variation of the policy p.)
h is the length of the episode
state is the array of states representing the episode
reward is the reward collected at each time step
expected is an array containing the sum of estimated returns
total is an array of integers counting the number of estimates

return = 0
# start at the end of the episode and traverse backwards
for t from h down to 1
     # determine the current state
     s = state[t]
     # accumulate the return and add the reward
     return = reward[t] + return * discount
     # collect statistics for the return of s
     expected[s] += return
     total[s] += 1
     # and store the current estimate
     V(s) = expected[s] / total[s]
end for

The main loop of the algorithm starts at the end of the episode and runs through each of the states. A running total of the return is kept, and the estimates of each of the states are updated.
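A minimal Python version of this backward pass might look like the following; the structures mirror Listing 46.5, but the dict-based layout and names are illustrative:

```python
def monte_carlo_pass(states, rewards, V, expected, total, discount=0.9):
    """Back up the discounted return from the end of one episode."""
    ret = 0.0
    # start at the end of the episode and traverse backwards
    for t in reversed(range(len(states))):
        s = states[t]
        # accumulate the return: this step's reward plus the discounted tail
        ret = rewards[t] + ret * discount
        # collect statistics for the return of s
        expected[s] += ret
        total[s] += 1
        # the state value is the average of the observed returns
        V[s] = expected[s] / total[s]
```

For example, after one episode A -> B with rewards [0, 1] and a discount of 0.9, the estimates become V(B) = 1 and V(A) = 0.9.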

Episode Generation

One question arises: How are the episodes generated? The key concept is to make sure that all the state/action pairs are traversed. This is important to ensure optimality. One way to do this is known as exploring starts. The idea is that we can generate episodes according to any policy, as long as the first state/action pair is chosen randomly (again, with a good random-number generator).

This is proven to work, but it can be a bit slow. A good policy is to use random exploration at first, and then decrease the randomness as the results start to improve. This is similar to controlling the temperature with the Boltzmann equation in simulated annealing (discussed in Chapter 17).
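One common way to implement this decreasing randomness is an epsilon-greedy choice with a decaying epsilon. A sketch, where the layout of the Q table as a dict keyed by (state, action) pairs is an assumption:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon explore randomly, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    # otherwise pick the action with the highest estimated value
    return max(actions, key=lambda a: Q[(s, a)])
```

Here epsilon might start near 1.0 and be multiplied by a decay factor (for instance 0.999) after each episode, mirroring the temperature schedule used in simulated annealing.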


In terms of RL algorithms, the Monte Carlo method can be classified as follows:

  • No bootstrapping— The estimates of the state value are computed using statistics over many runs, so we don't rely on estimated returns of neighboring states.

  • Sample backup— The algorithm does not traverse all the adjacent states. Instead, the randomized episodes imply one arbitrary path is chosen through the states.

  • Full-depth backup— MC must wait for the outcome of the episode to determine the return. This return actually includes all the states. When the state values are updated, they take into account the backup all the way from the end state.

Monte Carlo is well suited to problems where no world model is available. The simulation is sufficient to gather statistics. It can take some time to go through enough episodes. Also, some additional memory is needed to store an episode while it is being processed. This makes the approach well suited as an offline process. It is also simple to use on local areas of the state space, so it can be combined elegantly with other approaches.

Temporal-Difference Learning

Temporal-difference (TD) learning can be seen as a hybrid approach between DP and MC. It can learn from experience like Monte Carlo methods, while using bootstrapping to estimate the value of states and benefiting from backups to learn more quickly.


The notion of temporal difference is at the heart of the solution. Instead of waiting until the end of an episode to adjust the estimate, we use the estimate of the next state. From one point in time to the next, there may be a difference in the expected return. We can use this temporal difference Dt to update the estimate:

     Dt = r(t+1) + g * V(s(t+1)) - V(st)
     V(st) = V(st) + a * Dt

The second equation expresses that the current estimate V(st) can be adjusted by the step Dt scaled by the constant a (alpha). The step Dt itself is the adjustment toward the target estimate; that is, the discounted return estimate g * V(s(t+1)) of the next state plus the reward r(t+1) collected during the change of state.
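In Python, a single TD update might look like this (function and variable names are illustrative):

```python
def td_update(V, s, s_next, reward, alpha=0.1, discount=0.9):
    """One temporal-difference step: move V(s) toward reward + g * V(s')."""
    delta = reward + discount * V[s_next] - V[s]   # the temporal difference Dt
    V[s] += alpha * delta                          # scaled by the learning rate
    return delta
```

With V(A) = 0, V(B) = 1, a reward of 1 for the transition, alpha = 0.1, and g = 0.9, the step is Dt = 1 + 0.9 - 0 = 1.9, so V(A) becomes 0.19.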

The advantage of this approach is that we can learn the policy online, not using an exhaustive dynamic programming approach. This also implies we don't need a world model. The expected rewards are integrated in the learning, and the transition probabilities can be optionally learned (but can be assumed, too).


There are two simple algorithms using temporal-difference learning. Sarsa is on-policy because the algorithm relies on the same policy for learning and exploration. On the other hand, Q-learning is off-policy because the chosen actions do not affect the learning. (Behavior and control policy can be considered separate.)

There's very little difference between the two algorithms. Both learn the action values for each state (that is, the Q values). Both use a backup one level deep. However, Sarsa uses a sample backup (only one Q value is used to update the estimate), whereas Q-learning uses a full backup (all the Q values from the next state are used to update the current estimate). We'll study the Q-learning algorithm because it is the most commonly used, offers better-quality learning, and does not rely on a specific exploration policy (see Listing 46.6).

Listing 46.6 The Q-Learning Algorithm That Improves the Estimates of the State/Action Values Using a Full Backup
alpha is a small constant as the learning rate
g is the discount factor

while learning
     select action a for the current state s
     execute action a
     collect reward r
     determine new state s'
     best = -infinity
     # scan each of the actions from the new state
     for each a' in s'
          # remember the best only
          if Q(s',a') > best then
               best = Q(s',a')
          end if
     end for
     # update the value of the previous state/action pair
     Q(s,a) += alpha * (r + g * best - Q(s,a))
     # process the next state
     s = s'
end while

It's possible to estimate the state values V(s) in a very similar fashion, although the algorithm would rely on the transition probabilities from the world model. The learning may be quicker because the Q values collapse into one state value, but the result is less precise.
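One step of the update in Listing 46.6 can be sketched in Python; storing the Q table as a dict keyed by (state, action) pairs is an illustrative layout, not prescribed by the listing:

```python
def q_learning_step(Q, s, a, reward, s_next, actions, alpha=0.1, discount=0.9):
    """Back up the best action value of the new state into Q(s,a)."""
    # scan each of the actions from the new state, keeping the best only
    best = max(Q[(s_next, a2)] for a2 in actions)
    # update the value of the previous state/action pair
    Q[(s, a)] += alpha * (reward + discount * best - Q[(s, a)])
    return s_next  # the new state becomes the current one
```

The surrounding learning loop would select an action (for instance epsilon-greedily), execute it to observe the reward and the new state, and then call this function before moving on.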
