Because the learning occurs at a very high level, the resulting behaviors are acceptable regardless of their optimality. As long as the animats make consistent decisions, the tactics do not stand out (that is, no hesitating every other second). In the heat of the game, suboptimal moves often go unnoticed from the player's point of view. That said, some exploratory decisions are easily spotted by a trained eye or an attentive spectator.
On the whole, learning has a tremendous effect on performance: an animat that has undergone training will easily beat a randomly initialized one. This is particularly surprising because the animats mostly optimize according to their moods, although they do have an implicit notion of performance.
Developing the reinforcement model takes at least as much time as manually crafting behaviors with a subsumption architecture. This is the case even if the learning algorithm has already been implemented. Reinforcement theory is surprisingly simple in code (barely more than equations), and most of the engineer's time is spent adjusting the state/action model with expert features.
Surprisingly, the reward signals do not require much adjustment. A reward signal generally defines the desired behavior implicitly, and the learning algorithm finds a more explicit way to achieve it; the designer only needs to adjust the signal when the learned behavior turns out to be invalid during the application phase. In this particular case, however, the final behavior is not precisely defined as an "optimal" strategy. Instead, we expect the behavior to emerge from the way the emotions were defined. This more flexible attitude implies that less work is needed, but it also makes it much harder to reproduce existing strategies (for example, if the designer wants the animats to exhibit a particular tactic).
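As a minimal sketch of this idea, the reward could be derived from changes in the emotions themselves rather than from an explicit performance measure. The emotion names and weights below are illustrative assumptions, not the values used in the actual system.

```python
def emotional_reward(emotion_deltas):
    """Combine per-emotion changes into a scalar reward clamped to [-1, 1]."""
    # Illustrative weights: positive moods reward, negative moods punish.
    weights = {"satisfaction": 1.0, "frustration": -1.0, "boredom": -0.5}
    raw = sum(weights.get(name, 0.0) * delta
              for name, delta in emotion_deltas.items())
    return max(-1.0, min(1.0, raw))
```

Because the reward is only defined up to these weights, the behavior that emerges depends on how the emotions are modeled rather than on a hand-specified optimal strategy.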
Training offline is a good way to hide the learning phase from the human players. However, a minimal amount of validation is necessary to support online learning. Specifically, a good policy needs to be chosen that sacrifices exploration for the sake of exploitation. The policy can be mostly learned offline anyway, removing the need for exploratory moves that give the learning away.
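One common way to trade exploration for exploitation is an epsilon-greedy policy that becomes near-greedy once the animat goes online. The function and the two epsilon settings below are a hedged sketch, not the exact policy used here.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Exploit the best-known action, exploring with probability epsilon."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # exploratory move
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Assumed settings: explore freely offline, stay near-greedy online so
# that exploratory moves rarely give the learning away.
OFFLINE_EPSILON = 0.3
ONLINE_EPSILON = 0.02
```

Since the policy is mostly learned offline, the online epsilon can be kept very small without stalling the learning.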
Whether offline or online, the system is unlikely to get stuck in the same state doing the same actions. The emotions are driven independently and change over the course of time; when the animat repeats similar behaviors within a short period, boredom (among other emotions) increases. This means the current state for the reinforcement learning changes automatically, causing a different action to be chosen.
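This mechanism can be sketched as a simple emotion update in which repeating an action raises boredom and switching lets it decay. The names, rates, and threshold are assumptions for illustration only.

```python
def update_boredom(boredom, last_action, action, rise=0.1, decay=0.3):
    """Raise boredom when the action repeats; let it fall when it changes."""
    if action == last_action:
        return min(1.0, boredom + rise)
    return max(0.0, boredom - decay)

# Assumed cutoff: once boredom crosses this threshold, the discretized
# state changes, so the greedy action changes with it.
BORED_THRESHOLD = 0.7

def is_bored(boredom):
    return boredom >= BORED_THRESHOLD
```

Even a purely greedy policy thus ends up varying its actions, because the state it conditions on keeps drifting.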
Despite the various details that must be adjusted and tested, the learning system has a tremendous advantage over one designed manually: it can adapt to trends in the game. This produces varied gameplay and a more flexible AI that deals with the enemy better. Because of the way the problem is modeled, however, only general tactics can be learned, not the patterns of individual opponents. Reinforcement learning is statistical by nature, but it could deal with such a problem if the state were modeled comprehensively (that is, with a world model that includes individual patterns). This requires more memory, computation, and learning time, so a planning approach may be more appropriate for countering individual strategies.
From a technological point of view, Q-learning is really simple to implement because it requires no world model (matching the pseudo-code in Chapter 46). The Monte Carlo variation needs a little more code and memory to keep track of the states encountered during the fight. Any dynamic programming approach would take much more effort here, because the transitions and reward probabilities would need to be gathered beforehand.
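A tabular Q-learning update of the kind described needs no transition or reward model, only the (state, action, reward, next state) samples observed during play. The sketch below follows the standard one-step update rule; the class name and default parameters are illustrative rather than taken from the chapter's pseudo-code.

```python
from collections import defaultdict

class QLearner:
    """Tabular, model-free Q-learning: learns from (s, a, r, s') samples
    alone, with no transition or reward probabilities gathered beforehand."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.9):
        self.q = defaultdict(lambda: [0.0] * n_actions)  # state -> action values
        self.alpha = alpha   # learning rate
        self.gamma = gamma   # discount factor

    def update(self, state, action, reward, next_state):
        # Nudge Q(s, a) toward the target r + gamma * max_a' Q(s', a').
        best_next = max(self.q[next_state])
        td_error = reward + self.gamma * best_next - self.q[state][action]
        self.q[state][action] += self.alpha * td_error
```

A Monte Carlo variant would instead buffer the states visited during a fight and update them all from the episode's outcome, which is where the extra code and memory come in.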
The system benefits from being decomposed by capability, because learning is faster when dealing with compact representations. There is some redundancy between the movement and shooting capabilities because the states are the same, but keeping them separate allows them to be updated at different frequencies.