The implementation is covered in the next section. It supports both categorical and continuous variables (that is, bool, int, and float). Any number of predictor variables is possible (fixed at initialization), but only one response variable.
The main purpose of the initialization is to describe the kind of data the DT has to deal with. The module requires the type of each variable it should expect. The range of continuous variables, along with a suggested discretization step, also helps the underlying algorithm:
<predictor name="health" type="integer" min="0" max="100" />
<predictor name="distance" type="real" min="0" step="36" />
<predictor name="visible" type="boolean" />
<response name="fitness" type="real" min="0" max="1" step="0.1" />
This information allows the DT to prepare its internal representation as much as possible before the data is provided by a runtime interface.
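As a sketch, the initialization data above might be captured internally by a pair of simple structures. All type and member names here are assumptions for illustration, not part of the module's actual API; `std::optional` marks the hints (range, step) that a variable declaration may omit.

```cpp
#include <optional>
#include <string>
#include <vector>

// Kinds of variables the DT supports: bool, int, and float.
enum class VarType { Boolean, Integer, Real };

// Hypothetical internal record for one <predictor> or <response> entry.
struct VariableSpec
{
    std::string name;            // e.g. "health"
    VarType type;                // e.g. VarType::Integer
    std::optional<float> min;    // range hints for continuous variables
    std::optional<float> max;
    std::optional<float> step;   // suggested discretization
};

// Hypothetical description of the whole tree's data layout.
struct TreeSpec
{
    std::vector<VariableSpec> predictors;  // any number of predictors
    VariableSpec response;                 // exactly one response
};
```

With such a description in hand, the DT can size its internal tables before any runtime data arrives.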
The DT itself will undoubtedly need to be saved to disk, and then reloaded later. Because DTs have a relatively intuitive representation, we may want to modify it manually—which requires explaining the storage format. The structure of DTs is well-suited to be represented in XML, because hierarchies of nodes are simple to express. The basic outline of a DT is shown in Listing 27.1. The DT is built up by a recursive definition of all the nodes. The decision tag specifies the criteria for choosing the branch, and the match tag explains the required result of the evaluation for this node to be chosen.
<Node>
    <decision> ... </decision>
    <branch>
        <match> ... </match>
        <Node> ... </Node>
    </branch>
</Node>
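To make the outline concrete, here is one way a small tree over the health/distance/visible predictors might look when filled in. The content of the decision and match tags is deliberately left open by the format, so the expressions below (a predicate on one predictor, with leaf nodes holding response values) are just one possible convention, not the prescribed one:

<Node>
    <decision> visible </decision>
    <branch>
        <match> true </match>
        <Node>
            <decision> health &gt; 50 </decision>
            <branch>
                <match> true </match>
                <Node> 0.9 </Node>
            </branch>
            <branch>
                <match> false </match>
                <Node> 0.2 </Node>
            </branch>
        </Node>
    </branch>
    <branch>
        <match> false </match>
        <Node> 0.2 </Node>
    </branch>
</Node>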
However, defining the decisions and branches requires fundamental assumptions about the nature of the implementation. For example, is there a limit to the number of attributes? Are there restrictions on possible branches? Does the implementation use other pattern-recognition techniques to perform individual decisions?
Supporting different types of predictor variables makes the interface tricky to specify. Ideally, one generic interface would handle all the possible types, but this requires some form of polymorphism. To specify the interface in a flexible fashion, we assume a new data type called any is available:
any Predict(const vector< any >& inputs);
This function is the main entry point of the DT module. It takes a set of values (one for each of the predictor variables), and produces a response on the output. The data types are agreed upon during initialization. The any type needs to be capable of representing different types, such as int, bool, and float. To do this, we can use a C union, void pointers, or inheritance. A safe existing approach is to use the boost::any facility, which also handles conversions in an elegant fashion. The other solutions could be used interchangeably without modifying the code, as long as assignment and type casting operators are available.
For the learning, two different interfaces are defined for batch and incremental learning. As with the perceptrons, different implementations require varying amounts of persistent memory, so keeping the interfaces separate reduces memory consumption. The developer then pays only the memory costs of the interfaces chosen (rather than the same overhead upfront):
float Increment(const Sample& sample);
float Batch(const vector< Sample >& samples);
The two functions return the floating-point error for learning the training samples provided. A Sample is a simple data structure that stores a vector of any values for the predictors and the desired any response value.
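A minimal sketch of the Sample structure and the two learning entry points follows. To keep it runnable, the "tree" here is deliberately degenerate: a single leaf that learns the running mean of a float response, so the errors returned are squared errors against that mean. A real implementation would grow and split nodes instead, and the class and member names beyond Increment and Batch are assumptions. C++17 std::any again stands in for boost::any:

```cpp
#include <any>
#include <vector>

// Training sample: predictor values plus the desired response.
struct Sample
{
    std::vector<std::any> predictors;  // one value per predictor
    std::any response;                 // desired response value
};

class DecisionTree
{
public:
    // Learn from one sample; returns the squared error of the current
    // model on that sample before it is absorbed.
    float Increment(const Sample& sample)
    {
        float target = std::any_cast<float>(sample.response);
        float error = (target - m_mean) * (target - m_mean);
        ++m_count;
        m_mean += (target - m_mean) / m_count;  // running mean update
        return error;
    }

    // Learn from a whole training set; returns the mean error.
    float Batch(const std::vector<Sample>& samples)
    {
        float total = 0.0f;
        for (const Sample& s : samples)
            total += Increment(s);
        return samples.empty() ? 0.0f : total / samples.size();
    }

private:
    float m_mean = 0.0f;  // the single leaf's learned response
    int m_count = 0;      // number of samples absorbed
};
```

Implementing Batch in terms of Increment is only one option; a batch learner would normally use the whole training set at once to pick better splits, at the cost of the extra persistent memory discussed above.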