The process of computing classification and regression trees can be characterized as involving four basic steps:

Specifying the criteria for predictive accuracy

Selecting splits

Determining when to stop splitting

Selecting the "right-sized" tree.

These steps are very similar to those discussed in the context of the of the Classification Trees Analysis facilities (see in particular the Introductory Overview and Computational Methods; see also Breiman et al., 1984, for more details). See also, Computational Formulas.

Specifying the Criteria for Predictive Accuracy

The classification and regression trees (C&RT) algorithms are generally aimed at achieving the best possible predictive accuracy. Operationally, the most accurate prediction is defined as the prediction with the minimum costs. The notion of costs was developed as a way to generalize, to a broader range of prediction situations, the idea that the best prediction has the lowest misclassification rate. In most applications, the cost is measured in terms of proportion of misclassified cases, or variance. In this context, it follows, therefore, that a prediction would be considered best if it has the lowest misclassification rate or the smallest variance. The need for minimizing costs, rather than just the proportion of misclassified cases, arises when some predictions that fail are more catastrophic than others, or when some predictions that fail occur more frequently than others.

Priors. In the case of a categorical
response (classification problem), minimizing costs amounts to minimizing
the proportion of misclassified cases when priors are taken to be proportional
to the class sizes and when misclassification costs are taken to be equal
for every class.

The a priori probabilities used in minimizing costs can greatly affect the classification of cases or objects. Therefore, care has to be taken while using the priors. If differential base rates are not of interest for the study, or if one knows that there are about an equal number of cases in each class, then one would use equal priors. If the differential base rates are reflected in the class sizes (as they would be, if the sample is a probability sample), then one would use priors estimated by the class proportions of the sample. Finally, if you have specific knowledge about the base rates (for example, based on previous research), then one would specify priors in accordance with that knowledge The general point is that the relative size of the priors assigned to each class can be used to "adjust" the importance of misclassifications for each class. However, no priors are required when one is building a regression tree.

Misclassification costs. Sometimes
more accurate classification of the response is desired for some classes
than others for reasons not related to the relative class sizes. If the
criterion for predictive accuracy is Misclassification costs, then minimizing
costs would amount to minimizing the proportion of misclassified cases
when priors are considered proportional to the class sizes and misclassification
costs are taken to be equal for every class.

Case weights. In STATISTICA General Classification and Regression Trees, case weights are treated strictly as case multipliers. For example, the misclassification rates from an analysis of an aggregated data set using case weights will be identical to the misclassification rates from the same analysis where the cases are replicated the specified number of times in the data file.

However, note that the use of case weights for aggregated data sets
in classification problems is related to the issue of minimizing costs.
Interestingly, as an alternative to using case weights for aggregated
data sets, one could specify appropriate priors and/or misclassification
costs and produce the same results while avoiding the additional processing
required to analyze multiple cases with the same values for all variables.
Suppose that in an aggregated data set with two classes having an equal
number of cases, there are case weights of 2 for all cases in the first
class, and case weights of 3 for all cases in the second class. If you
specified priors of .4 and .6, respectively, specified equal misclassification
costs, and analyzed the data without case weights, you will get the same
misclassification rates as you would get if you specified priors estimated
by the class sizes, specified equal misclassification costs, and analyzed
the aggregated data set using the case weights. You would also get the
same misclassification rates if you specified priors to be equal, specified
the costs of misclassifying class 1 cases as class 2 cases to be 2/3 of
the costs of misclassifying class 2 cases as class 1 cases, and analyzed
the data without case weights.

Selecting Splits

The second basic step in classification and regression trees is to select the splits on the predictor variables that are used to predict membership in classes of the categorical dependent variables, or to predict values of the continuous dependent (response) variable. In general terms, the program will find at each node the split that will generate the greatest improvement in predictive accuracy. This is usually measured with some type of node impurity measure, which provides an indication of the relative homogeneity (the inverse of impurity) of cases in the terminal nodes. If all cases in each terminal node show identical values, then node impurity is minimal, homogeneity is maximal, and prediction is perfect (at least for the cases used in the computations; predictive validity for new cases is of course a different matter...).

For classification problems, STATISTICA GC&RT gives the user the choice of several impurity measures: The Gini index, Chi-square, or G-square. The Gini index of node impurity is the measure most commonly chosen for classification-type problems. As an impurity measure, it reaches a value of zero when only one class is present at a node. With priors estimated from class sizes and equal misclassification costs, the Gini measure is computed as the sum of products of all pairs of class proportions for classes present at the node; it reaches its maximum value when class sizes at the node are equal; the Gini index is equal to zero if all cases in a node belong to the same class. The Chi-square measure is similar to the standard Chi-square value computed for the expected and observed classifications (with priors adjusted for misclassification cost), and the G-square measure is similar to the maximum-likelihood Chi-square (as for example computed in the Log-Linear module). For regression-type problems, the program automatically uses a least-squares deviation criterion (similar to what is computed in least squares regression). Computational Formulas provides further computational details.

Determining When to Stop Splitting

As discussed in Basic Ideas Part II, in principle, splitting could continue until all cases are perfectly classified or predicted. However, this wouldn't make much sense since one would likely end up with a tree structure that is as complex and "tedious" as the original data file (with many nodes possibly containing single observations), and that would most likely not be very useful or accurate for predicting new observations. What is required is some reasonable stopping rule. In the GC&RT module, two options are available that can be used to keep a check on the splitting process; namely Minimum n and Fraction of objects.

Minimum n. One way to control
splitting is to allow splitting to continue until all terminal nodes are
pure or contain no more than a specified minimum number of cases or objects.
In the GC&RT module this
is done by using the option Minimum
n that allows you to specify the desired minimum number of cases
as a check on the splitting process. This option can be used when Prune on misclassification error, Prune on deviance, or Prune
on variance is active as the Stopping
rule for the analysis.

Fraction of objects. Another way to control splitting is to allow splitting to continue until all terminal nodes are pure or contain no more cases than a specified minimum fraction of the sizes of one or more classes (in the case of classification problems, or all cases in regression problems). This option can be used when FACT-style direct stopping has been selected as the Stopping rule for the analysis. In the GC&RT module, the desired minimum fraction can be specified as the Fraction of objects. For classification problems, if the priors used in the analysis are equal and class sizes are equal as well, then splitting will stop when all terminal nodes containing more than one class have no more cases than the specified fraction of the class sizes for one or more classes. Alternatively, if the priors used in the analysis are not equal, splitting will stop when all terminal nodes containing more than one class have no more cases than the specified fraction for one or more classes. See Loh and Vanichestakul, 1988 for details.

Pruning and Selecting the "Right-Sized" Tree

The size of a tree in the classification and regression trees analysis
is an important issue, since an unreasonably big tree can only make the
interpretation of results more difficult. Some generalizations can be
offered about what constitutes the "right-sized" tree. It should
be sufficiently complex to account for the known facts, but at the same
time it should be as simple as possible. It should exploit information
that increases predictive accuracy and ignore information that does not.
It should, if possible, lead to greater understanding of the phenomena
it describes. The options available in the GC&RT
module allow the use of either, or both, of two different strategies for
selecting the "right-sized" tree from among all the possible
trees. One strategy is to grow the tree to just the right size, where
the right size is determined by the user, based on the knowledge from
previous research, diagnostic information from previous analyses, or even
intuition. The other strategy is to use a set of well-documented, structured
procedures developed by Breiman et al. (1984) for selecting the "right-sized"
tree. These procedures are not foolproof, as Breiman et al. (1984) readily
acknowledge, but at least they take subjective judgment out of the process
of selecting the "right-sized" tree.

FACT-style direct stopping.
We will begin by describing the first strategy, in which the user specifies
the size to grow the tree. This strategy is followed by selecting FACT-style
direct stopping as the stopping rule for the analysis, and by specifying
the Fraction of objects which allows the tree to grow to the desired size.
The GC&RT module provides several options for obtaining diagnostic
information to determine the reasonableness of the choice of size for
the tree. Specifically, three options are available for performing cross-validation
of the selected tree; namely Test sample, V-fold, and Minimal cost-complexity.

Test sample cross-validation.
The first, and most preferred type of cross-validation is the test sample
cross-validation.

In the GC&RT module, test sample cross-validation is performed by specifying a sample identifier variable which contains codes for identifying the sample (learning or test) to which each case or object belongs. See the description of the Validation tab of the Quick specs dialog for additional details.

V-fold cross-validation. The
second type of cross-validation available in the GC&RT
module is V-fold cross-validation. This type of cross-validation is useful
when no test sample is available and the learning sample is too small
to have the test sample taken from it. The user-specified 'v' value for
v-fold cross-validation (its default value is 3) determines the number
of random subsamples, as equal in size as possible, that are formed from
the learning sample. A tree of the specified size is computed 'v'

Minimal cost-complexity cross-validation pruning. In the GC&RT module, minimal cost-complexity cross-validation pruning is performed, if Prune on misclassification error has been selected as the Stopping rule. On the other hand, if Prune on deviance has been selected as the Stopping rule, then minimal deviance-complexity cross-validation pruning is performed. The only difference in the two options is the measure of prediction error that is used. Prune on misclassification error uses the costs that equals the misclassification rate when priors are estimated and misclassification costs are equal, while Prune on deviance uses a measure, based on maximum-likelihood principles, called the deviance (see Ripley, 1996). For details about the algorithms used in the GC&RT module to implement Minimal cost-complexity cross-validation pruning, see also the Introductory Overview and Computational Methods sections of the Classification Trees Analysis module.

The sequence of trees obtained by this algorithm have a number of interesting
properties. They are nested, because the successively pruned trees contain
all the nodes of the next smaller tree in the sequence. Initially, many
nodes are often pruned going from one tree to the next smaller tree in
the sequence, but fewer nodes tend to be pruned as the root node is approached.
The sequence of largest trees is also optimally pruned, because for every
size of tree in the sequence, there is no other tree of the same size
with lower costs. Proofs and/or explanations of these properties can be
found in Breiman et al. (1984).

Tree selection after pruning.
The pruning, as discussed above, often results in a sequence of optimally
pruned trees. So the next task is to use an appropriate criterion to select
the "right-sized" tree from this set of optimal trees. A natural
criterion would be the CV costs (cross-validation costs). While there
is nothing wrong with choosing the tree with the minimum CV costs as the
"right-sized" tree, oftentimes there will be several trees with
CV costs close to the minimum. Following

As can be been seen, minimal cost-complexity cross-validation pruning and subsequent "right-sized" tree selection is a truly "automatic" process. The algorithms make all the decisions leading to the selection of the "right-sized" tree, except for, perhaps, specification of a value for the SE rule. Regardless of how the tree was constructed or pruned, STATISTICA GC&RT includes an option on the results dialog for v-fold cross-validation (on the Summary tab of the Results dialog) for computing the cross-validation cost for each tree in the final tree sequence (if it hasn't been computed already, i.e., selected from the Validation tab of the GC&RT Quick specs dialog). This option allows you to evaluate how well each tree "performs" when repeatedly cross-validated in different samples randomly drawn from the data. Note that you can also review the complete results for every single tree in the final tree sequence, to evaluate in detail how the final tree chosen by the program compares to others that perhaps produce similar misclassification costs at higher complexity (i.e., with more terminal nodes).

Note: Missing data. Missing
data in predictor variables are handled differently in the GC&RT
module, as compared to