General Classification and Regression Trees
Introductory Overview - Basic Ideas
Overview
The STATISTICA General Classification and Regression Trees (GC&RT) module builds classification and regression trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification).
The program supports the classic C&RT algorithm popularized by Breiman
et al. (Breiman, Friedman, Olshen, & Stone, 1984; see also Ripley,
1996), and includes various methods for pruning, as well as powerful v-fold cross-validation. In addition,
with this program you can specify ANCOVA-like experimental designs (see
MANOVA
and GLM)
with continuous and categorical factor effects and interactions; i.e., the computations can be based on design matrices for the predictor variables.
A general introduction to tree classifiers, specifically to the QUEST
(Quick, Unbiased, Efficient Statistical Trees) algorithm, is also presented
in the context of the Classification
Trees Analysis facilities, and much of the following discussion
presents the same information, in only a slightly different context. Another,
similar type of tree building algorithm is CHAID (Chi-square Automatic
Interaction Detector; see Kass, 1980); a full implementation of this algorithm
is available in the General
CHAID Models module of STATISTICA.
Classification and Regression
Problems
STATISTICA contains numerous algorithms for predicting continuous variables
or categorical variables from a set of continuous predictors and/or categorical
factor effects. For example, in GLM (General Linear Models) and GRM
(General Regression Models), you can specify a linear combination
(design) of continuous predictors and categorical
factor effects (e.g., with two-way and three-way interaction effects)
to predict a continuous dependent variable. In GDA (General Discriminant Analysis),
you can specify such designs for predicting categorical variables, i.e.,
to solve classification problems.
Regression-type
problems. Regression-type problems are generally those where one
attempts to predict the values of a continuous variable from one or more
continuous and/or categorical
predictor variables. For example, you may want to predict the selling prices of single-family homes (a continuous dependent variable) from various continuous predictors (e.g., square footage) as well as categorical predictors (e.g., style of home, such as ranch or two-story; the zip code or telephone area code where the property is located; note that these latter variables are categorical in nature, even though they contain numeric values or codes). If you used simple multiple regression, or some general linear model (GLM), to predict the selling prices of single-family homes, you would determine a linear equation for these variables that can be used to compute predicted
selling prices. STATISTICA contains
many different analytic procedures for fitting linear models (GLM, GRM,
Regression),
various types of nonlinear models (e.g., Generalized
Linear/Nonlinear Models (GLZ), Generalized
Additive Models (GAM), etc.), or completely custom-defined
nonlinear models (see Nonlinear
Estimation), where you can type in an arbitrary equation containing
parameters to be estimated by the program. The General CHAID Models
(GCHAID) module of STATISTICA
also analyzes regression-type problems, and produces results that are
similar (in nature) to those computed by GC&RT.
Note that various neural network architectures [available in STATISTICA Automated Neural Networks (SANN)] can also be applied to regression-type problems.
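To illustrate the general idea outside of STATISTICA, here is a minimal sketch that fits such a linear model with the open-source scikit-learn library; the home-sales figures and column names are hypothetical, and the dummy coding of the categorical predictors stands in for the design matrix described above:

    import pandas as pd
    from sklearn.compose import make_column_transformer
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    # Hypothetical home-sales data: square footage is continuous, while
    # style and zip code are categorical even though zip holds numeric codes.
    homes = pd.DataFrame({
        "sqft":  [1400, 2600, 1800, 2100, 1550, 3000],
        "style": ["ranch", "two-story", "ranch", "two-story", "ranch", "two-story"],
        "zip":   ["46032", "46032", "46077", "46077", "46032", "46077"],
        "price": [215000, 410000, 290000, 360000, 240000, 455000],
    })

    model = make_pipeline(
        make_column_transformer(
            (OneHotEncoder(), ["style", "zip"]),  # dummy-code categorical predictors
            remainder="passthrough",              # keep sqft as a continuous predictor
        ),
        LinearRegression(),
    )
    model.fit(homes[["sqft", "style", "zip"]], homes["price"])
    print(model.predict(homes.loc[[0], ["sqft", "style", "zip"]]))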
Classification-type
problems. Classification-type problems are generally those where
one attempts to predict values of a categorical dependent
variable (class, group membership, etc.) from one or more continuous and/or
categorical
predictor variables. For example, you may be interested in predicting
who will or will not graduate from college, or who will or will not renew
a subscription. These would be examples of simple binary classification
problems, where the categorical dependent variable can only assume two
distinct and mutually exclusive values. In other cases, one might be interested in predicting which of several alternative consumer products (e.g., makes of cars) a person decides to purchase, or which type of failure
occurs with different types of engines. In those cases there are multiple
categories or classes for the categorical dependent variable. STATISTICA contains a number of methods for analyzing classification-type problems and for computing predicted classifications, either from simple continuous
predictors (e.g., binomial or multinomial logit regression in GLZ),
from categorical predictors (e.g., Log-Linear
analysis of multi-way frequency tables), or both (e.g., via ANCOVA-like
designs in GLZ or GDA).
The General
CHAID Models module of STATISTICA
also analyzes classification-type problems, and produces results that
are similar (in nature) to those computed by GC&RT.
Note that various neural network architectures [available in STATISTICA Automated Neural Networks (SANN)] can also be applied to classification-type problems.
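As a minimal illustration of a binary classification problem (again using the open-source scikit-learn library rather than STATISTICA, with made-up data), a binomial logit model can produce both predicted classes and class membership probabilities:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Made-up subscription data: predict renewal (1) vs. non-renewal (0)
    # from months subscribed and number of logins in the last month.
    X = np.array([[24, 30], [3, 2], [18, 12], [1, 0], [36, 25], [6, 1]])
    y = np.array([1, 0, 1, 0, 1, 0])

    clf = LogisticRegression().fit(X, y)
    print(clf.predict([[12, 8]]))        # predicted class
    print(clf.predict_proba([[12, 8]]))  # class membership probabilities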
Classification and Regression
Trees (C&RT)
In the most general terms, the purpose of tree-building algorithms is to determine a set of if-then
logical (split) conditions that permit accurate prediction or classification
of cases.
Classification Trees
For example, consider the widely referenced Iris data classification
problem introduced by Fisher [1936; see also Discriminant
Function Analysis and General
Discriminant Analysis (GDA)]. The example data file Irisdat.sta
reports the lengths and widths of sepals and petals of three types of
irises (Setosa, Versicol, and Virginic). The purpose of the analysis is
to learn how one can discriminate between the three types of flowers,
based on the four measures of width and length of petals and sepals. Discriminant
function analysis will estimate several linear combinations of predictor
variables for computing classification scores (or probabilities) that
allow the user to determine the predicted classification for each observation
(see also the Discriminant
Function Analysis Example for a discussion of this type of analysis).
A classification tree, by contrast, will determine a set of logical if-then conditions (instead of linear equations) for predicting or classifying cases:

[Classification tree for the Iris data, splitting on petal width]
The interpretation of this tree is straightforward: If the petal width
is less than or equal to 0.8, the respective flower would be classified
as Setosa; if the petal width is greater than 0.8 and less than or equal
to 1.75, then the respective flower would be classified as Versicol; else,
it belongs to class Virginic.
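For readers who want to experiment outside of STATISTICA, a minimal sketch with scikit-learn's DecisionTreeClassifier (an open-source C&RT-style implementation) recovers essentially this structure from the Iris data; note that because two petal variables separate Setosa equally well, the root split may use petal length rather than petal width:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Fit a small CART-style classification tree to Fisher's Iris data.
    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(iris.data, iris.target)

    # Print the tree as a set of if-then split conditions.
    print(export_text(tree, feature_names=list(iris.feature_names)))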
Regression Trees
The general approach of deriving predictions from a few simple if-then conditions
can be applied to regression problems as well. For example, refer to Example
1 of the Multiple
Regression Analysis module (see also Example
2 for GC&RT). Example
1 is based on the data file Poverty.sta,
which contains 1960 and 1970 Census figures for a random selection of
30 counties. The research question (for that example) was to determine
the correlates of poverty, that is, the variables that best predict the
percent of families below the poverty line in a county. A reanalysis of
those data, using the regression tree analysis [and v-fold cross-validation
(see Introductory Overview - Basic Ideas
Part II) to determine the best tree; see also, Example
2], yields the following results:

[Regression tree for the Poverty.sta data]
Again, the interpretation of these results is rather straightforward:
Counties where the percent of households with a phone is greater than 72% generally have a lower poverty rate. The greatest poverty rate is
evident in those counties that show less than (or equal to) 72% of households
with a phone, and where the population change (from the 1960 census to
the 1970 census) is less than -8.3 (minus 8.3). These results are straightforward,
easily presented, and intuitively clear as well: There are some affluent
counties (where most households have a telephone), and those generally
have little poverty. Then there are counties that are generally less affluent,
and among those, the ones that shrank the most showed the greatest poverty rate.
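The same kind of analysis can be sketched outside of STATISTICA with scikit-learn's DecisionTreeRegressor. Since Poverty.sta ships with STATISTICA, the snippet below substitutes synthetic stand-in data; the cost-complexity pruning path combined with v-fold (here 5-fold) cross-validation follows the tree-selection idea of Breiman et al. (1984):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeRegressor

    # Synthetic stand-in for the Poverty.sta county data (30 cases).
    X, y = make_regression(n_samples=30, n_features=5, noise=10.0, random_state=0)

    # Grow a full tree, then walk its cost-complexity pruning path and use
    # v-fold cross-validation to pick the best-pruned tree.
    path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
    cv_scores = [
        cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                        X, y, cv=5).mean()
        for a in path.ccp_alphas
    ]
    best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
    best_tree = DecisionTreeRegressor(ccp_alpha=best_alpha,
                                      random_state=0).fit(X, y)
    print(best_tree.get_n_leaves(), "terminal nodes at alpha =", best_alpha)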
A quick review of the scatterplot of observed vs. predicted values shows
how the discrimination between the latter two groups is particularly well
"explained" by the tree model.

See Basic Ideas Part II.
Note that there are four different types of tree-building algorithms
available in STATISTICA: CHAID (Kass, 1980; see General
CHAID Introductory Overview), C&RT
(Breiman, Friedman, Olshen, and Stone, 1984; see General Classification and Regression Trees),
QUEST (Loh and Shih, 1997; see Classification
Trees Analysis), and Interactive
C&RT and CHAID Trees; see also CHAID,
C&RT, and QUEST for additional details. For additional discussions
of differences between different computational algorithms, see also the
Interactive
Trees (C&RT, CHAID) Introductory Overview and Missing
Data in GC&RT, GCHAID, and Interactive Trees.