Computational Formulas
In STATISTICA General
Classification and Regression Trees, estimates of accuracy are
computed by different formulas for categorical and continuous dependent
variables (classification and regression-type problems). For classification-type
problems (categorical dependent variable) accuracy is measured in terms
of the true classification rate of the classifier, while in the case of
regression (continuous dependent variable) accuracy is measured in terms
of mean squared error of the predictor.
In addition to measuring accuracy, the following measures of node impurity
are used for classification problems: The Gini measure, generalized Chi-square
measure, and generalized G-square measure. The Chi-square measure is similar
to the standard Chi-square value computed for the expected and observed
classifications (with priors adjusted for misclassification cost), and
the G-square measure is similar to the maximum-likelihood Chi-square (as
for example computed in the Log-Linear
module). The Gini measure is the one most often used for measuring purity
in the context of classification problems, and it is described below.
For continuous dependent variables (regression-type problems), the least
squared deviation (LSD) measure of impurity is automatically applied.
Estimation of Accuracy in
Classification
In classification problems (categorical dependent variable), three estimates
of accuracy are used: the resubstitution estimate, the test sample estimate,
and v-fold cross-validation. These estimates are defined here.
Resubstitution estimate. The resubstitution
estimate is the proportion of cases that are misclassified by the classifier
constructed from the entire sample. This estimate is computed in the following
manner:

$$R(d) = \frac{1}{N} \sum_{i=1}^{N} X\big(d(x_i) \neq j_i\big)$$

where X is the indicator function;
X = 1, if the statement $d(x_i) \neq j_i$ is true
X = 0, if the statement is false
and d(x) is the classifier.
The resubstitution estimate is computed using the same data as used
in constructing the classifier d
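As an illustration, the resubstitution estimate can be sketched in Python by reapplying the classifier to the same sample it was built from (the classifier `d` and the data below are hypothetical toy values, not STATISTICA's implementation):

```python
def resubstitution_error(classifier, xs, js):
    """Proportion of learning-sample cases misclassified by the
    classifier that was built from that same sample."""
    # X(d(x_i) != j_i) is 1 for a misclassified case, 0 otherwise.
    return sum(classifier(x) != j for x, j in zip(xs, js)) / len(xs)

# Hypothetical one-split classifier on a toy learning sample.
d = lambda x: "A" if x < 2.0 else "B"
xs = [1.0, 1.5, 2.5, 3.0]
js = ["A", "A", "B", "A"]   # the last case will be misclassified
print(resubstitution_error(d, xs, js))  # -> 0.25
```

Because the same cases are used for both training and scoring, this estimate tends to be optimistic, which is why the test sample and cross-validation estimates below exist.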
Test sample estimate. The total
number of cases is divided into two subsamples Z1
and Z2.
The test sample estimate is the proportion of cases in the subsample Z2 that are
misclassified by the classifier constructed from the subsample Z1. This estimate is
computed in the following way.
Let the learning sample Z
of size N
be partitioned into subsamples Z1 and Z2
of sizes N1
and N2,
respectively:

$$R^{ts}(d) = \frac{1}{N_2} \sum_{(x_i,\, j_i) \in Z_2} X\big(d(x_i) \neq j_i\big)$$

where Z2
is the subsample that is not used for constructing the classifier.
v-fold cross-validation. The
total number of cases is divided into v
subsamples Z1,
Z2,
..., Zv
of almost equal sizes. The v-fold cross-validation estimate is the proportion
of cases in each subsample Zv
that are misclassified by the classifier constructed from the remaining cases
Z - Zv.
This estimate is computed in the following way.
Let the learning sample Z
of size N
be partitioned into v
subsamples Z1,
Z2,
..., Zv
of almost equal sizes N1, N2,
..., Nv,
respectively:

$$R^{cv}(d) = \frac{1}{N} \sum_{v} \; \sum_{(x_i,\, j_i) \in Z_v} X\big(d_v(x_i) \neq j_i\big)$$

where $d_v$ is computed
from the subsample Z -
Zv
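The v-fold procedure can be sketched as follows (the fold assignment and the hypothetical `majority_class` builder are illustrative choices, not STATISTICA's internals):

```python
def cv_error(build_classifier, sample, v):
    """v-fold cross-validation estimate: each subsample Zv is scored by
    the classifier d_v built from the remaining cases Z - Zv, and the
    misclassifications are pooled over all folds."""
    n = len(sample)
    fold_of = [i % v for i in range(n)]   # v folds of almost equal size
    errors = 0
    for fold in range(v):
        train = [sample[i] for i in range(n) if fold_of[i] != fold]
        d = build_classifier(train)       # d_v from Z - Zv
        errors += sum(d(x) != j
                      for i, (x, j) in enumerate(sample) if fold_of[i] == fold)
    return errors / n

# Hypothetical builder: always predict the most frequent training class.
def majority_class(train):
    labels = [j for _, j in train]
    return lambda x: max(set(labels), key=labels.count)

sample = [(0.1, "A"), (0.2, "A"), (0.3, "A"),
          (0.4, "A"), (0.8, "B"), (0.9, "B")]
print(cv_error(majority_class, sample, v=2))  # one B case missed per fold
```

Every case is scored exactly once, by a classifier that never saw it during training, so the full sample still contributes to the error estimate.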
Estimation of Accuracy in
Regression
In regression problems (continuous dependent variable), three estimates
of accuracy are used: the resubstitution estimate, the test sample estimate,
and v-fold cross-validation. These estimates are defined here.
Resubstitution estimate. The
resubstitution estimate is the estimate of the expected squared error
using the predictor d(x) of the continuous dependent
variable. This estimate is computed in the following way:

$$R(d) = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - d(x_i)\big)^2$$

where the learning sample Z
consists of (xi, yi), i
= 1, 2, ..., N.
The resubstitution estimate is computed using the same data as used in
constructing the predictor d
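As a sketch, with a hypothetical constant predictor (the learning-sample mean) standing in for a fitted tree:

```python
def resubstitution_mse(predictor, xs, ys):
    """Mean squared error of the predictor over its own learning sample."""
    return sum((y - predictor(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical predictor: the mean response of the learning sample.
xs = [0, 1, 2, 3]
ys = [1.0, 2.0, 3.0, 6.0]
mean = sum(ys) / len(ys)          # 3.0
d = lambda x: mean
print(resubstitution_mse(d, xs, ys))  # -> 3.5
```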
Test sample estimate. The total
number of cases is divided into two subsamples Z1
and Z2.
The test sample estimate of the mean squared error is computed in the
following way:
Let the learning sample Z
of size N
be partitioned into subsamples Z1 and Z2
of sizes N1
and N2,
respectively:

$$R^{ts}(d) = \frac{1}{N_2} \sum_{(x_i,\, y_i) \in Z_2} \big(y_i - d(x_i)\big)^2$$

where Z2
is the subsample that is not used for constructing the predictor.
v-fold cross-validation. The
total number of cases is divided into v
subsamples Z1,
Z2,
..., Zv
of almost equal sizes. The subsample Z
- Zv
is used to construct the predictor $d_v$.
The v-fold cross-validation estimate is then computed
from the subsamples Zv
in the following way:
Let the learning sample Z
of size N
be partitioned into v
subsamples Z1,
Z2,
..., Zv
of almost equal sizes N1, N2,
..., Nv,
respectively:

$$R^{cv}(d) = \frac{1}{N} \sum_{v} \; \sum_{(x_i,\, y_i) \in Z_v} \big(y_i - d_v(x_i)\big)^2$$

where $d_v$ is computed
from the subsample Z - Zv
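The regression variant mirrors the classification case, with squared error replacing the misclassification indicator (the fold assignment and `mean_predictor` builder below are illustrative assumptions):

```python
def cv_mse(build_predictor, sample, v):
    """v-fold cross-validation MSE: each subsample Zv is scored by the
    predictor d_v built from Z - Zv; squared errors are pooled over folds."""
    n = len(sample)
    fold_of = [i % v for i in range(n)]   # v folds of almost equal size
    total = 0.0
    for fold in range(v):
        train = [sample[i] for i in range(n) if fold_of[i] != fold]
        d = build_predictor(train)        # d_v from Z - Zv
        total += sum((y - d(x)) ** 2
                     for i, (x, y) in enumerate(sample) if fold_of[i] == fold)
    return total / n

# Hypothetical builder: predict the mean response of the training cases.
def mean_predictor(train):
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

sample = [(0, 1.0), (1, 3.0), (2, 1.0), (3, 3.0)]
print(cv_mse(mean_predictor, sample, v=2))  # -> 4.0
```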
Estimation of Node Impurity:
Gini Measure
The Gini measure is the measure of impurity of a node and is commonly
used when the dependent variable is a categorical variable. It is defined as

$$g(t) = 1 - \sum_{j} p^2(j/t)$$

if costs of misclassification
or unequal prior probabilities are not specified, and

$$g(t) = \sum_{i \neq j} C(i/j)\, p(i/t)\, p(j/t)$$

if costs of misclassification or unequal prior
probabilities are specified,
where the sum extends over all k
categories, p(j / t) is the probability of
category j
at the node t,
and C(i / j) is the cost of misclassifying a category j
case as category i.
Note that the specification of equal or unequal prior probabilities
can greatly affect the accuracy of the final tree model for predicting
particular classes. For details, see Prior
Probabilities, the Gini Measure of Node Impurity, and Misclassification
Cost.
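The two forms of the Gini measure can be sketched from raw class counts as follows (a hypothetical `gini` helper; the prior-probability adjustment of p(j/t) mentioned above is omitted here for brevity):

```python
def gini(counts, cost=None):
    """Gini impurity of node t from class counts.
    Without costs: 1 - sum_j p(j/t)^2.
    With costs C(i/j): sum over i != j of C(i/j) * p(i/t) * p(j/t)."""
    n = sum(counts.values())
    p = {j: c / n for j, c in counts.items()}   # p(j/t), unadjusted for priors
    if cost is None:
        return 1.0 - sum(pj ** 2 for pj in p.values())
    return sum(cost[(i, j)] * p[i] * p[j]
               for i in p for j in p if i != j)

node = {"A": 2, "B": 2}                # a perfectly mixed two-class node
print(gini(node))                      # -> 0.5, the two-class maximum
unit = {("A", "B"): 1, ("B", "A"): 1}  # unit costs reproduce the cost-free value
print(gini(node, unit))                # -> 0.5
```

A pure node (all cases in one category) gives g(t) = 0, which is why splits are chosen to drive the measure toward zero in the child nodes.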
Estimation of Node
Impurity: Least-Squared Deviation
Least-squared deviation (LSD) is used as the measure of impurity of
a node when the response variable is continuous, and is computed as:
where Nw(t)
is the weighted number of cases in node t, wi
is the value of the weighting variable for case i,
fi
is the value of the frequency variable, yi
is the value of the response variable, and y(t)
is the weighted mean for node t.
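A minimal sketch of the LSD computation, assuming each case in the node is given as a (weight, frequency, response) triple and that Nw(t) is the sum of wi * fi over the node (an assumption about how the weighted count is formed):

```python
def lsd(cases):
    """Least-squared deviation impurity of a node.
    cases: list of (w, f, y) = (case weight, frequency, response)."""
    nw = sum(w * f for w, f, _ in cases)              # Nw(t), weighted case count
    ybar = sum(w * f * y for w, f, y in cases) / nw   # weighted node mean y(t)
    return sum(w * f * (y - ybar) ** 2 for w, f, y in cases) / nw

node = [(1.0, 1, 2.0), (1.0, 1, 4.0)]   # equal weights and frequencies
print(lsd(node))  # -> 1.0 (weighted variance about the node mean of 3.0)
```

With all weights and frequencies equal to 1, LSD reduces to the ordinary within-node variance of the response.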