This example illustrates the use of Random Forests for classification tasks (i.e., tasks that require assigning each data case, based on its predictor values, to one of the categorical levels of the dependent variable). Further examples of similar models, such as Boosted Trees, can be found in Example 1: Classification via Boosting Trees.
Data file. This example illustrates an analysis of the Boston housing data (Harrison & Rubinfeld, 1978) that was reported by Lim, Loh, and Shih (1997). Median prices of housing tracts are classified as LOW, MEDIUM, or HIGH on the dependent variable PRICE. There is one categorical predictor, CAT1, and 12 continuous predictors, ORD1 through ORD12. A duplicate of the learning sample is used as a test sample. The sample identifier variable is SAMPLE and contains LEARNING and TEST as text values. The complete data set, containing 1,012 cases, is available in the example data file Boston2.sta.
This data file can be accessed via the File
- Open Examples menu; it is in the Datasets
folder.
The purpose of this example is to demonstrate the use of the STATISTICA Random Forest module for classification-type analyses. Our task is to correctly identify the class label of each data case using the Random Forest model that we will build in this analysis. In other words, given a set of predictor values, we want to correctly categorize the price of a house in the Boston area as LOW, MEDIUM, or HIGH.
Specifying the Analysis
After opening the Boston2.sta data
file, select Random Forest for Classification
and Regression from the Data
Mining menu to display the Random
Forest Startup Panel.
In the Type of analysis list
on the Quick tab, select Classification Analysis, and click the OK
button to display the Random
Forest Specifications dialog where you can configure the options
for running the analysis.
On the Quick
tab, click the Variables
button to display the variable
selection dialog. Note that not all the variables are displayed in
all the variable type lists (dependent, categorical, continuous, and count).
This is because STATISTICA pre-screens
the list of variables so that you are prompted to select only from among
those that are appropriate for the respective analysis. This feature is
particularly useful when the number of variables in the data set is large.
However, you can switch off pre-screening by clearing the Show
appropriate variables only check box. See Select
Variables for further details.
Clear the Show appropriate variables
only check box, and select variable PRICE
as the Dependent variable, variable
CAT1 as a Categorical
pred variable, and variables ORD1-ORD12
as the Continuous pred variables.
Click the OK button to accept
these selections, close the variable selection dialog, and return to the
Random Forest Specifications dialog.
There are a number of additional options available on the Classification and Stopping condition tabs of this dialog that can be reconfigured to "fine-tune" the analysis.
Misclassification costs. Often, the cost of misclassification depends on the direction of the error: the cost of misclassifying category A as B may be substantially different from the cost of misclassifying B as A. Setting misclassification costs helps you account for such differences. For example, if you were
planning to buy a property, then perhaps misclassifying a HIGH
house price as LOW is more costly
than misclassifying a LOW house
price as HIGH. In the latter case, you would simply be over-estimating the property value. It should
be noted that assigning misclassification costs is subjective by nature
and may very well depend on the specific analysis.
By default, Random Forests assigns equal misclassification costs to all categories. To change this setting, select the Classification tab. Then, select the User spec. option button in the Misclassification costs group box, and click the adjacent button to display a user-defined input spreadsheet, which is used to adjust the cost values.
(Note that, for this option to be available, response codes must be assigned
via the Response codes option
on the Quick tab).
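For readers who want to experiment with the idea outside of STATISTICA, the sketch below (plain Python with NumPy; the class order and cost values are illustrative assumptions, not values from this example) shows how a cost matrix changes the decision rule: instead of picking the most probable class, you pick the class with the lowest expected cost.

import numpy as np

classes = ["LOW", "MEDIUM", "HIGH"]

# cost[i, j] = cost of predicting class j when the true class is i; here,
# misreading a HIGH-priced house as LOW (row 2, column 0) is penalized most.
cost = np.array([
    [0.0, 1.0, 1.0],   # true LOW
    [1.0, 0.0, 1.0],   # true MEDIUM
    [4.0, 1.0, 0.0],   # true HIGH
])

# Suppose a model returned these class-membership probabilities for one case.
proba = np.array([0.40, 0.35, 0.25])   # P(LOW), P(MEDIUM), P(HIGH)

# Expected cost of each possible prediction, averaged over the true classes.
expected_cost = proba @ cost
print(dict(zip(classes, expected_cost.round(2))))             # LOW: 1.35, MEDIUM: 0.65, HIGH: 0.75
print("decision:", classes[int(np.argmin(expected_cost))])    # MEDIUM

Note how the heavy HIGH-as-LOW penalty shifts the decision from LOW (the most probable class) to MEDIUM.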
Prior probabilities. On the Classification tab, you can also specify a priori probabilities of class memberships. Select the User
specified option button in the Prior
Probabilities group box, and click the adjacent button to display the Enter
values for the prior probabilities dialog. (Note that, for
this option to be available, response codes must be assigned via the Response codes option on the Quick tab).
Prior probabilities should reflect the degree of belief in the class membership of a data case (i.e., whether it belongs to category LOW, MEDIUM, or HIGH) before any analysis is performed. One way to set such probabilities is to use the percentage of each category in the data set. This is a reasonable approach if the data set is representative of the true population. If no such information is available, you can assign equal prior probabilities to all categories. Assigning equal priors amounts to saying "I don't know," which simply reflects your lack of knowledge of the percentage of house-price categories in the Boston area. Note that priors, like all probabilities, must sum to unity.
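As a quick illustration of the data-driven approach (a minimal sketch with made-up class counts, not the actual frequencies in Boston2.sta):

from collections import Counter

# Made-up class counts standing in for the frequencies in the learning sample.
labels = ["LOW"] * 250 + ["MEDIUM"] * 400 + ["HIGH"] * 350

counts = Counter(labels)
priors = {cls: n / len(labels) for cls, n in counts.items()}

print(priors)   # {'LOW': 0.25, 'MEDIUM': 0.4, 'HIGH': 0.35}
assert abs(sum(priors.values()) - 1.0) < 1e-9   # priors must sum to unity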
On the Advanced
tab, you can access options to control the number and complexity (number
of nodes) of the tree models you are about to create.
Sampling methods. By default, the Random Forest module partitions the data into training and testing samples by randomly selecting cases from the data set. While the training sample is used to build the model (add simple trees), the testing sample is used to validate its performance. This performance serves as the measure of goodness of the model, which for classification tasks is simply defined as the misclassification rate. By default, Random Forest selects 30% of the data set as test cases.
Instead of randomly partitioning the data set into training and test
cases, you can define your holdout (testing) sample via the Test
sample option, where you can identify a sample identifier code
to divide the data into training and testing sets. Selecting this sampling
method will override the random sampling option.
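The following sketch reproduces this sampling scheme in Python with scikit-learn, which offers an analogous (but not identical) random forest implementation; the synthetic data stands in for Boston2.sta:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Boston2.sta: 1,012 cases, 13 predictors, 3 classes.
X, y = make_classification(n_samples=1012, n_features=13, n_informative=6,
                           n_classes=3, random_state=0)

# Default scheme: hold out 30% of the cases at random for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Goodness of the model for classification: the misclassification rate.
print("test misclassification rate:", 1 - model.score(X_test, y_test))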
Number of predictor
variables. One of the advantages of the STATISTICA
Random Forest is the ability to perform predictions based on a
partial number (subset) of the predictor variables. This feature is particularly
attractive for data sets with an extremely large number of predictors.
In particular, you can specify the number of predictor variables you
want to include in your tree models. This option is an important one,
and care should be taken in setting its value. Including a large number
of predictors in the tree models can lead to prolonged computational time
and, thus, to missing one of the advantages of the Random
Forest model, which is the ability to perform predictions based
on a subset of the predictor variables. Alternatively, including too few predictor variables may degrade model performance (since this can exclude variables that account for most of the variability and trend in the data). In setting the number of predictor variables, it is recommended that you use the default value, which is computed from the total number of predictors (see Breiman for further details).
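In scikit-learn terms, this setting corresponds to the max_features parameter, whose classification default ("sqrt", the square root of the number of predictors) is a common Breiman-style heuristic; this is not necessarily STATISTICA's exact formula. A sketch reusing the synthetic setup from the sampling example:

# Reuses X_train, X_test, y_train, y_test from the sampling sketch above.
from sklearn.ensemble import RandomForestClassifier

# Try too few predictors per split, the 'sqrt' heuristic, and all of them.
for m in (2, "sqrt", None):
    rf = RandomForestClassifier(n_estimators=100, max_features=m,
                                random_state=0).fit(X_train, y_train)
    print(m, "-> misclassification rate:",
          round(1 - rf.score(X_test, y_test), 3))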
The options on the Stopping
condition tab provide you with a set of criteria for finalizing
your current Random Forest model.
By default, building a Random Forest
involves adding a fixed number of trees (100).
However, for longer training runs there are better ways to specify when
training should stop. You can do this on the Stopping condition tab.
Perhaps the most useful option is the Percentage decrease in training error: if the training error does not improve by at least the specified amount over a given number of cycles (the Cycles to calculate mean error), training stops.
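A minimal sketch of this type of stopping rule (the function and parameter names are mine, not STATISTICA's):

def should_stop(errors, cycles=5, min_pct_decrease=0.01):
    """Stop when the mean training error over the latest `cycles` trees
    fails to improve on the previous window by min_pct_decrease (1%)."""
    if len(errors) < 2 * cycles:
        return False                       # not enough history yet
    prev = sum(errors[-2 * cycles:-cycles]) / cycles
    curr = sum(errors[-cycles:]) / cycles
    return (prev - curr) < min_pct_decrease * prev

# Toy trace: the error has flattened out, so the rule fires.
trace = [0.188, 0.187, 0.187, 0.186, 0.186,
         0.186, 0.186, 0.185, 0.185, 0.185]
print(should_stop(trace))   # True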
Reviewing the Results.
For this example, leave all the options at their default values, and click
the OK button on the Random Forest Specifications dialog. The
Computing dialog will be displayed,
where you can watch the progress of the analysis as well as see how much
time has elapsed and how much is remaining.
Then, the Random Forest Results dialog will be displayed.
On the Quick tab, click the Summary button to review how the training and testing misclassification rates progressed over the training cycles.
This graph demonstrates the basic mechanism by which the Random Forest algorithm implemented in STATISTICA can avoid overfitting (see also the Introductory Overview and Technical Notes). As more and more simple trees are added to the model, the misclassification rate for the training data (from which the respective trees were estimated) will generally decrease. The same trend should be observed for the misclassification rate computed over the testing data. However, as more and more trees are added, the misclassification rate for the testing data will at some point begin to increase (while the misclassification rate for the training set keeps decreasing), clearly marking the point where evidence of overfitting begins to show.
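You can reproduce this kind of summary graph outside of STATISTICA by growing a forest incrementally and recording both error rates; the sketch below uses scikit-learn's warm_start mechanism and the synthetic setup from the sampling example:

# Reuses the synthetic split from the sampling sketch above.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(warm_start=True, random_state=0)
history = []
for n in range(10, 110, 10):
    rf.set_params(n_estimators=n)   # keep existing trees, grow 10 more
    rf.fit(X_train, y_train)
    history.append((n, 1 - rf.score(X_train, y_train),
                       1 - rf.score(X_test, y_test)))

for n, train_err, test_err in history:
    print(f"{n:4d} trees  train={train_err:.3f}  test={test_err:.3f}")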
By default, the program will stop adding trees even if the designated
number of trees you specified in the Number
of trees option on the Advanced
tab of the Random Forest Specifications dialog is
not reached. To turn off the stopping condition, simply clear the Enable advanced stopping condition
check box on the Stopping
condition tab of the Random Forest Specifications dialog.
In this case, the designated number of trees set in the Number of trees option will be added to the Random Forest model.
Accuracy of Prediction. Note
that you can generate predictions for any group of data cases of your
choice - training, testing, or the entire data set. Also, you can make
predictions for cases with partially missing predictor values, which is one of the capabilities of Random Forest
models (see the Introductory
Overview and Technical Notes for further details).
To produce predictions for the test sample, for example, return to the Results dialog. Select the Test set option
button in the Sample group box,
and then click the Predicted vs. observed
by classes button. The program will then display the spreadsheet
of predicted values and probability of class memberships. It will also
display a spreadsheet and a 3D histogram of the classification matrix,
together with a spreadsheet of the confusion matrix.
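The equivalent review in scikit-learn terms (assuming the model and test split from the sampling sketch above) takes just a few calls:

# Reuses `model` and the test split from the sampling sketch above.
from sklearn.metrics import confusion_matrix

predicted = model.predict(X_test)            # predicted class per test case
proba = model.predict_proba(X_test)          # class-membership probabilities

print(confusion_matrix(y_test, predicted))   # rows = observed, cols = predicted
print(proba[:5].round(3))                    # probabilities for the first five cases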
In addition, you may want to review the various additional summary statistics
(e.g., Risk estimates) and the
predictor importance (in the form of a histogram). The Predictor
importance graph contains the importance ranking on a 0-1 scale
for each predictor variable in the analysis. See Predictor
Importance in STATISTICA GC&RT,
Interactive Trees, and Boosted Trees.
Lift charts. Another way to
assess the accuracy of the predictions is to compute the lift chart for
each category of the dependent variable. Select the Classification
tab of the Results
dialog to compute gains or lift charts for different samples.
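The idea behind a lift chart can be sketched as follows (a simplified computation, not STATISTICA's exact routine; class 2 stands in for HIGH, and model, X_test, y_test come from the sampling sketch above):

import numpy as np

target = 2                                       # stand-in for the HIGH class
scores = model.predict_proba(X_test)[:, target]
order = np.argsort(scores)[::-1]                 # most confident cases first

top = order[: len(order) // 10]                  # top decile by predicted probability
hit_rate = np.mean(y_test[top] == target)        # observed rate among those cases
base_rate = np.mean(y_test == target)            # observed rate overall
print("lift in top decile:", round(hit_rate / base_rate, 2))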
Interpreting the Results.
In general, Random Forest (see the Introductory Overview and Technical Notes) is best considered a machine learning model, i.e., a "black box" that will usually produce very accurate predictions but yields models that are not easily interpretable (unlike, for example, linear models, where the final prediction model can usually be "put into words," i.e., explained). To interpret the results from the STATISTICA Random Forest module, there are two key tools:
With the bar graph and spreadsheet of the predictor importance, you can
usually distinguish the variables that make the major contributions to
the prediction of the dependent variable of interest. Click the Bargraph of predictor importance button
on the Quick
tab to display a bar graph that pictorially shows the importance ranking
on a 0-1 scale for each predictor variable considered in the analysis.
This plot can be used for visual inspection of the relative importance of the predictor variables used in the analysis and, thus, helps you identify the most important predictors. See also Predictor Importance in STATISTICA GC&RT, Interactive Trees, and Boosted Trees. In this case, variables ORD1, ORD5, and ORD12 stand out as the most important predictors.
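A comparable bar graph can be produced from a scikit-learn forest (a sketch with placeholder variable names; rescaling by the maximum puts the top predictor at 1.0 on the 0-1 scale):

# Reuses `model` from the sampling sketch above; VAR1..VAR13 are placeholders.
import matplotlib.pyplot as plt

importance = model.feature_importances_
ranking = importance / importance.max()      # rescale so the top predictor is 1.0

names = [f"VAR{i + 1}" for i in range(len(ranking))]
plt.bar(names, ranking)
plt.ylabel("Importance (0-1 scale)")
plt.title("Predictor importance")
plt.show()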
Final trees. You can also review
the final sequence of trees, either graphically or in a sequence of results
spreadsheets (one for each tree). However, this may not be a useful way
to examine the "meaning" of the final model when the final solution
involves a large number of trees.
Deploying the Model
for Prediction. Finally, you can deploy the model via the Code generator on the Results
dialog - Report
tab. In particular, you may want to save the PMML deployment code
for the created model, and then use that code via the Rapid
Deployment Engine module to predict (classify) new cases.
Adding more trees/amending
your model. Rather than continually creating new models, which
may be time consuming, you can amend your existing Random
Forest without full model re-building. Upon analyzing your results
you may find, for example, that your model is not strong enough (i.e.,
does not fit the data well). In this case, you may want to add more trees
by simply specifying the number of trees to add in the Number
of more trees option and then clicking the More trees button.
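The same amend-rather-than-rebuild idea can be sketched with scikit-learn's warm_start option (assuming X_train and y_train from the sampling sketch above): raising n_estimators and refitting keeps the existing trees and grows only the new ones.

# Reuses X_train and y_train from the sampling sketch above.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=0)
rf.fit(X_train, y_train)             # initial 100-tree forest

rf.set_params(n_estimators=150)      # "number of more trees" = 50
rf.fit(X_train, y_train)             # grows only the 50 additional trees
print(len(rf.estimators_))           # 150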