Example 4: Predictive Data Mining for Categorical Output Variable (Classification)
The purpose of this example is to illustrate the power and ease of use
of the STATISTICA Data Miner
projects for advanced predictive data mining (see also Crucial
Concepts in Data Mining).
Specifically, with the Data Mining
menu commands Data
Miner - General Classifier and Data
Miner - General Modeler and Multivariate Explorer, you can
display pre-"wired" STATISTICA
Data Miner projects with automatic deployment that include collections
of very advanced and powerful techniques for predictive data mining. These
methods can work in competition or in unison to produce averaged
or best predictions (see also meta-learning).
This example is based on the same example data file used to illustrate
visual data mining in Example 2.
This file contains information on the gender,
age, type of accommodation (class, i.e., first
class, second class, etc.)
and ultimate survival status
for the passengers of the ill-fated vessel.
To open the Advanced Comprehensive Classifiers Project:
Ribbon bar. Select the Data Mining
tab. In the Tools group, click
Workspaces, and from the General Classifier (Trees and Clusters)
submenu, select Advanced Comprehensive Classifiers Project.
Classic menus. From the Data
Mining menu, select Data Mining - Workspaces -
Data Miner - General Classifier (Trees and Clusters) -
Advanced Comprehensive Classifiers Project.
This will display the "pre-wired" GCAdvancedComprehensiveClassifiers.sdm
project, which consists of a single entry point (node) for connecting
the data and numerous nodes for fitting various models to those data.
The single connection point for the data is the Split
Input Data into Training and Testing Samples (Classification) node
in the upper-left corner of the workspace. Using random sampling (which
can be controlled in the Edit
Parameters dialog box for this node), this node will split
the sample of observed classifications (and predictors) into two samples:
one Training sample and one Testing sample (marked for deployment;
see option Data for deployed project;
do not re-estimate models in the Select
dependent variables and predictors dialog box topic).
The models will be fitted using the Training
sample and evaluated using the observations in the Testing
sample. By using observations that did not participate in the model fitting
computations, the goodness-of-fit statistics computed for predicted values
derived from the different fitted models can be used to evaluate the predictive
validity (accuracy) of each model, and hence can be used to compare models
and to choose one or more over others.
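The split-then-evaluate logic described above can be sketched in plain Python. This is an illustrative sketch only: the function names and the 25% test fraction are assumptions for the example, not STATISTICA's actual internals or defaults.

```python
import random

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Randomly split observations into Training and Testing samples,
    mirroring the idea of the Split Input Data node (fraction and seed
    here are illustrative assumptions)."""
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def misclassification_rate(model, testing):
    """Evaluate a fitted model on observations that did not participate
    in the model-fitting computations."""
    errors = sum(1 for x, y in testing if model(x) != y)
    return errors / len(testing)
```

Because the testing observations played no role in fitting, the rate returned by `misclassification_rate` estimates each model's predictive validity and can be compared across models.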
The node labeled Compute Best Predicted
Classification from all Models will automatically compute predictions
from all models, by either computing a voted prediction (see voting),
choosing the best prediction, or applying a combination of the two (see also meta-learning).
These predictions will be placed in a data spreadsheet that can be connected
to other nodes (e.g., graphs) to summarize the analysis.
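A minimal sketch of such a majority vote over the class labels predicted by several models (the function name is illustrative; STATISTICA performs this computation inside the node):

```python
from collections import Counter

def voted_prediction(predictions):
    """Combine the class labels predicted by several models for one
    observation into a single majority-vote prediction. Ties are broken
    in favor of the model consulted first (Counter preserves order)."""
    return Counter(predictions).most_common(1)[0][0]
```

For example, if three models predict "survived", "died", "survived" for a passenger, the voted prediction is "survived".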
In summary, the Advanced Comprehensive
Classifiers Project will apply a suite of very different methods
to the classification problem, and automatically generate the deployment
information necessary to classify new observations using one of those
methods or combinations of methods. The combinations of techniques available
in STATISTICA Data Miner are
in fact among the most powerful methods known to date for making predictions
or predictive classifications in very "difficult" environments,
where predictor variables are related in highly interactive and nonlinear ways.
Specifying the Analysis.
To analyze the predictors of survival for the Titanic maritime disaster,
click the Data Source button
in the STATISTICA
Workspace, and browse to
and select the data file Titanic
as the input data source.
Select variable survival as
the Dependent; categorical variable,
and variables class, age,
and gender as the Predictor; categorical variables.
Again, in a real-world application it is absolutely essential to first
perform some careful data checking, either interactively or by using the
nodes and options available in the Data
Cleaning and Filtering folder of the Node Browser (see
also Crucial Concepts in Data Mining),
to ensure that the data are "clean," i.e., do not contain
erroneous numbers, miscoded values, etc. We will skip this (usually very
important) step for this example, since we already know that the example
data file Titanic contains verified data.
Next, connect the data source to the Split
input data... node, and then click the Run
button to train the system.
A large number of documents (spreadsheets) will be produced; predicted
and observed values for the observations in the testing sample, for each
type of model, are placed into generated data sources labeled Testing...
for subsequent analyses. Detailed results statistics and graphs for each
analysis are placed into the workbook called Reporting
Documents. You could now double-click on the workbook to review
the results for each model.
After training the system, misclassification rates are automatically computed
for each node (method or model) from the testing data. This information
will be used by the Compute Best Prediction...
node, for example, to select the best classification method (model), or
to compute the voted response for the best two or three methods (see also
meta-learning).
You can review this information in the Global Dictionary, which acts as a project-wide
repository of information generated by the nodes marked ...with Deployment.
Ribbon bar. Select the Edit tab. In the Dictionary
group, click Edit.
Classic menus. From the Tools - Global
Dictionary submenu, select Edit Global Dictionary.
The Edit Global Dictionary Parameters
dialog box will be displayed, where you can review the information generated
by the nodes in this project.
Even if you followed along through this example step by step, the information
that you see in this dialog box may differ somewhat from that displayed
here, because the random split of the data into training and testing sets,
and other method-specific random selections applied to the analyses (e.g.,
in neural networks), may produce slightly different results each time.
The information displayed here shows for which nodes deployment information
currently exists, and the misclassification rates when each of the fitted
models is used to classify observations in the testing sample. Note that
the abbreviation Testing_method_number
combines the name of the input data source, an abbreviation for the
method used to generate the prediction, and a number identifying the
specific node (model) that generated the prediction (see also the description
of the Show
Node Identifiers option, available from the View
menu). You can see that the tree classifier (CHAID)
produced the lowest misclassification rate for the Testing sample.
The Goodness of Fit
for Multiple Inputs Node. The Goodness
of Fit for Multiple Inputs node is one way to evaluate the different
models. This tool will use the testing output spreadsheets from each model
building tool as input. Variable selections should be made for each of
these output spreadsheets.
Double-click the GDA output spreadsheet, Testing_PMML_GDA*,
generated by the General Discriminate
Analysis node to display the Select
dependent variables and predictors dialog box. Click the Variables button to display the variable
selection dialog box. The Dependent,
categorical variable should be observed survival,
variable 3. The Predictor, categorical
variable is the predicted Survival, variable 1. After making these selections,
click OK in the variable selection dialog box, and then select the Always use these selections, overriding any
selections the generating node may make check box. Click OK.
Then, repeat this process for the remaining testing output spreadsheets.
Highlight all of the modified testing output nodes so they will automatically
be connected to the Goodness of fit
node. Open the Node Browser.
Expand the Data Mining folder
and select the Goodness of Fit
folder. In the right pane, select Goodness of fit for Multiple inputs.
Click Insert into the workspace.
In the Workspace, double-click on that node to display the Edit Parameters
dialog box, and set the Variable type to Categorical.
Also, on the Categorical tab
of this dialog box, specify that the node compute the Percent
disagreement between the predicted and observed classifications
(select the True option button).
Click the OK button.
Then, run only the selected node:
Ribbon bar. Select the Edit tab. In the Run
group, click Selected Node.
Classic menus. On
the Run menu, click Run
to Selected Node.
This will update only this node.
Double-click on the Reporting Documents
workbook to see the results. The last output spreadsheet gives
the overall summary of all models and all tests. Percent disagreement
is lowest for the CHAID model
(row 2 of the summary output): 22.6158.
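The percent disagreement statistic reported here is simply the percentage of testing observations whose predicted class differs from the observed class. A minimal sketch (the function name is an illustrative assumption):

```python
def percent_disagreement(observed, predicted):
    """Percentage of testing observations whose predicted class differs
    from the observed class; lower values indicate a better classifier."""
    mismatches = sum(o != p for o, p in zip(observed, predicted))
    return 100.0 * mismatches / len(observed)
```

Applied to each model's testing output spreadsheet, this yields one comparable number per model, which is how the summary output ranks them.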
Browse through the results workbooks to the Exhaustive CHAID folder
to see the specific solution generated by this classifier.
If you follow this decision tree (see also Classification and Regression
Trees), you will see that women in first and second class were
predicted to have a much higher chance of survival, as were male children
in first and second class (there were no children among the crew). This
solution, which could have been expected, nevertheless demonstrates that
the program found a sensible model for predicting classifications.
Predicted Classifications. While deployment – predicting classifications
for new cases where observed values do not exist (yet) – isn't useful
in the present example, you could, nevertheless, now attach to the node
named Compute Best Prediction from All
Models a new data source that has missing data for the categorical
dependent variable (see also Example
3 for prediction of a continuous dependent variable from multiple
models). The program would then compute predicted classifications based
on a "vote" (which category gets the most predictions) made
by all models. This is the default method of combining different models
used by the Compute Best Prediction...
node; you can display the Edit
Parameters dialog box for that node to select one of the other
methods for combining predicted classifications as well.
The Best prediction and Vote of best
k predictions options would automatically identify (based on the
Testing sample misclassification
rates) which models were most accurate, and use those models to compute
a voted prediction (see also bagging,
voting, or Meta-learning).
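The Vote of best k predictions logic can be sketched as follows. All names here are hypothetical, and the real node reads the testing misclassification rates from the Global Dictionary rather than from a plain dictionary:

```python
from collections import Counter

def vote_of_best_k(model_predictions, error_rates, k=3):
    """Pick the k models with the lowest Testing-sample misclassification
    rates and return their majority-vote class for one new observation.
    model_predictions: model name -> predicted class for the observation.
    error_rates: model name -> testing misclassification rate."""
    best = sorted(error_rates, key=error_rates.get)[:k]
    votes = [model_predictions[m] for m in best]
    return Counter(votes).most_common(1)[0][0]
```

With k=1 this reduces to the Best prediction option (use only the single most accurate model); larger k trades a single model's accuracy for the stability of a committee.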
Deploying the Solution
to the "Field". To reiterate (see also Analysis
Nodes with Automatic Deployment), the deployment information is kept
along with the data miner project in a Global
Dictionary, which is a workspace-wide repository of parameters. This
means that you could now save this Data
Miner project under a different name, and then delete all analysis
nodes and related information except the Compute
Best Prediction from All Models node and the data source with new
observations (marked for deployment). You could now simply enter values
for the predictor variables, run this project (with the Compute
Best Prediction from All Models node only), and thus quickly compute
predicted classifications. Because STATISTICA
Data Miner, like all analyses in STATISTICA,
can be called from other applications, advanced applications could involve
calling this project from some other (e.g., data entry) application.
Ensuring that deployment information
is up to date. In general, the deployment information for the different
nodes that are named ...with Deployment
is stored in various forms locally along with each node, as well as globally,
"visible" to other nodes in the same project. This is an important
point to remember, because for Classification
and Discrimination (as well as Regression
Modeling and Multivariate Exploration), the node Compute
Prediction from All Models will compute predictions based on all
deployment information currently available in the global dictionary. Therefore,
when building models for deployment using these options, ensure that all
deployment information is up to date, i.e., based on models trained on
the most current set of data. You can also use the Clear
All Deployment Info nodes in the Data
Miner Workspace to programmatically clear out-of-date deployment
information every time the project is updated.
Predicting new observations, when observed
values are not (yet) available. When connecting data for deployment
(prediction or predicted classification) to the nodes for Classification
and Discrimination or Regression
Modeling and Multivariate Exploration, ensure that the "structure"
of the input file for deployment is the same as that used for building
the models (see also the option description for Data
for deployed project; do not re-estimate models in the Select
dependent variables and predictors dialog box topic). Specifically,
ensure that the same numbers and types of predictor variables are specified,
that a (continuous or categorical) dependent variable is specified (even
if all values for that variable are missing), and that the variable names
match those in the data file used to build the models (this is particularly
important for the deployment of neural networks, which will rely on this
information). Also, when using numeric variables with text values as categorical
predictors or dependent variables, ensure that consistent coding is used
throughout the Data Miner project.
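A quick consistency check of this kind can be sketched in Python (a hypothetical helper, not part of STATISTICA), flagging any category codes in the deployment data that never occurred in the data used to build the models:

```python
def unseen_codes(training_values, deployment_values):
    """Return category codes present in the deployment data but absent
    from the model-building data; non-empty output signals inconsistent
    coding (e.g., case differences or new text values)."""
    return sorted(set(deployment_values) - set(training_values))
```

Running such a check before deployment catches problems like "MALE" versus "male", which would otherwise be treated as a new, unknown category.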
For additional details, refer to Using
text variables or text values in data miner projects; for a detailed technical
discussion of this issue and the manner in which STATISTICA
Data Miner handles text variables and values, see Working
with Text Variables and Text Values: Ensuring Consistent Coding.
The purpose of this example is to show how easily a large number of the
most sophisticated methods for predictive data mining can be applied to
data, and how sophisticated ways of combining the power of these methods
for predicting new observations become automatically available. The techniques
provided in STATISTICA Data Miner
represent some of the most advanced techniques for predictive data mining.
See also Data Mining Definition, Data Mining with STATISTICA Data Miner, Structure
and User Interface of STATISTICA
Data Miner, STATISTICA
Data Miner Summary, and Getting
Started with STATISTICA Data Miner.