Example 4: Predictive Data Mining for Categorical Output Variable (Classification)

The purpose of this example is to illustrate the power and ease of use of the STATISTICA Data Miner projects for advanced predictive data mining (see also Crucial Concepts in Data Mining and Data Mining Tools).

Specifically, with the Data Mining menu commands Data Miner - General Classifier and Data Miner - General Modeler and Multivariate Explorer, you can display pre-"wired" STATISTICA Data Miner projects with automatic deployment that include collections of very advanced and powerful techniques for predictive data mining. These methods can work in competition or in unison to produce averaged, voted, or best predictions (see also meta-learning).

This example is based on the same example data file used to illustrate visual data mining in Example 2, Titanic.sta.

 

This file contains information on the gender, age, type of accommodation (class, i.e., first class, second class, etc.) and ultimate survival status for the passengers of the ill-fated vessel.

The Advanced Comprehensive Classifiers Project.

Ribbon bar. Select the Data Mining tab. In the Tools group, click Workspaces, and from the General Classifier (Trees and Clusters) submenu, select Advanced Comprehensive Classifiers Project.

Classic menus. From the Data Mining menu, select Data Mining - Workspaces, then Data Miner - General Classifier (Trees and Clusters), then Advanced Comprehensive Classifiers Project.

This will display the "pre-wired" GCAdvancedComprehensiveClassifiers.sdm project, which consists of a single entry point (node) for connecting the data and numerous nodes for fitting various models to those data.

The single connection point for the data is the Split Input Data into Training and Testing Samples (Classification) node in the upper-left corner of the workspace. Using random sampling (which can be controlled in the Edit Parameters dialog box for this node), this node will split the sample of observed classifications (and predictors) into two samples: one Training sample and one Testing sample (marked for deployment; see option Data for deployed project; do not re-estimate models in the Select dependent variables and predictors dialog box topic).

The models will be fitted using the Training sample and evaluated using the observations in the Testing sample. By using observations that did not participate in the model fitting computations, the goodness-of-fit statistics computed for predicted values derived from the different fitted models can be used to evaluate the predictive validity (accuracy) of each model, and hence can be used to compare models and to choose one or more over others.
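The idea of a random training/testing split can be sketched in a few lines of Python. This is only an illustration of the concept, not the code that the Split Input Data node runs; the function name split_train_test, the test_fraction parameter, and the value 0.25 are assumptions made for the example.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=None):
    """Randomly split rows into (training, testing) lists.

    test_fraction and seed are hypothetical parameters for this sketch;
    they are not the option names used in the Edit Parameters dialog box.
    """
    if not 0 <= test_fraction <= 1:
        raise ValueError("test_fraction must be between 0 and 1")
    rng = random.Random(seed)          # a fixed seed makes the split repeatable
    shuffled = list(rows)              # copy, so the input is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up stand-in for a data file: 20 cases identified only by number.
passengers = [{"id": i} for i in range(20)]
train, test = split_train_test(passengers, test_fraction=0.25, seed=1)
print(len(train), len(test))  # 15 5
```

Because the testing cases are never seen during model fitting, any accuracy measured on them reflects how the model behaves on new observations. Without a fixed seed, each run produces a different split.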

The node labeled Compute Best Predicted Classification from all Models will automatically compute predictions from all models, by either computing a voted prediction (see voting, bagging), choosing the best prediction, or a combination of the two (see also Meta-learning). These predictions will be placed in a data spreadsheet that can be connected to other nodes (e.g., graphs) to summarize the analysis.

In summary, the Advanced Comprehensive Classifiers Project will apply a suite of very different methods to the classification problem, and automatically generate the deployment information necessary to classify new observations using one of those methods or combinations of methods. The combinations of techniques available in STATISTICA Data Miner are among the most powerful methods known for making predictions or predictive classifications in very "difficult" environments, where predictor variables are related in highly interactive and nonlinear ways.

Specifying the Analysis. To analyze the predictors of survival for the Titanic maritime disaster, click the Data Source button in the STATISTICA Workspace, and browse to and select the data file Titanic as the input data source.

Select variable survival as the Dependent; categorical variable, and variables class, age, and gender as the Predictor; categorical variables.

Again, in a real-world application it is absolutely essential to first perform some careful data checking either interactively or by using the nodes and options available in the Data Cleaning and Filtering folder of the Node Browser (see also Crucial Concepts in Data Mining and Example 1) to ensure that the data are "clean," i.e., do not contain erroneous numbers, miscoded values, etc. We will skip this (usually very important) step for this example, since we already know that the example data file Titanic contains verified values.

Next, connect the data source to the Split input data... node, and then click the Run button to train the system.

A large number of documents (spreadsheets) will be produced; predicted and observed values for the observations in the testing sample, for each type of model, are placed into generated data sources labeled Testing... for subsequent analyses. Detailed results statistics and graphs for each analysis are placed into the workbook called Reporting Documents. You could now double-click on the workbook to review the results for each model.

Evaluating Models. After training the system, misclassification rates are automatically computed for each node (method or model) from the testing data. This information will be used by the Compute Best Prediction... node, for example, to select the best classification method (model), or to compute the voted response for the best two or three methods (see also Meta-learning). You can review this information in the Global Dictionary, which acts as a project-wide repository of information generated by the scripts (marked ...with Deployment).

Ribbon bar. Select the Edit tab. In the Dictionary group, click Edit.

Classic menus. From the Tools - Global Dictionary submenu, select Edit Global Dictionary.

The Edit Global Dictionary Parameters dialog box will be displayed, where you can review the information generated by the nodes in this project.

 

Even if you followed along through this example step by step, the information that you see in this dialog box may differ somewhat from that displayed here, because the random split of the data into training and testing sets, and other method-specific random selections applied to the analyses (e.g., in neural network), may produce slightly different results each time.

The information displayed here shows for which nodes deployment information currently exists, and the misclassification rates when each of the fitted models is used to classify observations in the testing sample. Note that the abbreviation Testing_method_number refers to the name of the input data source, the (abbreviation for the) method used to generate the prediction, and a number referring to the specific node (model) that generated the prediction (see also the description of the Show Node Identifiers option, available from the View menu). You can see that the predictions made by the tree classifier (CHAID) for the Testing sample resulted in the lowest misclassification rate.
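The misclassification rate reported for each model is the proportion of testing cases whose predicted category differs from the observed one. The sketch below shows that calculation and how the most accurate model could be picked from it. The model names and all values are made up for illustration; they are not taken from the Titanic results.

```python
def misclassification_rate(observed, predicted):
    """Proportion of cases where the predicted category differs from the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one case is required")
    wrong = sum(o != p for o, p in zip(observed, predicted))
    return wrong / len(observed)


# Hypothetical testing sample of 8 cases and three hypothetical models.
observed = ["yes", "no", "no", "yes", "no", "no", "yes", "no"]
predictions = {
    "model_a": ["yes", "no", "no", "no", "no", "no", "yes", "no"],
    "model_b": ["no", "no", "no", "no", "no", "no", "yes", "no"],
    "model_c": ["yes", "yes", "no", "yes", "no", "yes", "no", "no"],
}

rates = {name: misclassification_rate(observed, p) for name, p in predictions.items()}
best = min(rates, key=rates.get)   # model with the lowest rate

print(rates)  # {'model_a': 0.125, 'model_b': 0.25, 'model_c': 0.375}
print(best)   # model_a
```

Multiplying a rate by 100 gives the same quantity expressed as a percent disagreement.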

The Goodness of Fit for Multiple Inputs Node. The Goodness of Fit for Multiple Inputs node is one way to evaluate the different models. This tool will use the testing output spreadsheets from each model building tool as input. Variable selections should be made for each of these output spreadsheets.

Double-click the GDA output spreadsheet, Testing_PMML_GDA*, generated by the General Discriminant Analysis node to display the Select dependent variables and predictors dialog box. Click the Variables button to display the variable selection dialog box. The Dependent, categorical variable should be observed survival, variable 3. The Predictor, categorical variable is the predicted Survival, variable 1. After making these selections, click OK in the variable selection dialog box, and then select the Always use these selections, overriding any selections the generating node may make check box. Click OK.

Then, repeat this process for the remaining testing output spreadsheet nodes.

Highlight all of the modified testing output nodes so they will automatically be connected to the Goodness of fit node. Open the Node Browser. Expand the Data Mining folder and select the Goodness of Fit folder. In the right pane, select Goodness of fit for Multiple inputs.

Click Insert into the workspace.

In the Workspace, double-click on that node to display the Edit Parameters dialog box, and set the Variable type to Categorical.

 

Also, on the Categorical tab of this dialog box, specify that the node compute the Percent disagreement between the predicted and observed classifications (select the True option button). Click the OK button.

Then, run only the selected node:

Ribbon bar. Select the Edit tab. In the Run group, click Selected Node.

Classic menus. On the Run menu, click Run to Selected Node.

This will update only this node.

Double-click on the Reporting Documents workbook to see the results. The last output spreadsheet gives the overall summary of all models and all tests. Percent disagreement is lowest for the CHAID model (row 2 of the summary output): 22.6158.

Browse through the results workbooks to the Exhaustive CHAID folder to see the specific solution generated by this classifier.

If you follow this decision tree (see also Classification and Regression Trees), you will see that women in first and second class were predicted to have a much higher chance of survival, as did male children in first and second class (there were no children crew members). This solution, which could have been expected, nevertheless demonstrates that the program found a sensible model for predicting classifications.

Deployment: Computing Predicted Classifications. While deployment – predicting classifications for new cases where observed values do not exist (yet) – isn't useful in the present example, you could, nevertheless, now attach to the node named Compute Best Prediction from All Models a new data source that has missing data for the categorical dependent variable (see also Example 3 for prediction of a continuous dependent variable from multiple models). The program would then compute predicted classifications based on a "vote" (which category gets the most predictions) made by all models. This is the default method of combining different models used by the Compute Best Prediction... node; you can display the Edit Parameters dialog box for that node to select one of the other methods for combining predicted classifications as well.

 

The Best prediction and Vote of best k predictions options would automatically identify (based on the Testing sample misclassification rates) which models were most accurate, and use those models to compute a voted prediction (see also bagging, voting, or Meta-learning).
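The three ways of combining models described above (vote of all models, best prediction, vote of best k predictions) can be illustrated with the following sketch. It is a simplified reading of the text, not the node's actual implementation: the method labels "vote", "best", and "vote_best_k", the model names, the data, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def vote(predictions_by_model):
    """Majority vote per case. On a tie, the category predicted by the
    earliest-listed model wins (an assumption of this sketch)."""
    columns = list(predictions_by_model.values())
    return [Counter(case).most_common(1)[0][0] for case in zip(*columns)]


def best_k(predictions_by_model, error_rates, k):
    """Keep the k models with the lowest testing misclassification rates,
    ordered from most to least accurate."""
    keep = sorted(error_rates, key=error_rates.get)[:k]
    return {name: predictions_by_model[name] for name in keep}


def combine(predictions_by_model, error_rates, method="vote", k=3):
    """Combine predicted classifications using one of three hypothetical labels."""
    if method == "vote":
        return vote(predictions_by_model)
    if method == "best":
        return vote(best_k(predictions_by_model, error_rates, 1))
    if method == "vote_best_k":
        return vote(best_k(predictions_by_model, error_rates, k))
    raise ValueError(f"unknown method: {method}")


# Made-up predictions for four new cases, and made-up testing error rates.
preds = {
    "model_a": ["yes", "no", "no", "yes"],
    "model_b": ["no", "no", "yes", "yes"],
    "model_c": ["no", "yes", "yes", "yes"],
}
errors = {"model_a": 0.10, "model_b": 0.30, "model_c": 0.20}

print(combine(preds, errors, "vote"))              # ['no', 'no', 'yes', 'yes']
print(combine(preds, errors, "best"))              # ['yes', 'no', 'no', 'yes']
print(combine(preds, errors, "vote_best_k", k=2))  # ['yes', 'no', 'no', 'yes']
```

With k = 2 every disagreement is a tie, so in this sketch the more accurate of the two models decides; an odd k avoids ties for a two-category outcome.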

Deploying the Solution to the "Field". To reiterate (see also Analysis Nodes with Automatic Deployment), the deployment information is kept along with the data miner project in a Global Dictionary, which is a workspace-wide repository of parameters. This means that you could now save this Data Miner project under a different name, and then delete all analysis nodes and related information except the Compute Best Prediction from All Models node and the data source with new observations (marked for deployment). You could now simply enter values for the predictor variables, run this project (with the Compute Best Prediction from All Models node only), and thus quickly compute predicted classifications. Because STATISTICA Data Miner, like all analyses in STATISTICA, can be called from other applications, advanced applications could involve calling this project from some other (e.g., data entry) application.

Ensuring that deployment information is up to date. In general, the deployment information for the different nodes that are named ...with Deployment is stored in various forms locally along with each node, as well as globally, "visible" to other nodes in the same project. This is an important point to remember, because for Classification and Discrimination (as well as Regression Modeling and Multivariate Exploration), the node Compute Prediction from All Models will compute predictions based on all deployment information currently available in the global dictionary. Therefore, when building models for deployment using these options, ensure that all deployment information is up to date, i.e., based on models trained on the most current set of data. You can also use the Clear All Deployment Info nodes in the Data Miner Workspace to programmatically clear out-of-date deployment information every time the project is updated.

Predicting new observations, when observed values are not (yet) available. When connecting data for deployment (prediction or predicted classification) to the nodes for Classification and Discrimination or Regression Modeling and Multivariate Exploration, ensure that the "structure" of the input file for deployment is the same as that used for building the models (see also the option description for Data for deployed project; do not re-estimate models in the Select dependent variables and predictors dialog box topic). Specifically, ensure that the same numbers and types of predictor variables are specified, that a (continuous or categorical) dependent variable is specified (even if all values for that variable are missing), and that the variable names match those in the data file used to build the models (this is particularly important for the deployment of neural networks, which will rely on this information). Also, when using numeric variables with text values as categorical predictors or dependent variables, ensure that consistent coding is used throughout the Data Miner project. For additional details, refer to Using text variables or text values in data miner projects; for a detailed technical discussion of this issue, and the manner in which STATISTICA Data Miner handles text variables and values, see Working with Text Variables and Text Values: Ensuring Consistent Coding.
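The structural requirements listed above (same variables, same types, matching names, dependent variable present) amount to a simple comparison that could be run before deployment. The sketch below is a hypothetical pre-check written for this illustration; STATISTICA does not expose a function with this name, and the "categorical"/"continuous" labels are just strings chosen for the example.

```python
def check_deployment_structure(training_schema, deployment_columns):
    """Compare a deployment file's variables with those used to build the models.

    Both arguments map variable name -> type label. Returns a list of
    problems; an empty list means the structures match.
    """
    problems = []
    for name, kind in training_schema.items():
        if name not in deployment_columns:
            problems.append(f"missing variable: {name}")
        elif deployment_columns[name] != kind:
            problems.append(
                f"type mismatch for {name}: expected {kind}, "
                f"got {deployment_columns[name]}"
            )
    for name in deployment_columns:
        if name not in training_schema:
            problems.append(f"unexpected variable: {name}")
    return problems


# Variables used in this example; the dependent variable must be present
# in the deployment file even if all of its values are missing.
training = {
    "class": "categorical",
    "age": "categorical",
    "gender": "categorical",
    "survival": "categorical",
}

good_file = dict(training)
bad_file = {"class": "categorical", "age": "categorical", "Gender": "categorical"}

print(check_deployment_structure(training, good_file))  # []
print(check_deployment_structure(training, bad_file))
# ['missing variable: gender', 'missing variable: survival', 'unexpected variable: Gender']
```

The name comparison here is deliberately case-sensitive, so a renamed or differently capitalized variable is reported rather than silently accepted.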

Conclusion. The purpose of this example is to show how easily a large number of the most sophisticated methods for predictive data mining can be applied to data, and how sophisticated ways of combining the power of these methods for predicting new observations become automatically available. The techniques provided in STATISTICA Data Miner represent some of the most advanced techniques for predictive data mining available today.

See also Data Mining Definition, Data Mining with STATISTICA Data Miner, Structure and User Interface of STATISTICA Data Miner, STATISTICA Data Miner Summary, and Getting Started with STATISTICA Data Miner.