Data Miner Recipes Example

Overview

A general trend in data mining is the increasing emphasis on solutions based on simple analytic processes rather than the creation of ever-more sophisticated general analytic tools. The Statistica Data Miner Recipes (DMR) approach provides an intuitive graphical interface to enable those with limited data mining experience to execute a “recipe-like,” step-by-step analytic process. With these intuitive dialogs, you can perform various data mining tasks such as regression, classification, and clustering. Other recipes can be built quickly as custom solutions. Completed recipes can be saved and deployed as project files to score new data.

The Statistica Data Miner Recipes module spans the entire data mining process from querying data to the final deployment of solutions and, in general, consists of the following steps.

1. Identifies the data from which to learn

  • Connects to ODBC or OLEDB compliant databases through the Streaming DB Connector

  • Connects to Statistica data files

2. Cleans data and removes the redundant predictors

  • Samples the data using flexible and efficient methods (simple, stratified, systematic, etc.)

  • Identifies and recodes missing data

  • Identifies outliers

  • Transforms the data prior to performing the subsequent steps

  • Identifies and eliminates redundant predictors

3. Identifies important predictors from a large pool of predictors that are strongly related to the dependent (outcome or target) variable of interest

  • Performs feature selection for very large data sets (e.g., thousands of variables)

  • Detects important interactions among the predictors by using tree-based methods

4. Generates a pool of eligible models

  • Leverage the comprehensive selection of cutting edge techniques for predictive data mining available in DMR

  • Offload computationally expensive tasks to Statistica Enterprise Server, freeing your local computer for other tasks

5. Performs automatic competitive evaluation of models to identify the optimum model with respect to performance and complexity

6. Deploys the model to score new data using the efficient deployment engine

This example illustrates how quickly and efficiently data mining projects can be completed using Statistica Data Miner Recipes, even if the best solution to the (prediction) problem emerges only after (automatically) comparing the efficacy of various advanced data mining algorithms.

Example

In this example, we will explore the use of Statistica Data Miner Recipes for Credit Scoring applications. The example is based on the data file CreditScoring.sta, which contains observations on 18 variables for 1,000 past applicants for credit. Each applicant was rated as “good credit” (700 cases) or “bad credit” (300 cases). We want to develop a credit scoring model that can be used to determine if a new applicant is a good credit risk or a bad credit risk, based on the values of one or more of the predictor variables. An additional “Train/Test” indicator variable is also included in the data file for validation purposes.

Start Data Miner Recipes:

Ribbon bar. Select the Data Mining tab. In the Recipes group, click Data Miner Recipes to display the Data miner recipes dialog box.

Classic menus. From the Data Mining menu, select Data Miner Recipes to display the Data miner recipes dialog box.

Click the New button to create a new project.

The step-node panel is located in the upper-left area of the Steps tab. It contains four major nodes: Data preparation, Data for analysis, Data redundancy, and Target variable.

Nodes (steps). Each node (or step) can be in one of up to three states (depending on whether its completion is optional). Each state is represented by an icon: a red icon indicates a wait state, meaning the step cannot be started because it depends on a previous step that has not been completed; a yellow icon indicates a ready state, meaning you can start the step because the previous steps have been completed; a green icon indicates a completed step. Note that you must click the Next step button to change the yellow icon (ready state) to the green icon (completed state). The change is made only if the step has been completed successfully.
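The wait/ready/completed logic described above can be sketched in a few lines of Python. This is only an illustration of the rule, not how Statistica implements it: the names StepState and step_states are hypothetical, and the sketch assumes a strictly linear chain of steps and ignores optional steps.

```python
from enum import Enum


class StepState(Enum):
    WAIT = "red"          # blocked by an earlier step that is not completed
    READY = "yellow"      # all earlier steps are completed; this one can start
    COMPLETED = "green"   # this step has been completed successfully


def step_states(completed_flags):
    """Return the state of each step, given completion flags in step order.

    Assumption: each step depends on all steps before it.
    """
    states = []
    previous_done = True
    for done in completed_flags:
        if done and previous_done:
            states.append(StepState.COMPLETED)
        elif previous_done:
            states.append(StepState.READY)
        else:
            states.append(StepState.WAIT)
        previous_done = previous_done and done
    return states


# Data preparation is done; the three later nodes are not.
states = step_states([True, False, False, False])
print([s.value for s in states])
```

In this sketch, only the first unfinished step is ever yellow; everything after it stays red until the steps before it are completed.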

Data Preparation

Connecting data. On the Data preparation tab, click the Open/Connect data file button. In the Select Data Source dialog box, click the Files button and locate and open the CreditScoring.sta data file (located in the Examples/Datasets folder installed with Statistica - on most computers: C/Program Files/Dell/Statistica/Examples).

Click the Select variables button. In the Select variables dialog box, select the Show appropriate variables only check box. Then, select:

  • Variable 1 (Credit Rating) as the Target, categorical variable,

  • Variables 3, 6, and 14 as Input, continuous (continuous predictors)

  • Variables 2, 4-5, 7-13, and 15-18 as Input, categorical (categorical predictors)

  • Variable 19 (TrainTest) as the Testing sample (validation sample variable)

Click the OK button in the variable selection dialog box.

In the Data miner recipes dialog box, select the Advanced tab. Select the Use sample data check box. Select the Stratified random sampling option button as the sampling strategy to ensure that each class of the dependent variable Credit Rating is represented with approximately equal numbers of cases in train and validation sets.

Then, click the More options button to display the Stratified sampling dialog box. Click the Strata variables button, select Credit Rating as the strata variable, and click OK in the variable selection dialog box. Then, click OK in the Stratified sampling dialog box.
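To see what stratifying on Credit Rating achieves, here is a minimal Python sketch of stratified random sampling. It is not the DMR implementation: the function stratified_sample, the column name credit_rating, and the 50% sampling fraction are assumptions made for illustration. Only the class sizes (700 good, 300 bad) come from the example.

```python
import random
from collections import defaultdict


def stratified_sample(rows, strata_key, fraction, seed=0):
    """Draw the same fraction of rows from every stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[row[strata_key]].append(row)
    sample = []
    for stratum_rows in by_stratum.values():
        k = round(len(stratum_rows) * fraction)
        sample.extend(rng.sample(stratum_rows, k))
    return sample


# Hypothetical rows that mirror the example's class sizes.
rows = [{"id": i, "credit_rating": "good" if i < 700 else "bad"}
        for i in range(1000)]

sample = stratified_sample(rows, "credit_rating", fraction=0.5)
counts = {label: sum(r["credit_rating"] == label for r in sample)
          for label in ("good", "bad")}
print(counts)
```

Because each class is sampled separately, the 70/30 ratio of good to bad credit cases is preserved in the sample, which a simple random sample would only approximate.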

Click the Next step button for the Data preparation step to ensure that this step has been successfully completed (in the step-node panel next to Data preparation, the yellow icon changes to a green icon).

Data for Analysis

After the Data preparation step is completed, the Data for analysis step will be selected automatically.

On the Data for analysis tab, click the Select testing sample button. In the Testing Sample Specifications dialog box, select the Variable option button. Verify that the category (value) Train is selected in the Code for training sample field and Test is selected in the Code for testing sample field.

Then, click the OK button. The models will be fitted using the training sample and evaluated using the observations in the testing sample. By using observations that did not participate in the model fitting computations, the goodness-of-fit statistics computed for (predicted values derived from) the different data mining models (algorithms) can be used to evaluate the predictive validity of each model and, hence, can be used to compare models and to choose one or more over others.
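The role of the indicator variable can be sketched as a simple partition of the cases. The Python below is illustrative only: split_by_indicator and the sample rows are hypothetical, while the TrainTest variable and its Train and Test codes are taken from this example.

```python
def split_by_indicator(rows, indicator, train_code="Train", test_code="Test"):
    """Partition rows into (training, testing) lists using an indicator column."""
    training = [row for row in rows if row[indicator] == train_code]
    testing = [row for row in rows if row[indicator] == test_code]
    return training, testing


# Hypothetical applicants; only the column name and codes come from the example.
rows = [
    {"applicant": 1, "TrainTest": "Train"},
    {"applicant": 2, "TrainTest": "Test"},
    {"applicant": 3, "TrainTest": "Train"},
    {"applicant": 4, "TrainTest": "Train"},
    {"applicant": 5, "TrainTest": "Test"},
]

training, testing = split_by_indicator(rows, "TrainTest")
print(len(training), len(testing))
```

Models would be fitted on training only, and every goodness-of-fit statistic used for comparing models would be computed on testing, so no case contributes to both.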

Descriptive statistics. This step will also compute descriptive statistics for all variables selected in the analysis. Descriptive statistics provide useful information about ranges and distributions of the data used for the project.

Click the Next step button to ensure that this step is completed successfully.

Data Redundancy

Now, the Data redundancy step will be selected. The purpose of the Data redundancy step is to eliminate highly redundant predictors. For example, if the data set contained two measures for weight, one in kilograms and the other in pounds, those two measures would be redundant.

On the Data redundancy tab, select the Correlation coefficient option button.

Specify the Criterion value as 0.8.

Click the Next step button to eliminate the redundant predictors that are highly correlated (r≥0.8). Since there is no redundancy in the data set we are using in this example, a message dialog box will be displayed stating this.
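The correlation criterion can be illustrated with the kilograms and pounds case mentioned earlier. The following Python sketch flags pairs of continuous predictors whose correlation meets the 0.8 criterion. The function names and the data are hypothetical, and comparing the absolute value of r with the criterion is an assumption of this sketch, not a documented detail of DMR.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def redundant_pairs(columns, criterion=0.8):
    """Return pairs of columns whose |r| is at or above the criterion."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson_r(columns[first], columns[second])) >= criterion:
                pairs.append((first, second))
    return pairs


weights_kg = [60.0, 72.5, 81.0, 95.0, 55.5]
columns = {
    "weight_kg": weights_kg,
    "weight_lb": [kg * 2.20462 for kg in weights_kg],  # same measure, new unit
    "age": [41, 23, 35, 52, 29],
}

print(redundant_pairs(columns))
```

Here the two weight columns are perfectly correlated, so one of them could be dropped without losing information, while age falls below the criterion and is kept.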

Click the OK button. The data cleaning and preprocessing for model building is now complete.

Target Variable: Building Predictive Model

Next, we need to build predictive models for the target in this example. In the step-node panel, the Target variable node has a branching structure, with the parent node connecting to four child nodes: Important variables, Model building, Evaluation, and Deployment.

Important variables. The Important variables node is selected automatically. In this step, the goal is to reduce the dimensionality of the prediction problem, i.e., to select a subset of inputs that is most likely related to the target variable (in this example, Credit rating) and, thus, is most likely to yield accurate and useful predictive models. This type of analytic strategy is also sometimes called feature selection.

Two strategies are available. If the Fast predictor screening option button is selected, the program will screen through thousands of inputs and find the ones that are strongly related to the dependent variable of interest. If the Advanced screening option button is selected, tree methods are used to detect important interactions among the predictors.
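The general idea of screening predictors can be sketched as ranking each input by the strength of its relationship to the target and keeping the top few. The text does not say which statistic the program uses; the Python below assumes a Pearson chi-square statistic for categorical predictors, and the names chi_square and top_predictors, as well as the data, are hypothetical.

```python
from collections import Counter


def chi_square(predictor, target):
    """Pearson chi-square statistic for two categorical sequences."""
    n = len(target)
    observed = Counter(zip(predictor, target))
    row_totals = Counter(predictor)
    col_totals = Counter(target)
    stat = 0.0
    for p_value, p_count in row_totals.items():
        for t_value, t_count in col_totals.items():
            expected = p_count * t_count / n
            stat += (observed[(p_value, t_value)] - expected) ** 2 / expected
    return stat


def top_predictors(predictors, target, k):
    """Return the names of the k predictors with the largest statistic."""
    ranked = sorted(predictors,
                    key=lambda name: chi_square(predictors[name], target),
                    reverse=True)
    return ranked[:k]


# Toy data, invented for illustration only.
target = ["good", "good", "good", "bad", "bad", "bad"]
predictors = {
    "noise": ["x", "y", "x", "y", "x", "y"],
    "informative": ["a", "a", "a", "b", "b", "b"],
}

print(top_predictors(predictors, target, k=1))
```

A one-variable-at-a-time ranking like this scales to thousands of inputs but cannot see interactions among predictors, which is the gap that tree-based screening is meant to fill.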

For this example, select the Advanced screening option button as the feature selection strategy. Then, click the Advanced screening button to display the Advanced screening dialog box. Enter 12 in the Number of predictors to extract field.

Click the OK button in this dialog box, and then click the Next step button to complete this step. To review a summary of the analysis thus far, on the Steps tab, click the Report button, and from the drop-down list, select Summary report to display the Results workbook.

These predictors will be further examined using various cutting-edge data mining and machine learning algorithms available in DMR.

Building models. The Data miner recipes dialog box was minimized so that you can see the Results workbook. Click the Data miner recipes button located on the Analysis Bar at the bottom of the application to display the dialog box again.

Now, the Model building node is selected. In this step, you can build a variety of models for the selected inputs. On the Model building tab, the C&RT, Boosted tree, and Neural network check boxes are selected by default as the models or algorithms that will automatically be “tried” against the data.

The computations for building predictive models can be performed either locally (on your computer) or on the Statistica Enterprise Server. However, the latter option is available only if you have a valid Statistica Enterprise Server account and you are connected to the server installation at your site.

For this example, click the Build model button to perform the computations locally on your computer. This will take a few moments; when finished, click the Next step button to complete this step.

Evaluating and selecting models. Now, the Evaluation node is selected. On the Evaluation tab, click the Evaluate models button to perform the competitive evaluation of models for identifying the best performing model in terms of performance in the validation sample.

Notice that the Neural network model has the minimum error rate of 35.75%. In other words, 64.25% of the cases in the validation sample are correctly predicted by this model. Your results (the best model and the percentages) may differ, because these advanced data mining methods randomly split the data into subsets during training to produce reliable estimates of the error rates.
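Competitive evaluation amounts to computing each model's error rate on the validation sample and keeping the model with the smallest one. The Python sketch below shows that selection rule on a tiny set of labels invented for illustration. The function names are hypothetical and the numbers are not the results of this example.

```python
def error_rate(observed, predicted):
    """Share of validation cases whose predicted class differs from the observed one."""
    wrong = sum(o != p for o, p in zip(observed, predicted))
    return wrong / len(observed)


def best_model(observed, predictions_by_model):
    """Return the name of the model with the lowest error rate, plus all rates."""
    rates = {name: error_rate(observed, preds)
             for name, preds in predictions_by_model.items()}
    winner = min(rates, key=rates.get)
    return winner, rates


# Toy validation labels and predictions, invented for illustration only.
observed = ["good", "good", "bad", "good", "bad", "good", "bad", "good"]
predictions_by_model = {
    "C&RT":           ["good", "good", "good", "good", "bad", "good", "good", "bad"],
    "Boosted tree":   ["good", "bad", "bad", "good", "good", "good", "bad", "good"],
    "Neural network": ["good", "good", "bad", "good", "bad", "good", "good", "good"],
}

winner, rates = best_model(observed, predictions_by_model)
print(winner, rates[winner], 1 - rates[winner])
```

The accuracy is simply one minus the error rate, which is the same relationship as between the 35.75% and 64.25% figures quoted above.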

On the Steps tab, click the Report button, and from the drop-down list, select Summary report to display the Results workbook. Review the Summary Frequency table (predictions) output for the best model.

This spreadsheet shows the classification performance of the best model on the validation data set. The columns represent the predicted class frequencies, as predicted by the Neural network model, and the rows represent the actual or observed classes in the validation sample. In this matrix, you can see that this model predicted 145 out of 197 “bad credit risks” correctly, but misclassified 52 of them. This information is usually much more informative than the overall misclassification rate, which simply tells us that the overall accuracy is 76.61%.
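A per-class rate can be read directly off one row of this table. The short Python sketch below uses only the two counts quoted above for the observed “bad credit” row (145 correct, 52 misclassified). The function name class_hit_rate is hypothetical.

```python
def class_hit_rate(correct, misclassified):
    """Share of one observed class that the model predicted correctly."""
    return correct / (correct + misclassified)


# Counts for the observed "bad credit" row, as quoted in the text.
bad_rate = class_hit_rate(correct=145, misclassified=52)
print(f"{bad_rate:.1%}")  # prints 73.6%
```

In credit scoring, this row is often the one that matters most, because approving an applicant who turns out to be a bad risk is usually costlier than rejecting a good one, and a single overall accuracy figure hides that distinction.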

Display the Data miner recipes dialog box again, and click the Next step button to complete this step.

Deployment. The final Deployment step involves using the best model and applying it to new data in order to predict the “good” or “bad” customers. In this case, deploy the Neural network model that gave us the best predictive accuracy on the test sample when compared to the other models. This step also provides the option for writing back the scoring information (classification probabilities computed by the best model, predicted classification, etc.) to the original input data file or database. This is extremely useful for deploying models on very large data sets to “score” databases.

On the Deployment tab, click the Data file for deployment button and double-click on the CreditScoring.sta data file (located in the Examples/Datasets folder installed with Statistica). For demonstration purposes, we are using the same data file for deployment of the best model.

Click the Next step button to score this data file using the best model. The scored file with classifications and prediction probabilities (titled Summary of Deployment) is located in the Deployment folder in the project workbook.

Summary

The purpose of this example is to demonstrate the efficiency of the data miner workflow implemented in the Statistica Data Miner Recipes. With only a few clicks, the program guides you through the complete analytic process - from the definition of input data and analysis problem, through data cleaning and preparation and model building, all the way to final model selection and deployment.

Statistica Data Miner Recipes resolves most of the computational complexities of data mining automatically, so you can move from problem definition to a solution very quickly even if you are a novice. At the same time, the program will “apply and try” a large number of advanced data mining algorithms and automatically determine which approach is most successful.

Thus, the Statistica Data Miner Recipes methodology and user interface enable you to leverage the largest collection of data mining algorithms in a single package to solve your problems.