Getting Started with Dell Statistica Data Miner

To use the Statistica data mining tools, follow these steps:

1. Select the desired option on the Data Mining tab (ribbon bar) or the Data Mining menu (classic menus).

The Statistica Data Mining tab and Data Mining menu contain options to create a data miner workspace where you can build and maintain complex models, options to select predefined templates of data miner workspaces for simple and complex tasks, and options to select Statistica analysis modules for particular specialized analyses. Also available is the Data Miner Recipes option, which guides you through a step-by-step process to simplify the method of predictive model building.

Data Miner - All Procedures. Select this command (located on the Data Mining - Workspaces submenu) to create a new data mining workspace.

Data Miner - Data Cleaning and Filtering. Select this command (located on the Data Mining - Workspaces submenu) to choose from a large number of nodes for "cleaning" the data, i.e., for filtering out invalid data values, missing data replacement, user-defined transformations, ranking, standardization, etc. The very powerful Feature Selection and Variable Screening node enables you to quickly process very large lists of continuous and categorical predictors for regression and classification problems, and to select a subset that is most strongly related to the dependent (outcome) variables of interest. The algorithm for selecting those variables is not biased in favor of a single method for subsequent analyses (as it would be if, e.g., it picked the highest correlations for later analyses via linear models), and the resulting variable lists are made available as mappings into the original data source, so that no actual data need to be copied (e.g., from a remote database). For additional details, see also Feature Selection and Variable Screening.

Data Miner - General . . . . Select any of these commands (located on the Data Mining - Workspaces submenu) to display predefined sets of data mining templates for typical types of analysis problems. The General Slicer/Dicer Explorer with Drill-Down command also provides access to a specialized interactive drill-down tool.

Neural Networks, Independent Components Analysis, Generalized EM & k-Means Cluster Analysis, Association Rules, General Classification/Regression Tree Models, General CHAID Models, Interactive Trees (C&RT, CHAID), Boosted Tree Classifiers and Regression, Random Forests for Regression and Classification, Generalized Additive Models, MARSplines (Multivariate Adaptive Regression Splines), Machine Learning (Bayesian, Support Vectors, Nearest Neighbors). These commands will display the modules for performing the respective types of analyses interactively, using the standard Statistica user interface.

Rapid Deployment of Predictive Models (PMML); Goodness of Fit, Classification, Prediction; Feature Selection and Variable Filtering; Combining Groups (Classes) for Predictive Data-Mining. These commands display the respective specialized modules. Rapid Deployment of Predictive Models quickly generates predictions from one or more previously trained models based on information stored in industry-standard PMML (Predictive Model Markup Language) deployment code. Goodness of Fit computes various goodness-of-fit statistics and graphs for regression and classification problems. Feature Selection and Variable Screening is used to select variables (columns) from very large data sets or external databases, e.g., to select subsets of predictors from hundreds of thousands of predictors, or even more than one million predictors. Combining Groups (Classes) for Predictive Data-Mining is used to automatically find and implement a best recoding scheme for the prediction of a continuous or categorical variable from one or more categorical predictors with many classes (e.g., SIC codes with more than 10,000 distinct values).

Example.

Ribbon bar. Select the Data Mining tab. In the Tools group, click Workspaces, and select All Validated Procedures to display a standard workspace.

Classic menus. From the Data Mining - Workspaces submenu, select Data Miner - All Procedures to display a standard workspace.

2. Select a new data source.

Next, specify the input data for the data mining project. Click the Data Source button on the data miner workspace to display a standard data file selection dialog where you can select either a Statistica data file (Statistica Spreadsheet designated for Input), or a database connection for in-place processing of data in remote databases (see Streaming Database Connector Technology).

Example. Select the Boston2.sta data file (located in the Statistica examples).

This example data file contains the data from the Boston housing study (Harrison & Rubinfeld, 1978) reported in Lim, Loh, and Shih (1997). Click OK to select the variables for the analyses.

3. Select the variables for the analyses.

Next, select the variables for the analyses. Statistica Data Miner distinguishes between categorical and continuous variables, and between dependent and predictor (independent) variables. Categorical variables are those that contain information about some discrete quantity or characteristic describing the observations in the data file (e.g., Gender: Male or Female); continuous variables are measured on some continuous scale (e.g., Height, Weight, Cost). Dependent variables are the ones we want to predict; they are also sometimes called outcome variables. Predictor (independent) variables are those that we want to use for the prediction or classification (of categorical outcomes).

You don't have to select variables into each list; in fact, some types of analyses only expect a single list of variables (e.g., cluster analysis). You can also make additional selections, such as specifying codes for categorical variables, case selection conditions, or case weights, or you can specify censoring, a learning/testing variable, etc.

Example. Select the variable Price as the categorical dependent variable that is to be predicted; select variable Cat1 as a categorical predictor, and select variables ORD1 through ORD12 as continuous predictors.

Now click OK to add this data source to the data mining workspace.
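The variable roles described in this step can also be written down as a small data structure. The following Python sketch is only an illustration and is not part of Statistica; the VariableSelection class and its field names are hypothetical, and the values simply mirror the example selection above (Price, Cat1, ORD1 through ORD12).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VariableSelection:
    """Hypothetical container for the four role combinations described above."""
    categorical_dependent: List[str] = field(default_factory=list)
    continuous_dependent: List[str] = field(default_factory=list)
    categorical_predictors: List[str] = field(default_factory=list)
    continuous_predictors: List[str] = field(default_factory=list)

    def all_variables(self) -> List[str]:
        """Return every selected variable, regardless of role."""
        return (self.categorical_dependent + self.continuous_dependent
                + self.categorical_predictors + self.continuous_predictors)


# Mirrors the example: Price is predicted from Cat1 and ORD1 through ORD12.
selection = VariableSelection(
    categorical_dependent=["Price"],
    categorical_predictors=["Cat1"],
    continuous_predictors=[f"ORD{i}" for i in range(1, 13)],
)
```

Lists that are not needed for a particular analysis simply stay empty, which corresponds to leaving a list blank in the variable selection dialog.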

4. Display the Node Browser and select the desired analyses or data management operation.

Next, click the Node Browser button on the data miner workspace, or display the Node Browser by selecting that command from the Nodes menu; you can also press CTRL+B on your keyboard to display the node browser.

The Node Browser contains all the procedures available for data mining in the data miner workspace; there are more than 260 procedures to choose from for data filtering and cleaning, and for data analysis. By default, all procedures are organized in folders according to the types of analyses that they perform. However, the Node Browser is fully configurable. You can specify multiple Node Browser configurations, and these customizations will automatically be saved along with the data miner workspace. Thus, you can greatly simplify routine analyses by fully customizing the default Node Browser configuration for your work.

To select analyses (analysis nodes), highlight them in the right pane and click the Insert into workspace button on the Node Browser toolbar; you can also simply double-click on the analysis node of interest to insert it into the workspace. The lower pane of the Node Browser contains a description of the currently highlighted selection.

Example. Select the nodes for Descriptive Statistics, then scroll down in the left pane of the Node Browser and select the folder labeled Classification and Discrimination, and in the right pane, select Standard Classification Trees with Deployment.

If a data source in the workspace is currently highlighted, it will be connected automatically to the nodes as they are selected (inserted) into the workspace. You can also use the Connect toolbar button to connect data sources to nodes; to delete an arrow, click on it and select Delete from the shortcut menu (displayed by right-clicking your mouse), or press the DEL key on your keyboard; you can temporarily disable an arrow by selecting Disable from the shortcut menu. Arrows that are disabled will not be updated or recomputed.

5. Run (update) the data miner project.

Next, run the data miner project. All nodes connected to data sources via (non-disabled) arrows will be updated, and the respective analyses will be produced.
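Conceptually, the rule above amounts to following the enabled arrows outward from the data sources. The Python sketch below illustrates that idea only; it is not how Statistica is implemented, the function name nodes_to_update is hypothetical, and the disabled arrow in the sample workspace is made up for demonstration.

```python
from collections import deque


def nodes_to_update(data_sources, arrows):
    """Return the nodes reachable from the data sources via enabled arrows.

    `arrows` is a list of (source, target, enabled) tuples.
    """
    outgoing = {}
    for source, target, enabled in arrows:
        if enabled:  # disabled arrows are ignored entirely
            outgoing.setdefault(source, []).append(target)

    reached = []
    seen = set(data_sources)
    queue = deque(data_sources)
    while queue:
        current = queue.popleft()
        for target in outgoing.get(current, []):
            if target not in seen:
                seen.add(target)
                reached.append(target)
                queue.append(target)
    return reached


# Hypothetical workspace: the arrow to the tree node has been disabled.
arrows = [
    ("Boston2.sta", "Descriptive Statistics", True),
    ("Boston2.sta", "Standard Classification Trees with Deployment", False),
    ("Descriptive Statistics", "Workbook 1", True),
    ("Standard Classification Trees with Deployment", "Workbook 2", True),
]
updated = nodes_to_update(["Boston2.sta"], arrows)
```

In this sketch only Descriptive Statistics and its workbook are updated; the tree node and its workbook are skipped because the only path to them runs through a disabled arrow.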

A note on data cleaning, filtering, and EDA. The Statistica Data Miner project workspace is fully integrated into the Statistica data analysis environment. At any point, you can click on a data source or results workbook (spreadsheet, report), either in the Data Acquisition area or in any other area (e.g., data sources created by analyses), to review the respective document. Also, you can use any of the interactive analyses available in Statistica to further explore those documents, for example, to run simple descriptive statistics or create descriptive graphs to explore the respective results. These types of Exploratory Data Analysis (EDA) techniques are indispensable for data cleaning and verification. For example, it is useful to always run simple descriptive statistics, computing the minima and maxima for variables in the analyses, to ensure that data errors (impossible values) are corrected before they lead to erroneous conclusions. Also, the various options on the Data menu of the data spreadsheet toolbar are very useful for cleaning and verifying the data interactively before submitting them to further analyses.
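The minimum/maximum check mentioned in this note can be sketched outside of Statistica as well. The Python example below is a minimal illustration under stated assumptions: the function name find_out_of_range, the plausible ranges, and the sample rows are all made up, and only the variable names Height and Weight are borrowed from the text above.

```python
def find_out_of_range(rows, valid_ranges):
    """Compute the observed minimum and maximum per variable and flag
    variables whose observed values fall outside the plausible range."""
    report = {}
    for name, (low, high) in valid_ranges.items():
        # Missing values (None) are skipped rather than treated as errors.
        values = [row[name] for row in rows if row.get(name) is not None]
        if not values:
            continue
        observed_min, observed_max = min(values), max(values)
        report[name] = {
            "min": observed_min,
            "max": observed_max,
            "suspect": observed_min < low or observed_max > high,
        }
    return report


# Made-up observations; the third Height looks like a data-entry error.
rows = [
    {"Height": 172, "Weight": 70},
    {"Height": 165, "Weight": 58},
    {"Height": 1780, "Weight": 81},
    {"Height": 181, "Weight": None},
]
valid_ranges = {"Height": (50, 250), "Weight": (2, 400)}  # assumed limits
report = find_out_of_range(rows, valid_ranges)
```

A variable flagged as suspect should be inspected and corrected before it is passed on to any modeling node.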

Example. Click the Run button, select Run All Nodes from the Run menu, or press F5.

Detailed results are created by default for each type of analysis in Statistica Workbooks; double-click on a workbook to review its contents. You can also redirect all the green arrows that lead to the workbooks so that they point to a single workbook, which directs all results to a single container. The complete functionality of Statistica Workbooks is available for these results, so you can save these results, drag, drop, edit individual results and graphs, mark certain spreadsheets as input data for subsequent analyses, and so on.

6. Customize analyses, edit results, save results.

The next step is to review the results, edit the analyses, etc.

  • In general, click on any icon, and then use the shortcut menu to review the various options available for the object (analysis, data source, output document, result, etc.).

  • To review results, double-click on the Workbooks or other documents created by the analyses. Use the options on the Data Miner tab of the Options dialog box to configure Statistica Data Miner, for example, to direct output to reports instead of workbooks.

  • To edit analyses (change the parameters for the analyses), double-click on the respective analysis icons; this will display the Edit Parameters dialog box, which contains parameters and settings specific to the respective node.

  • To edit documents created by analyses for downstream analyses, click on the item, and select View Document from the shortcut menu.

  • You can delete nodes by highlighting them and pressing the DEL key, or by selecting Delete from the shortcut menu. To undo your changes to the data miner workspace, use the standard Undo methods (press CTRL+Z, or click the Undo button on the toolbar).

  • To save the workspace, select Save from the File menu; the default filename extension for the data miner workspace is .sdm. By default, the program will save all input data sources embedded into the data mining project; this default can be changed by clearing the Embed input files in Data Miner project files when saving check box on the Data Miner tab of the Options dialog box.

Example. To compute various graphical summaries, double-click on the Descriptive Statistics node, and set the Detail of reported results parameter to All results. Then click OK. Next, double-click on the Standard Classification Trees with Deployment node, select the V-Fold Cross-validation tab, and request V-fold cross-validation; this is a very important safeguard against over-learning.

Next, click on the General tab and set the Minimum n per node (of the final tree) to 50; this will cause the tree growing procedure to terminate when the node size falls below that number, and hence to create less complex trees. Then click the OK button.

You will see that the two analysis nodes as well as the workbook nodes are now displayed with a red frame around them; this denotes that these nodes are not up to date, or dirty. Finally, let us move the arrow from the Standard Classification Trees with Deployment node to point to the same (first) workbook where the descriptive statistics are displayed. To do this, click on the head of the arrow, and drag it over to the first workbook (release the mouse button as your cursor is hovering over the workbook node); also delete the now disconnected workbook node (unless you'd like to keep it for reference).

Next, click the Update button, or press F5.

Note that V-fold cross-validation is a time-consuming procedure that will validate each tree in the tree sequence several times; however, a significant benefit is that the program will now pick the best tree for you (the one with the best trade-off between cross-validation cost and node complexity).
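The general idea behind V-fold cross-validation can be sketched in a few lines of Python. This is a generic illustration, not the algorithm Statistica uses internally: the function names are hypothetical, the outcome labels are made up, and a simple majority-class rule stands in for a real classification tree. The cases are split into V random folds; each fold is held out once while the model is fit to the remaining cases, and the misclassification rates on the held-out folds are averaged.

```python
import random
from collections import Counter


def v_fold_indices(n_cases, v, seed=0):
    """Split the case indices 0..n_cases-1 into v random, near-equal folds."""
    if v < 2 or v > n_cases:
        raise ValueError("v must be between 2 and the number of cases")
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[i::v] for i in range(v)]


def cross_validated_cost(labels, v, seed=0):
    """Average misclassification rate of a majority-class rule over v folds."""
    folds = v_fold_indices(len(labels), v, seed)
    fold_costs = []
    for held_out in folds:
        held_out_set = set(held_out)
        training = [i for i in range(len(labels)) if i not in held_out_set]
        # "Train": predict the most frequent class among the training cases.
        majority = Counter(labels[i] for i in training).most_common(1)[0][0]
        # "Validate": misclassification rate on the held-out fold.
        errors = sum(1 for i in held_out if labels[i] != majority)
        fold_costs.append(errors / len(held_out))
    return sum(fold_costs) / len(fold_costs)


labels = ["HIGH"] * 6 + ["LOW"] * 4  # made-up outcome values
folds = v_fold_indices(len(labels), v=5)
cv_cost = cross_validated_cost(labels, v=5)
```

When several candidate models are compared, the same procedure is repeated for each candidate, which is why the computation takes noticeably longer than a single fit.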

After all nodes are updated, double-click on the (now single) results node to review all results in the workbook. Note that both the Descriptive Statistics results and the results of the Standard Classification Trees analysis are displayed in the same workbook.

7. Deploy solution (models) for new data.

Statistica Data Miner includes a complete deployment engine for data miner solutions that comprises various tools. For example:

  • In most interactive analysis modules [such as General Regression Models, Generalized Linear Models, General Discriminant Function Analysis, General Classification and Regression Trees (GC&RT), Generalized EM & k-Means Cluster Analysis, etc.], you can create Visual Basic or C/C++/C# program code that will compute predictions, predicted classifications, or cluster assignments. (See also Using C/C++/C# Code for Deployment.)

  • You can create XML-syntax based PMML files with deployment information in most interactive modules that will compute predictions, predicted classifications, or cluster assignments (the same modules mentioned in the previous paragraph). One or more PMML files with deployment information based on trained models can be loaded by the Rapid Deployment of Predictive Models modules to compute predictions or predicted classifications (and related summary statistics) in a single pass through the data; hence this method is extremely fast and efficient for scoring (predicting or classifying) large numbers of new observations.

  • General Classification and Regression Trees and General CHAID modules can be used to create SQL query code to retrieve observations classified to particular nodes, or to assign observations to a node (e.g., to write the node assignments back into the database).

  • Complex neural networks and neural network ensembles (sets of different neural network architectures producing an average or weighted predicted response or classification) can also be saved in binary form and later applied to new data.
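To make the idea of SQL query code for tree nodes more concrete: a terminal node is defined by the split conditions on the path from the root to that node, so retrieving its observations amounts to combining those conditions in a WHERE clause. The Python sketch below illustrates this under stated assumptions; it does not reproduce the code that Statistica generates, and the function name node_to_sql, the table name housing, and the split values are hypothetical.

```python
def node_to_sql(table, conditions):
    """Build a SELECT statement for the rows that fall into one tree node.

    `conditions` lists the (column, operator, value) splits on the path
    from the root of the tree to the node. This sketch does no escaping,
    so it is only suitable for trusted, programmatically generated input.
    """
    allowed_operators = {"<", "<=", ">", ">=", "=", "<>"}
    clauses = []
    for column, operator, value in conditions:
        if operator not in allowed_operators:
            raise ValueError(f"unsupported operator: {operator}")
        # Text values are quoted; numeric values are written as-is.
        literal = f"'{value}'" if isinstance(value, str) else str(value)
        clauses.append(f"{column} {operator} {literal}")
    if not clauses:  # the root node contains every row
        return f"SELECT * FROM {table}"
    return f"SELECT * FROM {table} WHERE " + " AND ".join(clauses)


# Hypothetical node: cases with ORD1 <= 6.5 and Cat1 equal to 'A'.
sql = node_to_sql("housing", [("ORD1", "<=", 6.5), ("Cat1", "=", "A")])
```

Because the query runs inside the database, the observations in a node can be retrieved without first copying the data out of the database.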

In addition, Statistica Data Miner contains various designated procedures in the (Node Browser) folders Classification and Discrimination, Regression Modeling and Multivariate Exploration, and General Forecaster and Time Series, to perform complex analyses with automatic deployment and cooperative and competitive evaluation of models.

For example, the Classification and Discrimination folder contains nodes for stepwise and best-subset linear discriminant function analysis, various tree classification methods, generalized linear models procedures, and different neural network architectures.

The analysis nodes with automatic deployment are generally named TypeOfAnalysis with Deployment. Simply connect these nodes to an input data source, update (train) the project, and you are ready for deployment: connect to the node a data source marked for deployment (select the Data for deployed project check box in the dialog specifying the variables for the analysis), and the program will automatically apply the most current model (e.g., tree classifier, neural network architecture) to compute predictions or predicted classifications.

Example. Start a new data miner project by selecting a predefined project for classification. From the Data Mining - Workspaces - Data Miner - General Classifier (Trees and Clusters) submenu, select Advanced Comprehensive Classifiers Project.

Then, click the New Data Source button, and select the Boston2.sta data file again; specify Price as the categorical dependent variable, select variable Cat1 as a categorical predictor, and select variables ORD1 through ORD12 as continuous predictors. Click the Connect button and connect the data icon to the Split Input node, which is the main connection point for the Advanced Comprehensive Classifiers Project.

Now, click the Run button. A number of very advanced, and somewhat time-consuming, analyses will now be performed:

  • The Split Input node in the Data Preparation, Cleaning, Transformation area will randomly select two samples from the input data: one for training the various models for classification, and the other for evaluating the models. For the observations in the Testing sample, the program will automatically compute predicted classifications and misclassification rates so that the Compute Best Prediction From All Models node (the one that initially is not connected to anything in the Data Preparation, Cleaning, Transformation area) can automatically pick the best classifier, or compute a voted best classification (i.e., apply a meta-learner).

  • The program will automatically apply to the Training sample the following classification methods: linear discriminant analysis, standard classification trees (C&RT) analysis, CHAID, Exhaustive CHAID, a radial basis function neural network analysis, and a multilayer perceptron.

  • Next, the program will automatically apply the trained models to the new data (the testing sample). The observations in that sample have not been used for any computations so far (i.e., for the estimation of the models), so they provide a good basis for evaluating the accuracy of the predicted classifications for each model.
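The two building blocks in the steps above, a random split into training and testing samples and a misclassification rate computed on the testing sample, can be sketched as follows. This is an illustrative Python example only; the function names, the test_fraction parameter, and its value are assumptions, and the proportions actually used by the Split Input node are not stated here.

```python
import random


def split_input(cases, test_fraction=0.5, seed=0):
    """Randomly split the cases into a training and a testing sample."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled cases are held out for testing.
    return shuffled[n_test:], shuffled[:n_test]


def misclassification_rate(observed, predicted):
    """Share of cases whose predicted class differs from the observed class."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    errors = sum(1 for o, p in zip(observed, predicted) if o != p)
    return errors / len(observed)


training, testing = split_input(range(10), test_fraction=0.3)
rate = misclassification_rate(["a", "b", "a", "a"], ["a", "a", "a", "b"])
```

Because the testing cases play no part in fitting, the misclassification rate computed on them gives a fair basis for comparing the models.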

A large amount of output will be created.

You can review the results for each model in the respective results nuggets in the Reports areas; during the initial research stage of your data mining project, you will probably want to review the models carefully, along with how well they predict the response of interest. You can also double-click on each of the analysis nodes to select different types of parameters for the respective analyses; in that case you can use the Run to Node option (on the shortcut menu, or the Run menu) to update only the selected node.

You can also now connect new data, marked for deployment, to the Compute Best Prediction From All Models node (the one that is not connected to anything at this point in the Data Preparation, Cleaning, Transformation area). For this example, simply connect the Testing data (which was created as a random sample from the original input data source); then use the Run to Node option to compute the predicted classifications for each model.

After a few seconds, the results spreadsheet with predictions will be created as another node in the Data Preparation, Cleaning, Transformation area.

You can review the final predictions by selecting View Document from the shortcut menu, after clicking on the Final Prediction for PRICE icon (which contains the predicted classifications for variable PRICE from all models). For example, you can compute a multiple histogram for the accuracy for each classifier (also reported in the Final Prediction for PRICE spreadsheet).

You could also look at the accuracy of classification, broken down by each category, and so on. In this case, it appears that all algorithms were reasonably accurate. By default, the Final Prediction spreadsheet will also contain a column with a voted classification from all classifiers. Experience has shown that predicted classification is often most accurate when it is based on multiple classification techniques, which are combined by voting (the predicted class that receives the most votes from participating models is the best prediction). Shown below is the categorized histogram of the voted classifications by the observed classifications.

Clearly the voted classification produces excellent accuracy in the test sample. Remember that the test sample was randomly selected from the original data, and was not used to estimate the models (train the networks, etc.).
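The voting scheme described above is easy to express in code. The Python sketch below shows plain majority voting under stated assumptions: the function name voted_classification, the model names, and the class labels are made up for illustration, and ties are resolved in favor of the class predicted by the model listed first, which is a choice of this sketch rather than documented Statistica behavior.

```python
from collections import Counter


def voted_classification(predictions):
    """Combine per-model predictions into one voted class per case.

    `predictions` maps a model name to its list of predicted classes;
    all lists must cover the same cases in the same order.
    """
    lengths = {len(classes) for classes in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all models must predict the same number of cases")
    voted = []
    for votes in zip(*predictions.values()):
        # most_common(1) returns the class with the most votes; among tied
        # classes, the one encountered first (earliest model) is returned.
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


# Made-up predictions from three classifiers for three cases.
predictions = {
    "Discriminant": ["LOW", "HIGH", "MEDIUM"],
    "C&RT": ["LOW", "MEDIUM", "MEDIUM"],
    "CHAID": ["HIGH", "MEDIUM", "LOW"],
}
voted = voted_classification(predictions)
```

In this example each case receives two matching votes out of three, so the voted column follows the majority even though every individual classifier disagrees with it at least once.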

8. Prepare project for final customer deployment (in the "field").

Once deployment information is stored after training nodes marked with deployment, for classification or prediction (regression problems), you can save the entire project, and later retrieve that file to compute predicted values for new observations. For example, a loan officer may want to predict credit risk based on the information provided on a loan application. The loan officer (end user, or customer) will not have to retrain the models in the current project again; instead he or she can simply connect new data to the prediction node (usually labeled Compute Best Prediction From All Models) and proceed to process the new data. In fact, the data analyst who created the prediction model from training (learning) data can delete all computational (analysis) nodes from the project, and only leave a single node for computing predicted responses. In a sense, such projects are "locked": there is no risk of losing the deployment information due to accidentally starting a retraining of the models.

Advanced methods for deployment in the field. If you are familiar with Statistica Visual Basic (SVB), you can also consider writing a custom program that would further customize the user interface for the end user of the deployed solution. If you review the functions available for the Statistica DataMiner library in the SVB Object Browser, you can see that practically all aspects of the Statistica Data Miner user interface can be customized programmatically; for example, you could attach the automatic application of a deployed solution to new data to a toolbar button so that a loan officer would only have to fill out a form with an applicant's data, click a button, and retrieve scores for credit risk and fraud probability.

Example. Suppose you want to send a fully trained and deployed solution to a customer or client, based on the analyses briefly discussed in step 7. You can simply delete all nodes other than the one labeled Compute Best Prediction From All Models, and save the project as MyDeployedProject.sdm. You may also want to rename the lengthy Compute Best Prediction From All Models to the simple instruction Connect New Data Here.

All deployment information will automatically be saved along with the project file.

See also, Data Mining Definition, Statistica Data Miner Summary, Data Mining with Statistica Data Miner, and Structure and User Interface of Statistica Data Miner.