What Is STATISTICA Data Miner Recipes (DMR)?

STATISTICA Data Miner Recipes (DMR) provides a systematic method for building advanced analytic models that relate one or more target (dependent) quantities to a number of input (independent) predictor variables. The target variables can be continuous or categorical. Continuous target variables are usually associated with regression tasks, while categorical targets are used in classification problems. STATISTICA DMR handles both types of variables and can therefore build predictive models for both regression and classification problems.

STATISTICA DMR is a complete solution that turns predictive model building and data mining into a systematic, step-by-step process. Model building in DMR starts with preliminary data analysis, pre-processing of the analysis variables, dimensionality reduction, and elimination of any redundancy that might exist in the data set. Once data definitions and preparation are complete, DMR can create various predictive models (such as neural networks, support vector machines, trees, etc.) for modeling the values of the target data from the input variables. This step is followed by model evaluation and, finally and most important of all, model deployment, in which the predictive models can be used for making predictions on unseen (new) data (e.g., for "scoring" of databases).

In addition to a recipe-like user interface for building predictive models, STATISTICA DMR also supports the off-loading of computationally demanding tasks to a server. With DMR, you can also save projects and reload them in the future for further deployment.

Core Analytic Ingredients

At the heart of STATISTICA Data Miner Recipes is a step-by-step recipe consisting of many analytic ingredients, which together create a self-contained tool for the creation and deployment of predictive models. The individual analytic methods and steps, applied in a particular order, may be available in various other scientific or academic tools; but the unique combination of analysis methods and steps in DMR creates an analytic workflow that satisfies the needs of both beginners and advanced users of data mining tools. STATISTICA DMR can be run as a nearly single-step data mining process (just specify variables or fields, and then run to completion), or it can be used by experienced data mining practitioners to "host" sophisticated, "fine-tuned" data mining models (with data preprocessing, transformations, etc.) for deployment.

The various ingredients of STATISTICA DMR are implemented as distinct, specific steps, and the results are supported by spreadsheets and graphs designed to help the user draw conclusions and interpret the results.

1. Data preparation. In this first step, the data are prepared for modeling. Data cleaning and transformation procedures can be applied to eliminate unusual and duplicate cases. In addition, a "blind" hold-out sample can be selected to be used later for validating models. Also, the target (dependent) and input variables for the process are specified; targets are the variables or outcomes of interest that are to be predicted from the inputs (independent variables). A brief code sketch of this step follows the example below.

For example, suppose the task is to identify an accurate model for predicting two important outcomes related to credit risk: default probability, and direct profit/loss over the lifetime of the loan. In this case, there would be two target variables (in the training data, i.e., a sample of individuals who had previously taken out a loan): 1) whether a respective individual defaulted on the loan, and 2) how much profit or loss accumulated overall over the lifetime of the loan. Typical predictors might be the credit rating of each person (at the time the loan was originated), average income, etc.
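The sketch below illustrates the kind of preparation this step performs, written in Python with pandas and scikit-learn rather than in STATISTICA's own interface. The file loans.csv and the columns Default, Profit, and Income are hypothetical names chosen to match the credit-risk example above.

```python
# Minimal data-preparation sketch (a pandas/scikit-learn stand-in for DMR's
# first step). File and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("loans.csv")   # hypothetical historical-loan training data
df = df.drop_duplicates()       # eliminate duplicate cases

# Eliminate unusual cases, e.g., incomes more than 3 standard deviations
# from the mean.
z = (df["Income"] - df["Income"].mean()) / df["Income"].std()
df = df[z.abs() <= 3]

# Specify targets (outcomes to be predicted) and inputs (predictors).
targets = ["Default", "Profit"]  # categorical and continuous targets
inputs = [c for c in df.columns if c not in targets]

# Set aside a "blind" hold-out sample for model validation in step 6.
train_df, holdout_df = train_test_split(df, test_size=0.2, random_state=0)
```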

2. Data analysis. In this step, you can conduct statistical analyses of your variables, including the targets and the inputs. You can review various statistics of the data such as the mean, standard deviation, skewness, kurtosis, and observed minimum and maximum. You can also review the variable roles (input or target) and their types (continuous or categorical). For regression analyses, the target variables are invariably continuous; for classification tasks, only categorical variables are chosen as target variables.
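As an illustration of the statistics reviewed in this step, the following continues the Python sketch above (train_df and inputs are carried over from it); this is not STATISTICA output, just the same quantities computed with pandas.

```python
# Descriptive statistics for the inputs: mean, standard deviation,
# skewness, kurtosis, and observed minimum and maximum.
stats = pd.DataFrame({
    "mean":     train_df[inputs].mean(numeric_only=True),
    "std":      train_df[inputs].std(numeric_only=True),
    "skewness": train_df[inputs].skew(numeric_only=True),
    "kurtosis": train_df[inputs].kurt(numeric_only=True),
    "min":      train_df[inputs].min(numeric_only=True),
    "max":      train_df[inputs].max(numeric_only=True),
})
print(stats)
```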

3. Data redundancy. Often, a number of variables carry, to some degree, the same information. For example, the height and weight of people might in many circumstances carry similar information, as the two variables are correlated: from a person's height one can predict his or her weight with some accuracy. Thus, it may not be necessary to use both height and weight in the same analysis. Intuitively, excluding a variable that still carries some unique information may sound like a bad idea, since height and weight are never related with perfect correlation. In practice, however, retaining many near-redundant inputs aggravates the curse of dimensionality, and the benefits gained by reducing dimensionality can outweigh the small loss of information incurred by dropping one of the correlated variables.

In this step, a simple correlation test is applied to identify redundant inputs and remove them from further consideration for modeling. Note that the data redundancy scheme applies only to continuous inputs.
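A simple version of such a correlation test can be sketched as follows, again continuing the running Python example; the 0.95 cutoff is an illustrative choice, not necessarily the threshold DMR uses.

```python
# Correlation-based redundancy check on the continuous inputs: when two
# inputs correlate above the threshold, one of them is dropped.
import numpy as np

numeric = train_df[inputs].select_dtypes(include="number")
corr = numeric.corr().abs()

# Keep only the upper triangle so each pair of inputs is tested once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]

inputs = [c for c in inputs if c not in redundant]
print("Dropped as redundant:", redundant)
```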

4. Dimension reduction. One of the important functionalities available in STATISTICA DMR is the ability to identify a small number of important inputs (for predicting the target) from a much larger number of available inputs; this is effective even in cases where there are more inputs than cases or observations. Even after the previous step (data redundancy) has been applied, a large number of inputs usually remain for model building. While many methods exist, and are typically used, to "screen" inputs for those that appear to be related in some way to the target variable of interest, a major analytic challenge is to find the interactions between inputs that predict the target of interest.

For example, predicting (modeling) the performance of a boiler requires a particular "configuration" of settings across multiple inputs; moreover, the simplest monotone relationships (the more of X, the more of Y) are usually known already. The accurate prediction of credit or insurance risk (from inputs beyond commonly known and obvious risk factors) likewise usually requires the identification of a specific "configuration" of demographic and other variables that are related to risk. Interactions between inputs often cannot be screened exhaustively: even with as few as 100 inputs, there are more than 160,000 distinct combinations of, for example, 3 specific inputs that can be chosen from the 100 (C(100, 3) = 161,700). STATISTICA DMR uses tree-based algorithms for finding important input predictor variables and the interactions among them.
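The combinatorial count above, and the flavor of tree-based screening, can be sketched as follows; the scikit-learn random forest is a stand-in for DMR's own tree-based algorithms, and the "top 10" cutoff is illustrative.

```python
# The combinatorial explosion: distinct 3-input combinations from 100 inputs.
from math import comb
print(comb(100, 3))  # 161700

# Tree-based predictor screening (a stand-in for DMR's tree-based selection).
from sklearn.ensemble import RandomForestClassifier

X = train_df[inputs].select_dtypes(include="number")  # numeric inputs only
y = train_df["Default"]                               # categorical target
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Keep the inputs the trees rank as most important; "top 10" is arbitrary.
ranked = sorted(zip(forest.feature_importances_, X.columns), reverse=True)
selected = [name for _, name in ranked[:10]]
```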

5. Model building. In this step, the actual models for predicting the targets from the inputs are built. Traditionally, building predictive models falls under the domain of data mining or statistics. In STATISTICA DMR, the goal is to largely "automate" the process of generating good (accurate) predictive models; thus, by default, the program automatically searches over a number of different predictive models such as various tree models, support vector machines, and neural networks. For models such as neural networks, DMR automatically chooses good "candidate models" for further consideration. These computations can be time consuming and, hence, can be off-loaded to a server (from the desktop computer), where results can be picked up later, or even the next day. A large number of graphical displays are also available for reviewing how well the different models predict the targets of interest. Model building and selection remains mostly automated, however, so that subject experts in the domain of interest (e.g., engineers who serve in the Model Builder role), rather than statisticians or data mining professionals, can quickly and effectively build accurate predictive (e.g., neural network) models.
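A compressed sketch of such an automated model search, continuing the same Python example: several model families are fit to the same inputs. DMR's actual search considers more candidates and tunes them automatically; the scikit-learn estimators and settings below are illustrative.

```python
# Train several candidate model families on the selected inputs.
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

candidates = {
    "tree":       DecisionTreeClassifier(random_state=0),
    "svm":        SVC(random_state=0),
    "neural net": MLPClassifier(max_iter=1000, random_state=0),
}
for name, model in candidates.items():
    model.fit(train_df[selected], train_df["Default"])
```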

6. Model evaluation. By the time you reach this step, you have already built your predictive models. Like any tool, predictive models need to be tested on data that were not presented to them during training. This is similar to quality control, which is applied to items coming out of production lines to ensure they meet certain specifications and standards. Here the blind hold-out (validation) sample selected in the first step serves this purpose. The aim is to estimate how well your models will perform on future data during the later, and most important, stage of deployment. The ability to predict new data is known as generalization. If your models do not generalize well on the validation data, it is recommended that you investigate the conditions and settings under which they were built and try creating additional models that meet your needs.
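In the running Python sketch, evaluation amounts to scoring each candidate on the hold-out sample set aside in step 1; accuracy is one illustrative criterion, and DMR provides many more statistics and displays.

```python
# Estimate generalization by scoring each candidate on the blind hold-out
# sample that was never seen during training.
for name, model in candidates.items():
    score = model.score(holdout_df[selected], holdout_df["Default"])
    print(f"{name}: hold-out accuracy = {score:.3f}")
```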

7. Deployment. After building your data mining models with STATISTICA DMR, you can put your predictive models to "use" for predicting future or "new" data as needed. The process of using predictive models to predict data that were not used in training the model is known as deployment. Deployment is by far the most important stage of predictive modeling, and it is indeed the ultimate goal of the model builder. It is also the stage where your predictive models face the test of the real world. A good predictive model is one that predicts unseen data with the desired accuracy.
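In the Python sketch, deployment reduces to persisting the chosen model and reloading it later to score new cases; the choice of the "neural net" candidate and both file names are hypothetical.

```python
# Save the chosen model, then reload it later to score new, unseen cases.
import joblib

best_model = candidates["neural net"]          # hypothetical final choice
joblib.dump(best_model, "credit_model.joblib")

model = joblib.load("credit_model.joblib")
new_cases = pd.read_csv("new_loans.csv")       # hypothetical new applicants
predictions = model.predict(new_cases[selected])
```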

STATISTICA Data Miner Recipes provides a direct interface to STATISTICA Enterprise Server for "attaching" fully trained data mining models (data miner recipes) to data configurations for automated scoring of new data (e.g., new credit applications) in a Web-based solution, predicting expected outcomes (e.g., for a continuous manufacturing process) and tracking prediction residuals in standard QC charts.

See also, STATISTICA Data Miner Recipes Data Requirements.