Rapid Deployment of Predictive Models Overview

The Statistica Rapid Deployment of Predictive Models module will quickly generate predictions from one or more previously trained models based on information stored in industry-standard PMML (Predictive Model Markup Language) deployment code. This information can optionally be written into the current input data file (or database if the current input data is a query into an external database via Streaming Database Connector) for subsequent analyses involving other variables in the current input data file or data warehouse. PMML is an XML-based language for encoding information (results) from data mining projects. The Rapid Deployment of Predictive Models module is particularly well suited for generating predictions for a large number of observations (cases) because it passes (reads) through the data once, storing only the data for a single observation at a time.
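The single-pass behavior described above can be illustrated with a minimal Python sketch. This is not Statistica code: `score_stream` and `toy_model` are hypothetical names, and the toy model is an arbitrary stand-in for a model loaded from PMML. The sketch only shows the idea of reading one observation, scoring it, and moving on.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator


def score_stream(rows: Iterable[Dict[str, str]],
                 model: Callable[[Dict[str, str]], float]) -> Iterator[Dict[str, object]]:
    """Yield each input row with a prediction attached, one row at a time."""
    for row in rows:
        # Only the current observation is held in memory at this point.
        yield {**row, "PREDICTED": model(row)}


def toy_model(row: Dict[str, str]) -> float:
    # Hypothetical stand-in for a trained model read from a PMML file.
    return 2.0 * float(row["X1"]) + 0.5 * float(row["X2"])


# A small in-memory "file"; a real input could be arbitrarily large.
data = io.StringIO("X1,X2\n1,2\n3,4\n")

# Predictions are collected here only so the tiny example can be inspected.
predictions = []
for scored_row in score_stream(csv.DictReader(data), toy_model):
    predictions.append(scored_row["PREDICTED"])
```

Because `score_stream` is a generator, memory use does not grow with the number of cases unless the caller chooses to keep the results.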

Rapid deployment can ingest and produce predictions for two different types of PMML models.

a. Tree models that contain surrogate information. A sample surrogate tag in PMML is shown below. This PMML is generated by the Interactive Trees (ITrees), General C&RT, and General CHAID modules.

<CompoundPredicate booleanOperator="surrogate">
   <SimplePredicate field="MEASURE02" operator="lessOrEqual" value="1.50000000000000e+000"/>
   <SimplePredicate field="MEASURE01" operator="lessOrEqual" value="5.00000000000000e-001"/>
   <SimplePredicate field="GENDER" operator="equal" value="FEMALE"/>
   <False/>
</CompoundPredicate>

b. Scaled Logistic Regression models. These models are generated by the GLZ module and return a score value for each case. No residual column is computed for such models.
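To make the surrogate tag in item a more concrete, the following Python sketch evaluates the sample fragment for a single case. It follows the general PMML rule that the first child predicate is the primary split and each later child is used only when the earlier ones evaluate to unknown because a value is missing. The helper names `eval_simple` and `eval_surrogate` are hypothetical, only the two operators that appear in the sample are handled, and this is not a description of how Statistica implements the evaluation.

```python
import xml.etree.ElementTree as ET

# The sample fragment from above. A full PMML file would normally declare an
# XML namespace, which this sketch does not handle.
PMML_FRAGMENT = """
<CompoundPredicate booleanOperator="surrogate">
  <SimplePredicate field="MEASURE02" operator="lessOrEqual" value="1.50000000000000e+000"/>
  <SimplePredicate field="MEASURE01" operator="lessOrEqual" value="5.00000000000000e-001"/>
  <SimplePredicate field="GENDER" operator="equal" value="FEMALE"/>
  <False/>
</CompoundPredicate>
"""


def eval_simple(pred, case):
    """Return True or False, or None when the field is missing (unknown)."""
    value = case.get(pred.get("field"))
    if value is None:
        return None
    op, ref = pred.get("operator"), pred.get("value")
    if op == "lessOrEqual":
        return float(value) <= float(ref)
    if op == "equal":
        return str(value) == ref
    raise ValueError(f"operator not handled in this sketch: {op}")


def eval_surrogate(compound, case):
    """Use the first child predicate that does not evaluate to unknown."""
    for child in compound:
        if child.tag == "True":
            return True
        if child.tag == "False":
            return False
        result = eval_simple(child, case)
        if result is not None:
            return result
    return None


node = ET.fromstring(PMML_FRAGMENT.strip())

primary_used = eval_surrogate(node, {"MEASURE02": 1.0})    # primary split applies
surrogate_used = eval_surrogate(node, {"MEASURE01": 0.1})  # MEASURE02 is missing
fallback_used = eval_surrogate(node, {})                   # every field is missing
```

In this reading, the trailing `<False/>` element acts as the final fallback when neither the primary split nor any surrogate can be evaluated.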

Deploying multiple models

The Rapid Deployment of Predictive Models module can evaluate multiple models simultaneously, and will generate results to compare the predictions from different models. You can also save the data for further processing, along with other variables in the current input data file. This capability is extremely useful when performing detailed analyses of the predictive power of different models.

Rapid Deployment can generate predictions for multiple models in either of the following configurations:

One dependent variable Y as a function of X, Y=f(x), where all loaded models have the same Y and the same list of X.

All models have the same Y but may have different lists of X; for example, you can load and score two models Y=f(x1, x2) and Y=f(x3, x4) together.
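The second configuration can be sketched in Python as follows. The `LoadedModel` class and `score_together` function are hypothetical illustrations, and the two lambda functions are arbitrary stand-ins for trained models. The sketch checks that all models share one dependent variable and then hands each model only its own predictors.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class LoadedModel:
    name: str
    target: str                 # dependent variable Y
    inputs: Sequence[str]       # list of predictors X used by this model
    predict: Callable[[Dict[str, float]], float]


def score_together(models: List[LoadedModel], case: Dict[str, float]) -> Dict[str, float]:
    """Score one case with several models that share the same Y."""
    targets = {m.target for m in models}
    if len(targets) != 1:
        raise ValueError(f"all models must share one dependent variable, got {sorted(targets)}")
    results = {}
    for m in models:
        # Each model sees only the predictors it was trained on.
        results[m.name] = m.predict({k: case[k] for k in m.inputs})
    return results


# Two toy models, Y=f(x1, x2) and Y=f(x3, x4), scored on the same case.
model_a = LoadedModel("A", "Y", ["x1", "x2"], lambda r: r["x1"] + r["x2"])
model_b = LoadedModel("B", "Y", ["x3", "x4"], lambda r: r["x3"] * r["x4"])
case = {"x1": 1.0, "x2": 2.0, "x3": 3.0, "x4": 4.0}
both = score_together([model_a, model_b], case)
```

Keeping the predictions side by side, keyed by model name, is what makes a per-case comparison of the models straightforward.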

Writing Statistics to an External Database

With the Rapid Deployment of Predictive Models module, you can write computed statistics (predictions, predicted classifications, classification probabilities, residuals) back into the current input data file. This option, on the Rapid Deployment of Predictive Models dialog box - Quick tab, is available for Statistica Spreadsheets as well as for external databases connected via Streaming Database Connector. The ability to merge, for example, classification probabilities computed by various models into an existing database or data warehouse is particularly useful in data mining applications that deploy models to very large data sets (e.g., to compute the probabilities that particular customers in a large customer database will purchase from a mail-order catalogue). Because processing large data sets in (remote) external databases via Streaming Database Connector is efficient (e.g., it requires very little memory on the computer running the Rapid Deployment of Predictive Models module), this method of deploying fully trained models scales easily to very large data sets.

Configuring the Streaming Database Connector for writing. To write computed statistics for observations back into the database, the Streaming Database Connector must be properly configured (e.g., for read/write access in the Query Options dialog box). Also, the database fields (variables) to which you want to write must already exist in the database and must be of the correct type (e.g., you cannot write numeric information into data fields of type Text). To learn more about the options to configure the connection, refer to Streaming Database Connector Technology and the Query Options dialog box.
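The two requirements above (the target field must already exist, and it must have a suitable type) can be illustrated with a small Python sketch. It uses the standard-library `sqlite3` module as a stand-in database, not the Streaming Database Connector. The `write_predictions` function, the `customers` table, and the `p_purchase` column are hypothetical names chosen for the example.

```python
import sqlite3

NUMERIC_TYPES = ("REAL", "FLOAT", "DOUBLE", "NUMERIC")


def write_predictions(conn, table, key_col, target_col, predictions):
    """Write predictions into an existing numeric column of `table`.

    Table and column names are interpolated directly, so this sketch
    assumes they come from a trusted source.
    """
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    columns = {row[1]: row[2].upper() for row in conn.execute(f"PRAGMA table_info({table})")}
    if target_col not in columns:
        raise ValueError(f"column {target_col!r} must already exist in {table!r}")
    if columns[target_col] not in NUMERIC_TYPES:
        raise TypeError(f"column {target_col!r} has type {columns[target_col]}, not numeric")
    conn.executemany(
        f"UPDATE {table} SET {target_col} = ? WHERE {key_col} = ?",
        [(value, key) for key, value in predictions.items()],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, segment TEXT, p_purchase REAL)")
conn.executemany("INSERT INTO customers (id, segment) VALUES (?, ?)", [(1, "A"), (2, "B")])

write_predictions(conn, "customers", "id", "p_purchase", {1: 0.82, 2: 0.17})
rows = conn.execute("SELECT id, p_purchase FROM customers ORDER BY id").fetchall()
```

Checking the schema before writing surfaces a missing or text-typed field as an explicit error rather than a failed or silently coerced write.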

Analysis Modules (Models) that Generate PMML Code

The following analytic modules for predictive data mining will generate deployment code in PMML, and are therefore compatible with the Rapid Deployment of Predictive Models module:

Linear Least-Squares Models

Multiple Regression

General Linear Models (GLM)

General Regression Models (GRM)

General Discriminant Function Analysis (GDA)

Nonlinear Models

Generalized Linear/Nonlinear (GLZ) Models

Multivariate Adaptive Regression Splines (MARSplines)

Tree Models

General Classification and Regression Trees (GC&RT)

General CHAID and Exhaustive CHAID (GCHAID)

Interactive Trees (C&RT, CHAID)

Boosted Tree Classifiers and Regression

Random Forests for Regression and Classification

Clustering (Unsupervised Learning and Predictive Classification)

Cluster Analysis (Generalized EM, K-means & Tree)

Neural Networks

Neural network models can be saved in PMML format and evaluated by the Rapid Deployment of Predictive Models module if the respective model or ensemble of models predicts only a single continuous or categorical dependent (outcome) variable. To simultaneously predict multiple continuous and/or categorical outcomes, use the respective features for applying fully trained networks in Statistica Automated Neural Networks (SANN) (see also the deployment of models in Statistica Automated Neural Networks).

PMML Extensions

Even though the PMML standard is a promising development to bring cross-platform and cross-application compatibility to data mining, it currently can accommodate only fairly simple implementations of the methods that are defined. Therefore, in most cases, special extensions had to be added to the standard in order to allow users to take advantage of the advanced implementations of the respective methods available in Statistica.

Program Overview

The Statistica Rapid Deployment of Predictive Models module will read multiple PMML files to compute predicted values or classes from trained models. This information can optionally be written into the current input data file (or database if the current input data is a query into an external database via Streaming Database Connector) for subsequent analyses involving other variables in the current input data file or data warehouse. PMML code can be generated by practically all modules for predictive data mining available in Statistica, including the clustering methods (EM, K-means & Tree) available in the Cluster Analysis (Generalized EM, K-Means & Tree) module. When applicable, the program will compute predicted values, quality of fit indices (when observed values are provided), and simple or overlaid lift and gains charts for binomial or multinomial classification problems.
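As an illustration of what a gains and lift chart summarizes, the following Python sketch computes cumulative gain and lift for a binary classification problem. It uses the common textbook definitions and is not taken from the Statistica implementation. The function name `gains_and_lift`, the equal-sized bins, and the toy data are assumptions, and the sketch assumes at least one positive case and at least as many cases as bins.

```python
from typing import List, Sequence, Tuple


def gains_and_lift(y_true: Sequence[int], scores: Sequence[float],
                   bins: int = 4) -> List[Tuple[float, float, float]]:
    """Return (fraction of cases, cumulative gain, cumulative lift) per bin."""
    # Rank cases from the highest to the lowest predicted probability.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(y_true)
    total_pos = sum(y_true)
    table = []
    for b in range(1, bins + 1):
        cutoff = round(n * b / bins)
        captured = sum(y_true[i] for i in order[:cutoff])
        frac = cutoff / n
        gain = captured / total_pos  # share of all positives found so far
        table.append((frac, gain, gain / frac))  # lift is relative to random selection
    return table


# Toy data: observed classes (1 = positive) and predicted probabilities.
observed = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
chart = gains_and_lift(observed, predicted)
```

Overlaying the tables produced for several models on one plot gives a direct visual comparison of how quickly each model captures the positive cases.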