Using C/C++, C#, and Java Code for Deployment

Practically all Statistica modules will generate C/C++, C#, and Java code for trained models; this code can be incorporated into compiled programs to compute predictions or predicted classifications for new observations. Writing and debugging compiled programs requires some experience and experimentation to make sure that all information is passed and processed as expected. As a general initial recommendation, we strongly urge you to verify the predictions computed by your compiled programs by comparing them to those computed, for example, by the Rapid Deployment of Predictive Models module (which computes predictions or predicted classifications from PMML-based deployment files).
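The comparison recommended above can be automated with a small helper. The following is only a sketch: the function name count_mismatches, the tolerance parameter, and the idea of holding the reference predictions in an array are assumptions made for illustration and are not part of any generated code. It counts the cases for which the prediction from the compiled program differs from the reference prediction by more than a chosen tolerance.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not generated by Statistica): counts how many of the
 * n predictions in `computed` differ from the matching value in `reference`
 * by more than `tol`. The reference values are assumed to have been obtained
 * separately, for example from the Rapid Deployment of Predictive Models
 * module. */
size_t count_mismatches(const double *computed, const double *reference,
                        size_t n, double tol)
{
    size_t i;
    size_t bad = 0;

    assert(computed != NULL && reference != NULL && tol >= 0.0);

    for (i = 0; i < n; ++i) {
        double diff = computed[i] - reference[i];
        if (diff < 0.0) {
            diff = -diff;               /* absolute difference */
        }
        if (!(diff <= tol)) {           /* written this way so NaN counts as a mismatch */
            ++bad;
        }
    }
    return bad;
}
```

A return value of 0 means that every prediction agreed within the tolerance. Any other value indicates that the way inputs are passed to the deployment function should be reviewed.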

Here are a few general recommendations and caveats to consider when debugging such C/C++ programs that incorporate code generated from Statistica Data Miner modules for predictive data mining.

1. In general, for deployment code generated from any module, the deployment function expects as input the data for one observation or case. Inside that function, the value for each predictor should be placed in the same sequential position of the input array as the corresponding variable occupied in the input data file. Also, location (variable) 0 (zero) is reserved for case numbers (e.g., for time-dependent models), so the generated programs expect variable locations to be numbered starting at 1, instead of the customary (in C) 0 (zero).

For example, suppose you have a simple linear model with 2 predictors, which in the input data file are variables 5 and 7. The deployment function will expect the input values for those predictors to be located at elements x[5] and x[7] of the input array x. Because the first element of array x is x[0], the deployed C code expects the predictors to be the 6th and 8th elements of array x.
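The indexing convention in this example can be sketched in C as follows. The function predict_linear is a hypothetical stand-in for a generated deployment function, and its coefficients (1.0, 2.0, and -0.5) are invented for illustration; actual generated code will have its own function name, signature, and coefficients. The helper score_case is likewise an assumption and shows only how the input array might be filled.

```c
#include <assert.h>
#include <stddef.h>

/* The input array needs slots 0..7 so that variables 5 and 7 can be stored
 * at x[5] and x[7]; slot 0 is reserved for the case number. */
enum { N_INPUT = 8 };

/* Hypothetical stand-in for a generated deployment function: a linear model
 * with made-up coefficients that reads its two predictors from x[5] and x[7]. */
double predict_linear(const double *x)
{
    assert(x != NULL);
    return 1.0 + 2.0 * x[5] - 0.5 * x[7];
}

/* Hypothetical caller: places each value at the position its variable had in
 * the input data file, then scores the single case. */
double score_case(double case_number, double var5, double var7)
{
    double x[N_INPUT] = {0.0};  /* positions not used by the model stay 0 */

    x[0] = case_number;         /* location 0: case number */
    x[5] = var5;                /* variable 5 of the input data file */
    x[7] = var7;                /* variable 7 of the input data file */
    return predict_linear(x);
}
```

With these made-up coefficients, score_case(1.0, 3.0, 4.0) evaluates 1.0 + 2.0 * 3.0 - 0.5 * 4.0, which is 5.0. Storing the two predictors at x[0] and x[1] instead would compile without error but produce a wrong prediction, which is why the comparison against reference predictions is worthwhile.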

2. Categorical predictors require special care: The Statistica program generally operates on numeric codes, even when the input variables are of type TEXT. In most cases, using deployed C/C++, C#, and Java code is straightforward. Simply use the same codes that were used in the data from which the respective model was estimated. So, for example, if in the input data Male is coded 1, and Female is coded 3, then use these values (1 and 3) in the input array for the deployment function to compute predicted values or classifications.

Categorical variables of type TEXT are treated in the same manner as numeric variables with text values (codes). Usually, however, the text-to-numeric-code translation is done on the fly during the analyses; i.e., numeric codes are assigned to text values as they are processed. There are two ways to determine how the program made these assignments. First, you can carefully review the results of the analyses that generated the code and examine in any of the results spreadsheets how the text values are coded (e.g., on the spreadsheet of predicted classifications, double-click the column header for the predictions and review the numeric values associated with the displayed text labels). Second, you can carefully review the generated C/C++, C#, or Java code itself, which contains the codes (usually in an obvious location) used to process and correctly classify each input case.
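Once the code assignments are known, one way to keep them consistent in a calling program is a small lookup table that translates text labels into the numeric codes before the input array is filled. The sketch below uses the Male = 1, Female = 3 coding from the example above; the table GENDER_CODES and the function lookup_code are hypothetical names and are not part of the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One text label and the numeric code used for it when the model was estimated. */
struct label_code {
    const char *label;
    double code;
};

/* Hypothetical table; the codes must be copied from the analysis results or
 * from the generated code, never guessed. */
static const struct label_code GENDER_CODES[] = {
    { "Male",   1.0 },
    { "Female", 3.0 },
};

/* Looks up `label` and stores its numeric code in *code_out.
 * Returns 1 if the label was found and 0 otherwise, so that unknown labels
 * are reported to the caller instead of being scored with an arbitrary code. */
int lookup_code(const char *label, double *code_out)
{
    size_t i;

    assert(code_out != NULL);
    if (label == NULL) {
        return 0;
    }
    for (i = 0; i < sizeof GENDER_CODES / sizeof GENDER_CODES[0]; ++i) {
        if (strcmp(label, GENDER_CODES[i].label) == 0) {
            *code_out = GENDER_CODES[i].code;
            return 1;
        }
    }
    return 0;
}
```

The comparison is case-sensitive, so a label such as "female" would not be found. Whether that is appropriate depends on how the labels appear in the original data.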

 

See also: Data Mining Definition, Data Mining with Statistica Data Miner, Structure and User Interface of Statistica Data Miner, Statistica Data Miner Summary, Getting Started with Statistica Data Miner, and Statistica Automated Neural Networks (SANN) - Neural Networks: Overview.