STATISTICA Data Miner Recipes Data Requirements

Continuous and Categorical Inputs

The model-building techniques in STATISTICA Data Miner Recipes (DMR) work with continuous as well as categorical inputs. For categorical inputs, the program will sometimes (but not always, depending on the data mining algorithm) internally create "dummy variables" with 0/1 indicator codes for each individual class or category found in the input.
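DMR performs this recoding internally, but as a rough sketch of what 0/1 indicator coding looks like, the following Python snippet (using pandas; the "loan_type" column is a hypothetical example, not a DMR variable) produces one indicator column per category:

```python
import pandas as pd

# Hypothetical input with one categorical predictor.
data = pd.DataFrame({"loan_type": ["auto", "mortgage", "auto", "personal"]})

# One 0/1 indicator ("dummy") column per category found in the input.
dummies = pd.get_dummies(data["loan_type"], prefix="loan_type", dtype=int)
print(dummies)
```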

Large numbers of categories and partitioning into training and testing data. The effectiveness of DMR can deteriorate when one or more inputs are categorical in nature and contain many categories. For example, in applications with credit-scoring data that contain many categorical predictors (e.g., home town or ZIP code, types of previous loans issued, etc.), careful data pre-processing has been found to greatly enhance the effectiveness of the DMR methodology. Otherwise, the predictive model may end up with too many input variables, which can degrade performance through the curse of dimensionality (see below).

This issue is compounded when a validation or hold-out sample is specified during model building. For example, it can easily happen that particular categories are not (randomly) selected into the sample used for model building but instead end up entirely in the hold-out sample. Obviously, when a category is not represented during model building, the respective algorithm cannot "learn from it"; hence, the model cannot make predictions for cases with those missing categories.
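One simple safeguard, sketched below in Python with pandas (the column name, sample data, and helper function are hypothetical), is to verify after partitioning that every category appearing in the hold-out sample also appears in the training sample:

```python
import pandas as pd

def unseen_categories(train: pd.DataFrame, holdout: pd.DataFrame, column: str) -> set:
    """Return categories present in the hold-out sample but absent from training."""
    return set(holdout[column].dropna().unique()) - set(train[column].dropna().unique())

# Hypothetical partition of a credit-scoring data set.
train = pd.DataFrame({"zip_code": ["10001", "60601", "94105"]})
holdout = pd.DataFrame({"zip_code": ["10001", "73301"]})

missing = unseen_categories(train, holdout, "zip_code")
if missing:
    print(f"Categories never seen during model building: {missing}")
```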

If the issue of categorical inputs with too many categories does come up in your application (which may well be the case with financial data), we recommend that you use the various methods available in STATISTICA Data Miner to carefully pre-process your data, e.g., to combine categories while maximizing their relationship to the target(s) of interest. Refer to the STATISTICA Data Miner methods and documentation for more details.
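STATISTICA Data Miner provides its own tools for this kind of pre-processing. Purely as an illustration of the general idea, the sketch below uses a simpler, frequency-based rule (not the target-aware combining described above; names and thresholds are hypothetical) to pool all rarely observed ZIP codes into a single "OTHER" category before model building:

```python
import pandas as pd

def pool_rare_categories(series: pd.Series, min_count: int = 30, other_label: str = "OTHER") -> pd.Series:
    """Replace categories observed fewer than min_count times with a single pooled label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other_label)

# Hypothetical credit-scoring predictor with many sparse ZIP codes.
zip_codes = pd.Series(["10001"] * 50 + ["60601"] * 40 + ["73301", "94105", "02139"])
print(pool_rare_categories(zip_codes, min_count=30).value_counts())
```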

Curse of dimensionality. The inclusion of an input in a predictive modeling algorithm, such as a neural network, adds another dimension to the space in which the data cases reside. The more inputs a neural network has, the more data points are needed to train the network effectively so it can capture the underlying structure in the data (i.e., to model the relationship between the inputs and the target variables). Thus, with the addition of every input, the number of data points needed to train the network grows rapidly. This is known as the curse of dimensionality (Bishop, 1995).
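A back-of-the-envelope way to see this: if each input were discretized into just 10 intervals, the number of cells the training data would need to populate grows as 10 to the power of the number of inputs. The short sketch below illustrates the growth (the figure of roughly 10 cases per cell is an illustrative assumption, not a DMR rule):

```python
# Illustrative only: cells created when each of d inputs is split into 10 intervals,
# with a hypothetical target of ~10 training cases per cell for dense coverage.
for d in (1, 2, 5, 10):
    cells = 10 ** d
    print(f"{d:2d} inputs -> {cells:>14,} cells -> ~{10 * cells:,} cases for dense coverage")
```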

In general, whenever possible the data analyst should perform an initial "common-sense" screening of (large numbers of) inputs and not include obviously unnecessary ones. In addition, one may try to identify those inputs that carry little information regarding the prediction problem and eliminate them from the analysis. Although this may lead to some loss of information in the data, it can considerably reduce the curse of dimensionality and thereby significantly improve the performance of the neural network.

STATISTICA DMR employs two methods for combating the curse of dimensionality. In the data redundancy stage (for more information on DMR stages, see What is STATISTICA Data Miner Recipes (DMR)?), DMR identifies variable pairs that contain similar information (i.e., correlated variables) and eliminates one variable of each such pair, thereby reducing the dimensionality of the problem. Further dimensionality reduction is applied in the dimension reduction step, where DMR can use a tree-based algorithm (as well as the much faster single-pass predictor screening algorithm) to identify and eliminate variables that contain little or no relevant information about the target variables.
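DMR's redundancy stage is configured within the recipe itself, but the underlying idea can be sketched in Python with pandas (the 0.95 correlation threshold and the function name are illustrative assumptions, not DMR defaults): for every pair of continuous inputs whose absolute correlation exceeds the threshold, drop one member of the pair.

```python
import pandas as pd

def drop_redundant(inputs: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one variable from every pair of continuous inputs correlated above threshold."""
    corr = inputs.corr().abs()
    to_drop = set()
    cols = corr.columns
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a not in to_drop and b not in to_drop and corr.loc[a, b] > threshold:
                to_drop.add(b)  # keep the first variable, drop its near-duplicate
    return inputs.drop(columns=sorted(to_drop))
```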

Measuring the Wrong Inputs

What if you cannot find any valid and accurate predictive models using STATISTICA DMR (e.g., in the model building step, the correlations between observed and predicted target values for the hold-out or validation sample are less than about .4)? You should first continue to try to find "good models" using different settings in the model building step.
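The .4 figure refers to the ordinary (Pearson) correlation between observed and predicted target values in the hold-out sample. A minimal check in Python (with hypothetical observed and predicted values) looks like this:

```python
import numpy as np

# Hypothetical observed and predicted target values for the hold-out sample.
observed = np.array([4.1, 5.0, 3.2, 6.8, 5.5])
predicted = np.array([4.4, 4.7, 3.9, 6.1, 5.8])

r = np.corrcoef(observed, predicted)[0, 1]
print(f"Hold-out correlation: {r:.2f}")
if r < 0.4:
    print("Weak predictive model; try different model-building settings.")
```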

If you continue to fail to find "good models," the inevitable conclusion is one of the following:

(1) There is a large amount of noise on the target variables, which masks the real signal (i.e., a small signal-to-noise ratio).

(2) There is no strong relationship between the inputs and the target variables.

In the first case, you should refine the data set by collecting more accurate measurements. In the second case, you should ask: why are you measuring these inputs in the first place? If you cannot find accurate models for the given inputs and targets even though you have applied the most general and advanced neural network methods for building predictive models, then the inputs are not relevant for the targets. If the target variable of interest is related to important outcomes for your business (e.g., product quality, credit risk, etc.), then it follows that none of the parameters (i.e., inputs) currently measured and stored to describe the process are relevant to the quality of its outputs.

This outcome may occur and is in itself an interesting result. If the inputs (process parameters) are not relevant to the outcomes (targets), then why measure (or buy information on) them in the first place? Not measuring the right inputs is a serious problem since it means that you are not "looking at the right things" in order to improve quality.

When the STATISTICA DMR model building step fails to produce accurate models, ask yourself whether the current measurement system used to describe, "track," and predict your processes is useful. Of course, there is always the possibility that the predictive data mining algorithms fail to detect extremely complex relationships connecting the inputs to the targets. However, in our experience, if you fail to find a good predictive model after letting DMR search through a substantial number of potential models and methods of high complexity, there is a fair chance that the measurement and data collection system should be reconsidered.

Data Preparation: Missing Data, Outliers, Overall Data Quality

Another issue that may foil the successful implementation of STATISTICA DMR is that the data available for modeling are "buggy" and not "reliable." There are simple and clear-cut recommendations to remedy poor-quality data. Again, if the quality of the data is so poor that it cannot be used for model building, then it is likely that it also cannot be used for reporting purposes or to satisfy requirements for regulatory compliance (e.g., FDA 21 CFR Part 11 requirements).

In data analysis generally, data preparation is the most important activity for ensuring success.

Missing data, outliers. The DMR model-building tools acknowledge that missing data and "bad measurements" (outliers) typically occur in real data and provide options for dealing with these issues. Specifically, cases with missing data or outliers can have those values replaced by means (for the respective continuous inputs), or they can be removed from the data used for model building. Both of these methods generally work well as long as no more than 10% to 15% of the original data are "modified" in this manner. If more than 15% of the data cases available for model building have missing or bad values (outliers), then you should explore the differences between the cases (observations) that have such values and those that do not. The STATISTICA analytic platform has a large number of general and graphical options to visualize and review whether the cases with missing data or outliers differ (with respect to the target variables of interest or other inputs) from those cases (observations) that have complete and "good" measurements for all variables. For example, if you find that when a particular input has missing data, the quality (target values) of the respective cases is generally better than that of cases where the input is recorded (not missing), then this finding clearly deserves further scrutiny.
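As a rough sketch of these checks in Python (pandas; the column names, toy data, and the two-standard-deviation outlier rule are illustrative assumptions, not DMR's internal defaults):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one continuous input and a continuous target.
df = pd.DataFrame({
    "pressure": [1.2, 1.3, np.nan, 1.1, 9.9, 1.2, np.nan, 1.4],
    "quality":  [92,  93,  97,     91,  90,  92,  98,     93],
})

# Flag missing values and crude outliers (more than 2 standard deviations from the
# mean -- an illustrative threshold only).
z = (df["pressure"] - df["pressure"].mean()) / df["pressure"].std()
bad = df["pressure"].isna() | (z.abs() > 2)
print(f"Fraction of cases flagged: {bad.mean():.0%}")  # compare against the 10%-15% guideline

# Option 1: replace flagged values of the continuous input by the mean of the clean values.
df["pressure_imputed"] = df["pressure"].mask(bad, df["pressure"][~bad].mean())

# Does the target behave differently when the input is missing or bad? If so, scrutinize.
print(df.groupby(bad)["quality"].mean())
```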

Analyzing Very Large Data Sets and Sampling

In some applications, very large data sets are available for model building. In general, this is a good situation (better than too little data) in that plenty of raw data are available for STATISTICA DMR to use in building accurate and useful predictive models. However, the larger the data set, the more processing time is required to build (train) the models. This is especially true of support vector machines and tree algorithms, whose models can grow in size with the size of the data set.

Why sub-sampling of data is useful. The fact that the data for model building in DMR need to be held in the computer's memory does not, however, pose a real problem. In general, it is good practice to sub-sample the cases from a large data set available for modeling and to set aside multiple hold-out samples for validating the predictive model.

At first, this may not be clear or intuitive. The accuracy and representativeness of statistical estimates from samples depend only on the (reasonable) sample size, not on the size of the population from which the sample was drawn. For example, if you take a sample of 1,000 cases from a large data set and compute the mean of a given variable, you usually get excellent precision (i.e., the confidence bounds for the mean will be very narrow) regardless of how large the complete data set is. In other words, the accuracy of your models will usually be just as good when you take a reasonable subset of the (large) data available for modeling. In addition, doing so makes the remaining cases available for validation. If, after model building, the accuracy of the final models is found to be very good in the data that were not used for building them, then you can generally be more confident that the models have good predictive validity.
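The point about precision can be illustrated numerically: the half-width of the confidence interval for a mean is governed by the sample size n (roughly 1.96 times s divided by the square root of n), not by how many cases remain in the full file. A small sketch, assuming a hypothetical data file held in a NumPy array:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50.0, scale=10.0, size=5_000_000)  # stand-in for a huge data file

sample = rng.choice(population, size=1_000, replace=False)     # sub-sample for model building
se = sample.std(ddof=1) / np.sqrt(sample.size)
print(f"Sample mean {sample.mean():.2f} +/- {1.96 * se:.2f} (95% confidence half-width)")
# The half-width (~0.6 here) is governed by the 1,000-case sample, not by the 5,000,000 cases.
```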

These issues are discussed in greater detail in the STATISTICA Data Miner documentation (see Using STATISTICA Data Miner with Extremely Large Data Sets). In short, by building models on all the cases of a huge data set rather than sub-sampling, you actually "throw away" useful information, because you "use up" opportunities to evaluate the predictive accuracy of the models in several hold-out (validation) samples.

How many cases (observations) should you have for DMR modeling? That question is difficult to answer. In general, the number of cases should be a multiple of the number of inputs used for model building (e.g., 10 to 100 times as many cases as variables is sometimes used as a rule of thumb). However, the number of inputs available for building predictive models may itself be very large (see below), in which case even fewer cases can still yield accurate models. When practical, this rule of thumb can be used as a general guideline as long as the overall file size (submitted to, for example, neural networks) stays within reasonable limits (e.g., less than 100,000 data points overall). Note that STATISTICA DMR is not limited to any particular file size; even very large data sets can be analyzed, although long processing times should be expected in that case.
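A quick sanity check of these rules of thumb might look like the sketch below (the function and its messages are hypothetical; the 10x multiplier and the 100,000-data-point ceiling are taken from the guideline above):

```python
def check_size(n_cases: int, n_inputs: int, min_ratio: int = 10, max_points: int = 100_000) -> None:
    """Apply the rough guidelines: cases >= min_ratio * inputs, and cases * inputs within limits."""
    if n_cases < min_ratio * n_inputs:
        print(f"Only {n_cases / n_inputs:.1f} cases per input; consider dimension reduction first.")
    if n_cases * n_inputs > max_points:
        print(f"{n_cases * n_inputs:,} data points overall; expect long processing times or sub-sample.")

check_size(n_cases=2_000, n_inputs=150)
```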

Large numbers of inputs. Another issue is how to deal with large numbers of inputs (variables). In general, the DMR redundancy and dimension reduction steps are very effective for extracting the relevant diagnostic inputs from a large number of candidates. However, in extreme cases (e.g., when there are thousands of inputs available, or a few categorical inputs with thousands of categories), it may be advisable to pre-process the list of inputs before applying DMR and identify a subset of inputs that are likely to be useful (diagnostic) for model building. Again, the STATISTICA Data Miner documentation explains various useful approaches to this issue (see, for example, the section on feature extraction and feature selection in the topic Using STATISTICA Data Miner with Extremely Large Data Sets).
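DMR's predictor screening and STATISTICA Data Miner's feature selection tools are the recommended route. As a bare-bones illustration of the single-pass idea for continuous inputs and a continuous target (Python with pandas; the function name and the choice of k are hypothetical), one can rank inputs by their absolute correlation with the target and keep only the top k before running DMR:

```python
import pandas as pd

def screen_predictors(inputs: pd.DataFrame, target: pd.Series, k: int = 50) -> list:
    """Rank continuous inputs by absolute correlation with the target and keep the top k."""
    scores = inputs.corrwith(target).abs().sort_values(ascending=False)
    return scores.head(k).index.tolist()

# Usage sketch: selected = screen_predictors(X, y, k=50); X_reduced = X[selected]
```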

NOTE. There is a relationship between the number of inputs and the number of data cases needed to build models reliably: the more inputs there are, the more data cases are needed. This is known as the curse of dimensionality. Thus, it is beneficial to apply the data redundancy and dimension reduction steps of DMR before actually building models. That way, you can build more reliable models with fewer data cases.

Predictive Models in DMR

STATISTICA Data Miner Recipes uses various predictive models to relate the target variables to the inputs. The use of several advanced analytic models makes DMR model building an effective tool that you can use in one stage to make collective predictions. Among these models, DMR uses Support Vector Machines (SVM), Classification and Regression Trees (C&RT), Random Forests, Boosting Trees, and Neural Networks. For an overview of these tools, please consult the STATISTICA Data Miner Manual.