Feature Selection and Variable Screening Introductory Overview


One of the common tasks in predictive data mining is to select predictors from a large list of candidates. For example, when data are collected via automated (computerized) methods, it is not uncommon that measurements are recorded for thousands or hundreds of thousands (or more) of predictors. Unfortunately, the standard analytic methods for predictive data mining, such as neural network analyses, classification and regression trees, generalized linear models, or general linear models become impractical when the number of predictors exceed more than a few hundred variables (even though most STATISTICA analysis methods can handle more than one thousand predictors).

The methods  implemented in the Feature Selection and Variable Screening (FSL) module are specifically designed to handle extremely large sets of continuous and/or categorical predictors, for regression or classification-type problems. You can select a subset of predictors from a large list of candidate predictors without assuming that the relationships between the predictors and the dependent or outcome variables of interest are linear, or even monotone. Therefore, this module can serve as an ideal pre-processor for predictive data mining, to select manageable sets of predictors that are likely related to the dependent (outcome) variables of interest, for further analyses with any of the other methods for regression and classification available in Statistica and Statistica Data Miner.

Limitations of Methods for Prediction and Classification

Most techniques for prediction of continuous variables or predictive classification of categorical variables from k predictors and N cases require (a) to store matrices of size at least k*(k+1)/2, i.e., lower-diagonal matrices, (b) or to process the data file iteratively (repeatedly) to optimize some function of the data; typically, this requires that the complete data set of size k*N is stored for fast access.  

For example, to perform the necessary computations for a simple linear regression analysis or linear discriminant function analysis, the program first needs to compute the matrix of all predictor variables. The memory requirements to hold the correlation matrix (in lower-diagonal form, i.e., of size k*(k*1)/2) quickly exceeds the limits of what can be handled on most hardware platforms. Second, to fit linear models, the program must invert the correlation (covariance) matrix, which numerically is a very difficult task when the input matrix is very large (e.g., k=2000; containing (2000*20001/2)=2,001,000 elements), because round-off errors will accumulate and finally "overwhelm" the computations to the point where the final results can be numerically quite unstable.

"Curse" of Dimensionality

The term curse of dimensionality (Bellman, 1961, Bishop, 1995) generally refers to the difficulties involved in fitting models, estimating parameters, or optimizing a function in many dimensions, usually in the context of neural networks. As the dimensionality of the input data space (i.e., the number of predictors) increases, it becomes exponentially more difficult to find global optima for the parameter space, i.e., to fit models. In practice, the complexity of neural networks becomes unmanageable when the number of inputs into the neural network exceeds a few hundreds or even less, depending on the complexity of the respective neural network architecture. Hence, it is simply a practical necessity to screen and select from among a large set of input (predictor) variables those that are of likely utility for predicting the outputs (dependent variables) of interest.

Computing Linear Statistics

One potential solution to the problem of selecting predictors from a very large set of candidate predictors is to compute simple correlations between each predictor and the dependent variable of interest. For regression-type problems (with a continuous dependent variable), the program could simple compute the standard correlation coefficients and then select from among the predictors those that show the highest correlation with the dependent variable. The problem with this approach is that the standard correlation coefficient specifically measures linear relationships; however, many relationships in real-world applications are nonlinear, and many (most) of the data mining specific algorithms for predictive data mining, in fact, do not assume a linear or even monotone relationship between the predictors and the dependent variable of interest (e.g., Statistica Automated Neural Networks, General Classification and Regression Trees (GC&RT), General CHAID Models, Generalized Linear/Nonlinear Models, Generalized Additive Models (GAM), etc.). Therefore, employing this strategy would bias the results of any further analyses, because the screening process itself would favor particular models and results (linear models, and models of monotone relationships) and hence "preordain" certain results.

Variable Screening in Statistica Feature Selection and Variable Screening

The solution to feature selection in this module does not assume any particular type or shape of relationship between the predictors and the dependent variables (classes) of interest. Instead, the program will apply a generalized "notion of relationship" while screening the predictors, one by one, for regression or classification problems. Therefore, the predictor list selected in this manner can be further submitted to linear or nonlinear regression or classification algorithms, allowing you to employ a two-stage model building procedure for problems with hundreds of thousands or even more than one million potential predictors. The method used in Statistica Feature Selection and Variable Screening (FSL) is optimized for extremely large data sets and in-place processing of large databases, and will generally only require two passes through the data. The algorithm is further elaborated in Computational Details.

Capitalizing on Chance

As a final comment, it should be noted that the procedures implemented in this module should not be used with or, rather, not be "mixed in with" traditional statistical methods for hypothesis testing. Specifically, it would simply be wrong and "cheating" if one were to select using these methods 10 predictors from 10,000 possible predictors, then fit some model, and apply statistical significance testing to the results that are then interpreted as if they were predicted all along (a-priori). Obviously, the statistical significance levels in this scenario could not be interpreted as the usual alpha-error rates for testing of a-priori hypotheses (e.g., probability of obtaining relationships between predictors and the dependent variable as strong or stronger, when in fact in the population no relationships exists). Nevertheless, and as discussed earlier in this topic, it is simply a practical necessity in many data mining tasks to screen for and extract likely predictors from a very large list of candidate predictors. So the predictors chosen and results obtained via the Statistica Feature Selection and Variable Screening (FSL) module should be interpreted in most cases as heuristic solutions that can be extremely useful for identifying groups of variables that might deserve further scrutiny (and validation) for predictive data mining and model building.