Data Mining - Feature Selection - Feature Selection and Variable Screening

Ribbon bar. Select the Data Mining tab. In the Tools group, click Feature Selection. On the menu, select Feature Selection...

Classic menus. On the Data Mining - Feature Selection submenu, select Feature Selection and Variable Screening...

...to display the Feature Selection and Variable Screening dialog box. This module automatically selects subsets of variables from extremely large data files or from databases connected via the Streaming Database Connector. It can handle a practically unlimited number of variables: millions of input variables can be scanned to select predictors for regression or classification.

Specifically, Statistica includes several options for selecting variables ("features") that are likely to be useful or informative in specific subsequent analyses. The unique algorithms implemented in the Feature Selection and Variable Screening module select continuous and categorical predictor variables that show a relationship to the continuous or categorical dependent variables of interest, regardless of whether that relationship is simple (e.g., linear) or complex (nonlinear, non-monotone). Hence, the program does not bias the selection in favor of any particular model that you may later use to derive a final rule or equation for prediction or classification. Optionally, after an initial (unbiased) screening and feature selection step, further post-processing algorithms based on CHAID, C&RT, MARSplines, Neural Networks, and linear modeling methods can be applied to derive a final list of predictors.
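
To illustrate why a model-agnostic screen matters, the following is a minimal Python sketch and not Statistica's actual algorithm, which this topic does not describe. It assumes one common approach: divide a continuous predictor into equal-width intervals and compute a one-way ANOVA F statistic of the dependent variable across those intervals. The function names pearson_r and binned_f_score, the bin count, and the simulated data are all hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation; sensitive mainly to linear relationships."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def binned_f_score(x, y, n_bins=10):
    """One-way ANOVA F statistic of y across equal-width bins of x.

    Hypothetical screening score: it makes no assumption about the
    shape of the relationship between x and y.
    """
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant predictor
    groups = {}
    for a, b in zip(x, y):
        idx = min(int((a - lo) / width), n_bins - 1)
        groups.setdefault(idx, []).append(b)

    n, k = len(y), len(groups)
    if k < 2:
        return 0.0  # a single bin carries no information
    grand = sum(y) / n
    means = {i: sum(g) / len(g) for i, g in groups.items()}
    ss_between = sum(len(g) * (means[i] - grand) ** 2 for i, g in groups.items())
    ss_within = sum((v - means[i]) ** 2 for i, g in groups.items() for v in g)
    if ss_within == 0:
        return math.inf  # bins separate y perfectly
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Simulated data (an assumption for illustration): y depends on x_quad
# through a non-monotone (quadratic) relationship; x_noise is unrelated.
random.seed(0)
x_quad = [random.uniform(-1, 1) for _ in range(500)]
x_noise = [random.uniform(-1, 1) for _ in range(500)]
y = [a * a + random.gauss(0, 0.1) for a in x_quad]

print("Pearson r, x_quad: ", round(pearson_r(x_quad, y), 3))
print("Binned F,  x_quad: ", round(binned_f_score(x_quad, y), 1))
print("Binned F,  x_noise:", round(binned_f_score(x_noise, y), 1))
```

In this sketch the linear correlation between x_quad and y stays close to zero even though y is almost entirely determined by x_quad, whereas the binned score separates x_quad from the unrelated predictor. A screen based only on linear correlation would tend to discard such a variable.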

Various advanced feature selection options are also available. This module is particularly useful in conjunction with the in-place processing of databases (without the need to copy or import the input data to the local machine), where it can be used to scan huge lists of input variables, select likely candidates that contain information relevant to the analyses of interest, and automatically select those variables for further analyses with other nodes in the data miner project. For example, a subset of variables based on an initial scan via this module can be submitted to the Statistica Neural Networks feature selection options for further review. These options allow Statistica Data Miner to handle data sets in the gigabyte and terabyte range.
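
The idea of scanning variables in place can be sketched in Python as follows. This is an illustration under stated assumptions, not the Streaming Database Connector or any Statistica API: it uses the standard sqlite3 module as a stand-in database, a hypothetical function screen_table, and a binned F statistic as the screening score. The database computes each predictor's range, rows are then streamed through a cursor while only small running totals are kept in memory, and the top-scoring predictors are returned.

```python
import heapq
import math
import sqlite3


def screen_table(conn, table, target, predictors, n_bins=10, top_k=2):
    """Rank predictors by a binned F statistic while streaming rows.

    Hypothetical sketch: assumes trusted table/column names, numeric
    columns, and no NULL values.
    """
    cur = conn.cursor()

    # Pass 1: let the database compute each predictor's range.
    agg = ", ".join(f"MIN({p}), MAX({p})" for p in predictors)
    row = cur.execute(f"SELECT {agg} FROM {table}").fetchone()
    ranges = {p: (row[2 * i], row[2 * i + 1]) for i, p in enumerate(predictors)}

    # Running totals per predictor and bin: [count, sum of target].
    stats = {p: [[0, 0.0] for _ in range(n_bins)] for p in predictors}
    n, sum_y, sum_y2 = 0, 0.0, 0.0

    # Pass 2: stream rows; the table is never held in memory as a whole.
    cols = ", ".join([target] + list(predictors))
    for rec in cur.execute(f"SELECT {cols} FROM {table}"):
        y = rec[0]
        n += 1
        sum_y += y
        sum_y2 += y * y
        for p, x in zip(predictors, rec[1:]):
            lo, hi = ranges[p]
            width = (hi - lo) / n_bins or 1.0
            idx = min(int((x - lo) / width), n_bins - 1)
            cell = stats[p][idx]
            cell[0] += 1
            cell[1] += y

    # Turn the running totals into F statistics (sum-of-squares shortcut;
    # simple, though not the most numerically stable formulation).
    ss_total = sum_y2 - sum_y ** 2 / n
    scores = {}
    for p in predictors:
        used = [(c, s) for c, s in stats[p] if c > 0]
        k = len(used)
        ss_between = sum(s * s / c for c, s in used) - sum_y ** 2 / n
        ss_within = ss_total - ss_between
        if k < 2:
            scores[p] = 0.0
        elif ss_within <= 0:
            scores[p] = math.inf
        else:
            scores[p] = (ss_between / (k - 1)) / (ss_within / (n - k))

    return heapq.nlargest(top_k, scores.items(), key=lambda kv: kv[1])


# Small in-memory table standing in for a remote database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (y REAL, x_quad REAL, x_other REAL)")
rows = []
for i in range(101):
    x = i / 50 - 1
    rows.append((x * x, x, float((i * 37) % 101)))  # x_other is a scrambled index
conn.executemany("INSERT INTO demo VALUES (?, ?, ?)", rows)

ranked = screen_table(conn, "demo", "y", ["x_quad", "x_other"], top_k=1)
print(ranked)
```

Because only per-bin counts and sums are retained, memory use in this sketch grows with the number of predictors and bins rather than with the number of rows, which is the property that makes this kind of screening practical for very large tables.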

See also Data Mining with Statistica Data Miner.