General Best-Subset and Stepwise Discriminant Analysis

General best-subset and stepwise discriminant analysis builds a linear discriminant function model for continuous and categorical predictor variables, using ANCOVA-like designs. The parameters in Statistica allow full access to the GDA syntax for specifying ANCOVA-like models and for controlling stepwise and best-subset selection of predictor effects (for categorical and continuous predictor variables). Note that the algorithm for stepwise and best-subset selection of categorical factor effects ensures that complete (possibly multiple-degrees-of-freedom) effects are moved into and out of the model.

The General Discriminant Analysis module provides functionality that makes this technique a general tool for classification and data mining. However, most, if not all, textbook treatments of discriminant function analysis are limited to simple and stepwise analyses with single-degree-of-freedom continuous predictors. No experience (in the literature) exists regarding the robustness and effectiveness of these techniques when they are generalized in the manner provided in this very powerful module. The use of best-subset methods, in particular in conjunction with categorical predictors, should be considered a heuristic search method rather than a statistical analysis technique.

General

Detail of computed results reported. Specifies the level of detail of the reported results; if Minimal level of detail is requested, the output contains Chi-square tests of roots, discriminant (canonical) function coefficients, factor structure coefficients, and classification function coefficients. If All results is requested, Statistica will also report various descriptive statistics and classification summary statistics. Classification statistics for each case can be requested separately as an option.

Analysis syntax. Analysis syntax string for General Discriminant Function Analysis (GDA) models; you can specify here the complete syntax, as, for example, copied from a Statistica analysis. Set this string to empty, or just GDA; to create the syntax from the specific options selected below.

Design. Required; specifies the between-group (ANCOVA-like) design (categorical and continuous predictors); default is NONE.

Use the syntax:
DESIGN = Design specifications

Example 1.
DESIGN = GROUP | GENDER | TIME | PAID; {makes a full factorial design}

Example 2.
DESIGN = SEQUENCE + PERSON(SEQUENCE) + TREATMNT + SEQUENCE*TREATMNT;

Example 3.
DESIGN = MULLET | SHEEPSHD | CROAKER @2; {Makes factorial design to degree 2}

Example 4.
DESIGN = TEMPERAT | MULLET | SHEEPSHD | CROAKER - TEMPERAT; {Removes main effect for TEMPERAT from factorial design}

Example 5.
DESIGN = BLOCK + DEGREES + DEGREES*DEGREES + TIME + TIME*TIME + TIME*DEGREES;
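The bar operator and the @degree limit in Examples 1, 3, and 4 can be read as a shorthand for a list of main effects and interactions. The following Python sketch only illustrates that reading; it is not Statistica's parser, and the function name expand_factorial is a hypothetical helper introduced here.

```python
from itertools import combinations

def expand_factorial(factors, max_degree=None):
    """Expand 'A | B | C' style notation into main effects and interactions.

    Illustrative only: mimics the meaning of the bar operator and the
    optional '@degree' limit; it is not Statistica's own parser.
    """
    top = len(factors) if max_degree is None else min(max_degree, len(factors))
    effects = []
    for degree in range(1, top + 1):
        for combo in combinations(factors, degree):
            effects.append("*".join(combo))
    return effects

# Example 1: full factorial design for four factors
full = expand_factorial(["GROUP", "GENDER", "TIME", "PAID"])

# Example 3: factorial design limited to degree 2
degree2 = expand_factorial(["MULLET", "SHEEPSHD", "CROAKER"], max_degree=2)

# Example 4: full factorial with the TEMPERAT main effect removed
no_temperat = [
    e for e in expand_factorial(["TEMPERAT", "MULLET", "SHEEPSHD", "CROAKER"])
    if e != "TEMPERAT"
]

for effect in degree2:
    print(effect)
```

Under this reading, a full factorial design for four factors contains 15 effects (4 main effects, 6 two-way, 4 three-way, and 1 four-way interaction).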

Model building method. Specifies a model building method.

Priors. Set the prior classification probabilities for classifying observations. The default specification is Estimated; use this option to set the prior classification probabilities proportional to the observed group (class) N's; use the Equal option to assign equal probabilities to each group or class specified in the categorical dependent variable.
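The difference between the Estimated and Equal options can be sketched in a few lines of Python. This is a minimal illustration of the two definitions given above, not Statistica code; the function name and the method strings are assumptions made for the example.

```python
from collections import Counter

def prior_probabilities(labels, method="estimated"):
    """Return prior classification probabilities per group.

    'estimated': proportional to the observed group N's.
    'equal':     the same probability for every group.
    """
    counts = Counter(labels)
    if method == "estimated":
        n = len(labels)
        return {group: count / n for group, count in counts.items()}
    if method == "equal":
        return {group: 1.0 / len(counts) for group in counts}
    raise ValueError("method must be 'estimated' or 'equal'")

# Hypothetical sample: 6 cases in group a, 3 in group b, 1 in group c
labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
estimated = prior_probabilities(labels, "estimated")
equal = prior_probabilities(labels, "equal")
print(estimated)
print(equal)
```

With unbalanced groups, Estimated priors favor classification into the larger groups, while Equal priors treat all groups alike regardless of sample size.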

Case statistics. Creates and reports selected case statistics.

Sweep delta 1.E-. Specifies the negative exponent for a base-10 constant Delta (delta = 10^-sdelta); the default value is 7. Delta is used (1) in sweeping, to detect redundant columns in the design matrix, and (2) for evaluating the estimability of hypotheses; specifically a value of 2*delta is used for the estimability check.

Inverse delta 1.E-. Specifies the negative exponent for a base-10 constant Delta (delta = 10^-idelta); the default value is 12. Delta is used to check for matrix singularity in matrix inversion calculations.
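Both delta parameters are entered as exponents, not as the tolerance values themselves. The arithmetic for the default settings can be checked with a short Python sketch (the variable and function names are illustrative only):

```python
def tolerance(neg_exponent):
    """Convert the entered negative exponent into delta = 10^-exponent."""
    return 10.0 ** (-neg_exponent)

sweep_delta = tolerance(7)                    # default sweep delta
estimability_tolerance = 2 * sweep_delta      # 2*delta for the estimability check
inverse_delta = tolerance(12)                 # default inverse delta

print(sweep_delta, estimability_tolerance, inverse_delta)
```

Entering a larger exponent therefore makes the corresponding check stricter, because the resulting delta is smaller.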

Generates data source, if N for input less than. Generates a data source for further analyses with other Data Miner nodes if the input data source has fewer than k observations, as specified in this edit field; note that parameter k (number of observations) will be evaluated against the number of observations in the input data source, not the number of valid or selected observations.

Parameters for stepwise selection

Stepwise selection criterion. Specifies the criterion to use for stepwise selection of predictors. Note that the F statistic (criterion) is only available for analysis problems with continuous (single degree of freedom) predictors; for ANCOVA-like designs with factor effects for categorical predictors, only the Probability criterion is applicable.

p to enter. Specifies p-to-enter for stepwise selection of predictors.

p to remove. Specifies p-to-remove for stepwise selection of predictors.

F to enter. Specifies F-to-enter for stepwise selection of predictors; note that the F statistic (criterion) is only available for analysis problems with continuous (single degree of freedom) predictors; for ANCOVA-like designs with factor effects for categorical predictors, only the Probability criterion is applicable.

F to remove. Specifies F-to-remove for stepwise selection of predictors; note that the F statistic (criterion) is only available for analysis problems with continuous (single degree of freedom) predictors; for ANCOVA-like designs with factor effects for categorical predictors, only the Probability criterion is applicable.

Maximum number of steps. Specifies maximum number of steps for stepwise selection of variables.
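How p to enter, p to remove, and the maximum number of steps interact can be sketched as follows. This is a generic stepwise loop written under stated assumptions, not the algorithm implemented in Statistica: removal is checked before entry at each step, and p_value_fn is a hypothetical callback that returns the p-value of an effect given the current model.

```python
def stepwise_step(in_model_p, candidate_p, p_enter, p_remove):
    """Decide one step: ('remove', effect), ('enter', effect) or ('stop', None)."""
    if in_model_p:
        worst = max(in_model_p, key=in_model_p.get)
        if in_model_p[worst] > p_remove:
            return ("remove", worst)
    if candidate_p:
        best = min(candidate_p, key=candidate_p.get)
        if candidate_p[best] < p_enter:
            return ("enter", best)
    return ("stop", None)

def run_stepwise(effects, p_value_fn, p_enter, p_remove, max_steps):
    """Repeat single steps until nothing qualifies or max_steps is reached."""
    model = []
    for _ in range(max_steps):
        in_p = {e: p_value_fn(e, model) for e in model}
        cand_p = {e: p_value_fn(e, model) for e in effects if e not in model}
        action, effect = stepwise_step(in_p, cand_p, p_enter, p_remove)
        if action == "stop":
            break
        if action == "enter":
            model.append(effect)
        else:
            model.remove(effect)
    return model

# Toy p-values that do not depend on the current model (an assumption
# made only to keep the example deterministic).
table = {"A": 0.01, "B": 0.20, "C": 0.03}
selected = run_stepwise(list(table), lambda e, m: table[e],
                        p_enter=0.05, p_remove=0.10, max_steps=10)
print(selected)
```

In practice the p-values change as effects enter and leave the model, which is why a step limit is useful as a safeguard against long or cycling searches.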

Parameters for best-subset selection

Best subsets measure. Specifies the selection criterion for best subset selection of predictors. To use cross-validation misclassification rates, a cross-validation variable (learning sample) must be specified.

Start for best subsets. Specifies the smallest number of predictors to be included in the model chosen via best subset selection, i.e., the start of the search for the best subset of predictors.

Stop for best subsets. Specifies the maximum number of predictors to be included in the model chosen via best subset selection.

Number of subsets to display. Specifies the number of subsets to display in the results; Statistica will keep a log of the best k predictor models of any given size, using k as specified by this parameter.

Number of variables to force. Specifies the number of predictors to force into the model, i.e., to include in all models considered during the best-subset selection of predictors. Statistica will force the first k predictors in the list of continuous predictors into the model, with k as specified here.
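The roles of the start, stop, number-of-subsets, and forced-variables parameters can be illustrated with a brute-force search in Python. This is a sketch of a generic best-subset enumeration, not Statistica's implementation; score_fn is a hypothetical callback for which lower values are better (for example, a misclassification rate).

```python
from itertools import combinations
import heapq

def best_subsets(predictors, score_fn, start, stop, n_keep, n_forced=0):
    """Return the n_keep best subsets for each size from start to stop.

    The first n_forced predictors in the list are included in every subset.
    """
    forced = tuple(predictors[:n_forced])
    free = predictors[n_forced:]
    results = {}
    for size in range(max(start, n_forced), stop + 1):
        k_free = size - n_forced
        if k_free > len(free):
            break
        scored = [
            (score_fn(forced + combo), forced + combo)
            for combo in combinations(free, k_free)
        ]
        results[size] = heapq.nsmallest(n_keep, scored)
    return results

# Toy score: each predictor lowers the score by a fixed, made-up amount.
gain = {"x1": 10, "x2": 40, "x3": 30, "x4": 5}
score = lambda subset: 100 - sum(gain[p] for p in subset)

results = best_subsets(["x1", "x2", "x3", "x4"], score,
                       start=2, stop=3, n_keep=2, n_forced=1)
for size, subsets in results.items():
    print(size, subsets)
```

Because the number of candidate subsets grows combinatorially with the number of predictors, narrowing the start/stop range or forcing variables into the model reduces the size of the search.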

Deployment. Deployment is available if the Statistica installation is licensed for this feature.

Generates C/C++ code. Generates C/C++ code for deployment of predictive model.

Generates SVB code. Generates Statistica Visual Basic code for deployment of predictive model.

Generates PMML code. Generates PMML (Predictive Model Markup Language) code for deployment of predictive model. This code can be used via the Rapid Deployment options to efficiently compute predictions for (score) large data sets.

Saves C/C++ code. Saves C/C++ code for deployment of predictive model.

File name for C/C++ code. Specify the name and location of the file in which to save the (C/C++) deployment code information.

Saves SVB code. Saves Statistica Visual Basic code for deployment of predictive model.

File name for SVB code. Specify the name and location of the file in which to save the (SVB/VB) deployment code information.

Saves PMML code. Saves PMML (Predictive Model Markup Language) code for deployment of predictive model. This code can be used via the Rapid Deployment options to efficiently compute predictions for (score) large data sets.

File name for PMML (XML) code. Specify the name and location of the file in which to save the (PMML/XML) deployment code information.