Select Dependent Variables and Predictors

Produce a downstream data source node via an SVB node. Connect an analysis node to the downstream data source node. Double-click the downstream data source node to display the Select Dependent Variables and Predictors dialog box, which contains two tabs: Quick and Advanced. Use the options on these tabs to define the variables for the current analyses; all analysis nodes or data cleaning and filtering nodes connected to the respective data source will use the applicable information for the requested types of analyses.

Variable selections for different types of analyses. Because of the different nature of the types and classes of analyses, the typical analysis node will not use all variables and specifications selected in this dialog box. For example, a Censoring indicator variable can be specified on the Advanced tab of this dialog box, but it will only be applicable to analyses involving censored observations (e.g., Survival Analysis). Data sources and their descriptions (variable selections, etc.) in Statistica Data Miner can best be thought of as data objects that can be freely moved around a data mining project, connected to and disconnected from various nodes, or dragged from one data miner workspace into another.

Specification of variables for data sources created from analyses. A typical operation in data mining is to specify transformations or data cleaning operations and to connect the filtered or transformed data to subsequent analyses.

When such a "system" of consecutive operations is updated, the variable selections for each data source that is created by a node (e.g., Out 1, Out 2, and Out 3 in the illustration shown above) are, by default, automatically replaced (overwritten). This is desirable when the nodes that create the data sources for further analyses automatically set the correct variables of interest. In other situations it is not. For example, if you want to perform a particular analysis on prediction residuals from a regression analysis, you may want to select those residuals (created by the Regression node) as the dependent variable of interest for subsequent analyses, and you will not want those specifications to be overwritten each time the data mining project is updated (recomputed).

The Select Dependent Variables and Predictors dialog box for data sources that are created from computations performed by a node contains an option to Always use these selections, overriding any selections the generating node may make. Set this option if you want to specify a fixed selection of variables for subsequent analyses.
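The effect of this option can be pictured with a small sketch. It is a conceptual model only, not Statistica code: the class names, the always_use_selection flag, and the update_selection function are hypothetical names introduced here, and the variable names are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VariableSelection:
    """Hypothetical stand-in for the selections made in the dialog box."""
    dependents: List[str] = field(default_factory=list)
    predictors: List[str] = field(default_factory=list)


@dataclass
class DataSource:
    """Hypothetical stand-in for a data source created by a node."""
    name: str
    selection: VariableSelection
    # Models the "Always use these selections" option (assumed name).
    always_use_selection: bool = False


def update_selection(source: DataSource, proposed: VariableSelection) -> VariableSelection:
    """Return the selection in effect after the generating node is recomputed.

    By default the node's proposed selection overwrites the stored one;
    with the option set, the stored selection is kept.
    """
    if not source.always_use_selection:
        source.selection = proposed
    return source.selection


# Two downstream sources start with the same user-made selection.
fixed_source = DataSource("Out 1", VariableSelection(["Residual"], ["Age"]),
                          always_use_selection=True)
default_source = DataSource("Out 2", VariableSelection(["Residual"], ["Age"]))

# The generating node proposes its own selection when the project is updated.
node_proposal = VariableSelection(["Income"], ["Age", "Region"])

kept = update_selection(fixed_source, node_proposal)
replaced = update_selection(default_source, node_proposal)
```

In this sketch, kept still names Residual as the dependent variable, while replaced has taken over the node's proposal.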

Data for deployment project; do not re-estimate models. This option applies only to analyses that can automatically generate deployed solutions, which can then be applied to new data. Examples are all analytic nodes in the Classification and Discrimination or Regression Modeling and Multivariate Exploration folder of the All Procedures Node Browser configuration; typically, nodes that support automatic deployment of models or solutions are named "{Type of Method or Model} with Deployment" (see also Getting Started and General Architecture of Statistica Data Miner for details). Analytic nodes that automatically generate information for deployment can either use the input data to fit or estimate the respective type of model (e.g., perform a multiple regression analysis), or apply a previously estimated or fitted model to new data to compute predicted values or classifications (e.g., apply a linear multiple regression equation to compute predicted values for new observations or measurements). For those analytic nodes, use this option to mark whether the respective data source is to be used for estimating or fitting the model(s), or for deployed projects or models only.

Processing order for data marked for deployed projects. When multiple data sources are connected to the same data cleaning or filtering node, or to the same analytic node, the order in which the node evaluates those data sources is generally not fixed (i.e., not predictable). However, data sources that are marked for deployed projects are always evaluated after all data sources that are not marked for deployed projects have been processed. This is an important feature because it allows you to connect both ordinary data sources and data sources marked for deployment to the same analytic node that automatically generates information for deployment; the program will then estimate model parameters from the "training data" (not marked for deployment) and apply the model to the "testing data" (marked for deployed projects).
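The ordering rule and the resulting train-then-apply behavior can be illustrated with a short sketch. It is a conceptual model under stated assumptions, not the program's implementation: the Source class, the for_deployment flag, and the functions are hypothetical, a simple one-predictor least-squares line stands in for whatever model the node fits, and the data values are invented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Source:
    """Hypothetical data source; for_deployment models the dialog option."""
    name: str
    for_deployment: bool
    x: List[float]
    y: List[float]  # may be empty when observed values are not available


def processing_order(sources: List[Source]) -> List[Source]:
    # Stable sort on the flag: sources not marked for deployment (False)
    # come first, sources marked for deployment (True) come last.
    return sorted(sources, key=lambda s: s.for_deployment)


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def run_node(sources: List[Source]) -> Dict[str, List[float]]:
    """Fit on ordinary sources, then apply the fitted model to deployment sources."""
    model: Optional[Tuple[float, float]] = None
    predictions: Dict[str, List[float]] = {}
    for source in processing_order(sources):
        if not source.for_deployment:
            # Simplification: a later training source replaces the earlier fit.
            model = fit_line(source.x, source.y)
        else:
            if model is None:
                raise ValueError("no training data was connected to the node")
            slope, intercept = model
            predictions[source.name] = [slope * xi + intercept for xi in source.x]
    return predictions


# The deployment source is listed first on purpose; it is still processed last.
sources = [
    Source("testing", True, [4.0, 5.0], []),
    Source("training", False, [1.0, 2.0, 3.0], [3.0, 5.0, 7.0]),
]
predictions = run_node(sources)
```

Because the training values lie exactly on y = 2x + 1, the predictions for the testing source are 9.0 and 11.0, regardless of the order in which the two sources were connected.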

Predicting new observations, when observed values are not (yet) available. One of the main purposes of predictive data mining (see Concepts in Data Mining) is to allow for accurate prediction (predicted classification) of new observations, for which observed values or classifications are not (yet) available. An example of such an application is presented in Example 3 (see also Example 4). When connecting data for deployment (prediction or predicted classification) to the nodes for Classification and Discrimination or Regression Modeling and Multivariate Exploration, ensure that the "structure" of the input file for deployment is the same as that used for building the models. Specifically, make sure that the same numbers and types of predictor variables are specified, that a (continuous or categorical) dependent variable is specified (even if all values for that variable are missing), and that the variable names match those in the data file used to build the models (this is particularly important for the deployment of neural networks, which will rely on this information).
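A check of this kind can be sketched outside the program before the deployment file is connected. The sketch below is an illustration under stated assumptions, not a Statistica feature: each file is described by a hypothetical dictionary that maps variable names to "continuous" or "categorical", the function name structure_problems is made up, and names are compared exactly (including case).

```python
from typing import Dict, List


def structure_problems(training: Dict[str, str],
                       deployment: Dict[str, str],
                       dependent: str) -> List[str]:
    """List the ways the deployment file's structure differs from the training file.

    Only names and types are compared, so a dependent variable whose
    values are all missing still passes as long as the column exists.
    """
    problems: List[str] = []
    for name, kind in training.items():
        role = "dependent variable" if name == dependent else "predictor"
        if name not in deployment:
            problems.append(f"{role} '{name}' is missing")
        elif deployment[name] != kind:
            problems.append(
                f"{role} '{name}' is {deployment[name]}, expected {kind}")
    for name in deployment:
        if name not in training:
            problems.append(f"unexpected variable '{name}'")
    return problems


# Hypothetical variable descriptions for illustration.
training_vars = {"Risk": "categorical", "Age": "continuous", "Region": "categorical"}
matching_vars = {"Risk": "categorical", "Age": "continuous", "Region": "categorical"}
mismatched_vars = {"Age": "categorical", "region": "categorical"}

ok_problems = structure_problems(training_vars, matching_vars, "Risk")
bad_problems = structure_problems(training_vars, mismatched_vars, "Risk")
```

Here ok_problems is empty, while bad_problems reports the missing dependent variable, the changed type of Age, the missing Region predictor, and the unexpected variable region.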

OK. Click this button to accept the selections you have made and exit this dialog box.

Cancel. Click this button to exit the dialog box without making any selections.

See also Data Mining Definition, Data Mining with Statistica Data Miner, Structure and User Interface of Statistica Data Miner, Statistica Data Miner Summary, and Getting Started with Statistica Data Miner.