Data Preparation - Data Preparation Tab

During the Data preparation step, use the options on the Data preparation tab to open or connect to a data file for the analysis, either on a local machine or on a URL server. You can also use this tab to apply data transformations, select and review the variables for the analysis, select a labeling variable that identifies specific cases, remove duplicate cases from the data, and specify the use of a sample data set.

Open/Connect data file. Click this button to display the Select Data Source dialog box and select the data file for the analysis. Data Miner Recipes data files are saved in the standard Statistica format with the extension *.sta.

Note that if you have completed subsequent steps in this project, opening a different data file will invalidate those steps. For this reason, if you click the Open/Connect data file button while a project is in progress, you will be asked whether you want to delete the subsequent steps of the project (if any). Click No to cancel, or click Yes to proceed. If you proceed, a Save as dialog box is displayed so that you can save the current project.

Apply data transformations. Click this button to display the Batch Transformation Formulas dialog box, which contains options to supplement the data transformation formulas built into the Statistica spreadsheet. You can enter several transformation formulas into a text editor and evaluate these transformations in sequence, one by one. Any transformation you choose will also be applied to new data during deployment.
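
The idea of evaluating several transformations in sequence, and then reusing the same list for new data at deployment, can be illustrated outside Statistica with a minimal Python sketch. The variable names (income, log_income, log_income_sq) and the function apply_transforms are hypothetical, and the code does not use Statistica formula syntax; it only shows why the order of the formulas matters.

```python
import math

# Hypothetical transformations, listed in the order they should run.
# Each entry is (new_variable_name, function_of_row). These names are
# illustrative only and are not Statistica formula syntax.
TRANSFORMS = [
    ("log_income", lambda row: math.log(row["income"])),
    # This step depends on the variable created by the previous step.
    ("log_income_sq", lambda row: row["log_income"] ** 2),
]

def apply_transforms(rows, transforms=TRANSFORMS):
    """Return new rows with each transformation applied one by one, in order."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for name, func in transforms:
            new_row[name] = func(new_row)
        out.append(new_row)
    return out

# The same list of steps is used for the analysis data and for new data.
training = apply_transforms([{"income": 1.0}, {"income": math.e}])
deployed = apply_transforms([{"income": math.e ** 2}])
```

Because the second formula reads the result of the first, swapping the two entries would raise a KeyError, which is the practical meaning of evaluating the formulas in sequence.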

Select variables. Click this button to display a five-list variable selection dialog box, which is used to choose variables for the analysis. You can select continuous and/or categorical targets, continuous and/or categorical predictors, and a validation sample variable.

Select label(s). Click this button to display a single-list variable selection dialog box, which is used to choose one or more labeling variables that identify specific cases in the data set. Note that case labels/IDs must be unique. If the selected variable contains duplicate labels (i.e., two or more cases with the same label/ID), or if the case names contain duplicates, the Next step will fail and you will be prompted to review the data set for duplicate names before continuing.
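
Before selecting a labeling variable, it can help to check the labels for duplicates yourself. The following is a minimal Python sketch of such a check, assuming the labels have already been exported to a list; the function name find_duplicate_labels and the sample labels are hypothetical and are not part of Statistica.

```python
from collections import Counter

def find_duplicate_labels(labels):
    """Return the labels that occur more than once, in sorted order."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n > 1)

# Hypothetical case labels; two of them are repeated.
labels = ["case_01", "case_02", "case_02", "case_03", "case_01"]
duplicates = find_duplicate_labels(labels)
```

An empty result means every label is unique, which is the condition described above for proceeding to the next step.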

Use sample dataset. Select the Use sample dataset check box to extract a random sample from the original data set and use that sample as the data for the analysis. By default, this check box is selected for large data files and cleared for smaller data sets. When this check box is selected, the default sampling method is Systematic random sampling with K = 1. Additional sampling methods can be specified on the Advanced tab.
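
For orientation, the textbook form of systematic random sampling picks a random starting case among the first K cases and then takes every K-th case after it. The Python sketch below follows that textbook definition; whether Statistica defines K in exactly this way is an assumption, and the function name systematic_sample is hypothetical.

```python
import random

def systematic_sample(cases, k, rng=None):
    """Textbook systematic sampling: random start in [0, k), then every k-th case."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    rng = rng or random.Random()
    start = rng.randrange(k)  # random offset among the first k cases
    return cases[start::k]

cases = list(range(10))
every_case = systematic_sample(cases, k=1)  # start is always 0 when k is 1
one_in_five = systematic_sample(cases, k=5, rng=random.Random(0))
```

Under this reading, K = 1 keeps every case, and larger values of K keep roughly one case in K.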

Remove duplicate record(s). Select this check box to remove duplicate record(s) from the data set. Options for defining which records are duplicates are available on the Advanced tab.
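
What counts as a duplicate depends on which variables are compared. The Python sketch below illustrates that distinction under stated assumptions: records are plain dictionaries, the first occurrence of each duplicate is kept, and the function name remove_duplicates and its key_fields parameter are hypothetical rather than Statistica options.

```python
def remove_duplicates(rows, key_fields=None):
    """Keep the first occurrence of each record.

    If key_fields is None, records are compared on all of their fields;
    otherwise only the listed fields are compared.
    """
    seen = set()
    unique = []
    for row in rows:
        fields = key_fields if key_fields is not None else sorted(row)
        key = tuple(row[f] for f in fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical records: the second is an exact copy of the first,
# and the third shares only the id.
rows = [
    {"id": 1, "score": 0.5},
    {"id": 1, "score": 0.5},
    {"id": 1, "score": 0.9},
    {"id": 2, "score": 0.5},
]
exact = remove_duplicates(rows)            # compares every field
by_id = remove_duplicates(rows, ["id"])    # compares the id field only
```

Comparing all fields removes only the exact copy, while comparing on id alone also drops the record whose score differs.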

Variables. After variables have been selected, you can review the type (i.e., continuous or categorical) and role (i.e., input, output, validation sample) of each selected variable in this area. To change a variable’s type or role, click the Select variables button.