Data Preparation - Advanced Tab

During the Data Preparation step, the following options are available on the Advanced tab.

Use sample data. Use the options in this group box to create a new output data spreadsheet that is a sample of the input data set. Select the sampling method you want to use, and then click the More options button to view sampling options specific to the selected method.

Systematic random sampling. This method creates a new output data spreadsheet consisting of all selected variables and a random subset of the cases. When you select Systematic random sampling and click the More options button, the Systematic random sampling dialog box is displayed, where you can specify the K-value (the common distance between the cases selected from the original data set).
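The idea behind a K-value can be sketched in a few lines of Python. This is a minimal illustration of textbook systematic sampling, not Statistica's implementation; the function name `systematic_sample`, the random starting offset, and the `seed` parameter are assumptions made for the example.

```python
import random

def systematic_sample(cases, k, seed=None):
    """Return every k-th case, starting from a random offset in [0, k).

    Hypothetical helper: the random start is an assumption of this sketch.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    rng = random.Random(seed)
    start = rng.randrange(k)  # random first case within the first k cases
    return cases[start::k]    # then step through the data at distance k

cases = list(range(100))      # stand-in for 100 input cases
sample = systematic_sample(cases, k=10, seed=1)
```

With K = 10 and 100 cases, the sample holds 10 cases, each exactly 10 positions after the previous one.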

Stratified random sampling. This method creates a new output data spreadsheet as a stratified random sample of the input data. Use this option to systematically over-sample rare events, for example, in predictive classification projects. When you select Stratified random sampling and click the More options button, the Stratified sampling dialog box is displayed, where you can select one or more stratification variables and specify either sampling percentages or approximate numbers of cases for each stratum. You can also request constant sampling rates for all strata and additional subsetting of variables and/or cases.
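Over-sampling a rare event with per-stratum sampling rates might look like the following sketch. It assumes a single stratification variable and exact (rounded) per-stratum counts; the names `stratified_sample`, `stratum_of`, and `rates`, and the example `fraud` variable, are hypothetical and do not come from Statistica.

```python
import random
from collections import defaultdict

def stratified_sample(cases, stratum_of, rates, seed=None):
    """Sample each stratum at its own rate (a fraction between 0 and 1)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in cases:
        strata[stratum_of(case)].append(case)  # group cases by stratum
    sample = []
    for label, members in strata.items():
        n = round(len(members) * rates[label])  # cases to draw from this stratum
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical data: 5% of the cases are the rare event.
cases = [{"id": i, "fraud": i % 20 == 0} for i in range(1000)]
# Keep every rare case but only 10% of the common ones.
sample = stratified_sample(cases, lambda c: c["fraud"],
                           {True: 1.0, False: 0.1}, seed=0)
```

Here the rare stratum makes up 5% of the input but roughly a third of the sample, which is the kind of rebalancing the option is meant for.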

Simple random sampling. This method creates a new output data spreadsheet as a random sample of the input data. When you select Simple random sampling and click the More options button, the Simple random sampling dialog box is displayed, where you can choose to use either a specified percentage of the cases or an approximate constant number of cases. You can also set the seed for sampling.
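The two ways of sizing a simple random sample, and the role of the seed, can be illustrated as follows. This sketch draws an exact number of cases without replacement, whereas the dialog box describes the count as approximate, so the product may use a different scheme. The function name and parameters are assumptions.

```python
import random

def simple_random_sample(cases, percent=None, n=None, seed=None):
    """Draw a sample without replacement, sized by percentage or by count."""
    if (percent is None) == (n is None):
        raise ValueError("specify exactly one of percent or n")
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    if percent is not None:
        n = round(len(cases) * percent / 100)
    return rng.sample(cases, min(n, len(cases)))

cases = list(range(100))
by_percent = simple_random_sample(cases, percent=25, seed=42)
by_count = simple_random_sample(cases, n=10, seed=42)
```

Calling the function again with the same seed and the same input returns the same cases, which is why setting the seed is useful for reproducible runs.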

Remove duplicate records (cases). Select the Remove duplicate records (cases) check box and click the Duplicate records (cases) button to display the Select variables to define duplicate records dialog box. Use this single variable selection dialog box to select any number of variables; two records are treated as duplicates when they match on all of the selected variables. When this check box is selected, Statistica detects and removes duplicate records during the run and validation process.
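Defining duplicates by a chosen set of variables can be sketched as below. Keeping the first occurrence of each duplicate is an assumption of this example, as are the function name and the sample variables; the text above does not say which record is retained.

```python
def remove_duplicates(cases, key_vars):
    """Keep the first case for each distinct combination of key_vars."""
    seen = set()
    unique = []
    for case in cases:
        key = tuple(case[v] for v in key_vars)  # values that define a duplicate
        if key not in seen:
            seen.add(key)
            unique.append(case)
    return unique

cases = [
    {"customer": "A", "date": "2020-01-01", "amount": 10},
    {"customer": "A", "date": "2020-01-01", "amount": 12},  # same customer and date
    {"customer": "B", "date": "2020-01-01", "amount": 7},
]
unique = remove_duplicates(cases, ["customer", "date"])
```

Because `amount` is not among the key variables, the second record counts as a duplicate of the first even though its amount differs.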

Valid data range. Select the Valid data range check box and click the Valid data range button to display the Missing data and Invalid Case Definition dialog box. Use the options in this dialog box to specify a minimum and maximum value for each of the selected variables. Cases with values outside the specified range will be treated as invalid data.
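A per-variable range check of this kind can be sketched as follows. Treating the bounds as inclusive, returning the invalid cases separately, and assuming no missing values in the checked variables are all choices made for illustration, and the names are hypothetical.

```python
def split_by_valid_range(cases, ranges):
    """Separate cases whose values all lie within [minimum, maximum]."""
    valid, invalid = [], []
    for case in cases:
        in_range = all(lo <= case[var] <= hi
                       for var, (lo, hi) in ranges.items())
        (valid if in_range else invalid).append(case)
    return valid, invalid

cases = [
    {"age": 34, "income": 52000},
    {"age": -1, "income": 48000},     # age below the minimum
    {"age": 51, "income": 9999999},   # income above the maximum
]
ranges = {"age": (0, 120), "income": (0, 1000000)}
valid, invalid = split_by_valid_range(cases, ranges)
```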

Remove outlier. As part of the Data Preparation step, you can perform outlier analysis on one or more of the selected variables. Select the Remove outlier check box and click the Outlier button to display the Outlier and Extreme Value dialog box. Use the options in this dialog box to select the variables for outlier analysis and specify how to treat outliers once they are detected. The Set to boundary option (located in the Outlier and Extreme Value dialog box) iteratively recodes outliers and extreme values to the ±3 standard deviation limits.
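One possible reading of iterative recoding to ±3 standard deviation limits is sketched below. The stopping rule (stop when a pass changes nothing, or after `max_passes` passes), the use of the sample standard deviation, and the function name are all assumptions; the text above states only that the recoding is iterative.

```python
import statistics

def set_to_boundary(values, n_sd=3.0, max_passes=10):
    """Repeatedly clip values to mean +/- n_sd standard deviations."""
    values = list(values)
    for _ in range(max_passes):
        mean = statistics.mean(values)
        sd = statistics.stdev(values)          # sample standard deviation
        lower, upper = mean - n_sd * sd, mean + n_sd * sd
        clipped = [min(max(v, lower), upper) for v in values]
        if clipped == values:                  # nothing left outside the limits
            break
        values = clipped                       # limits are recomputed next pass
    return values

data = [10] * 20 + [1000]                      # one extreme value
recoded = set_to_boundary(data)
```

Each pass recomputes the mean and standard deviation from the recoded data, so a value that was moved to the boundary in one pass can fall outside the narrower limits of the next. This is why a cap on the number of passes is included in the sketch.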

Missing data. Select the Missing data check box and click the Missing data definition button to display the Missing data definition dialog box, where you can specify, for each variable in the analysis, the method used to handle missing data, such as mean substitution or casewise deletion.
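The two methods named above differ in what they keep, as the following sketch shows. Missing values are represented here as `None`, and the function names are hypothetical; neither detail comes from Statistica.

```python
import statistics

def mean_substitute(cases, var):
    """Replace missing values of var with the mean of its observed values."""
    observed = [c[var] for c in cases if c[var] is not None]
    mean = statistics.mean(observed)
    return [{**c, var: mean if c[var] is None else c[var]} for c in cases]

def casewise_delete(cases, variables):
    """Drop every case that has a missing value in any listed variable."""
    return [c for c in cases if all(c[v] is not None for v in variables)]

cases = [{"x": 1.0}, {"x": None}, {"x": 3.0}]
filled = mean_substitute(cases, "x")     # keeps all three cases
complete = casewise_delete(cases, ["x"])  # keeps only the two complete cases
```

Mean substitution preserves the number of cases at the cost of reducing the variable's variance, while casewise deletion keeps only fully observed cases.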