SANN - Data Selection - Sampling (CNN and ANS) Tab

Select the Sampling (CNN and ANS) tab of the SANN - Data selection dialog box to access the options described here. The performance of a neural network is measured by how well it generalizes to unseen data (i.e., how well it predicts data that was not used during training). The issue of generalization is actually one of the major concerns when training neural networks. When the training data have been overfit (i.e., been fit so completely that even the random noise within the particular data set is reproduced), it is difficult for the network to make accurate predictions using new data (i.e., when the network is deployed). See overfitting for more details.

One way to combat this problem is to split the data into two or three subsets: a training sample, a testing sample, and (optionally) a validation sample. These samples can then be used to 1) train the network, 2) verify (test) the performance of the network during training, and 3) perform a final validation test to determine how well the network predicts "new" data that was used neither to train the model nor to test its performance during training.
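The three roles described above can be illustrated with a minimal Python sketch. It is not SANN's training algorithm: the one-variable linear model, the gradient descent loop, the synthetic data, and all names (mse, train_with_test_monitoring) are assumptions chosen only to show how each sample is used.

```python
import random

def mse(w, b, sample):
    """Mean squared error of the line y = w*x + b on a list of (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in sample) / len(sample)

def train_with_test_monitoring(train, test, epochs=500, lr=0.05):
    """Fit on the training sample; keep the weights with the lowest test error."""
    w, b = 0.0, 0.0
    best = (mse(w, b, test), w, b, 0)  # (test error, w, b, epoch)
    for epoch in range(1, epochs + 1):
        # 1) Train: the gradient step uses the training sample only.
        gw = sum(2 * (w * x + b - y) * x for x, y in train) / len(train)
        gb = sum(2 * (w * x + b - y) for x, y in train) / len(train)
        w -= lr * gw
        b -= lr * gb
        # 2) Test: monitor performance on cases not used for fitting.
        test_err = mse(w, b, test)
        if test_err < best[0]:
            best = (test_err, w, b, epoch)
    return best

# Synthetic data (hypothetical): y = 2x + 1 plus a little random noise.
rng = random.Random(0)
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.1))
        for x in (rng.uniform(-1, 1) for _ in range(100))]
train, test, validation = data[:70], data[70:85], data[85:]

test_err, w, b, best_epoch = train_with_test_monitoring(train, test)

# 3) Validate: a final check on cases that played no part in training or testing.
validation_err = mse(w, b, validation)
print(f"best epoch={best_epoch}, test MSE={test_err:.4f}, "
      f"validation MSE={validation_err:.4f}")
```

Because the test sample influences which weights are kept, its error is not an unbiased estimate of performance on new data; the validation sample supplies that estimate.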

In SANN, the assignment of the cases to the subsets can be executed randomly or based upon a special subset variable in the data set.

Sampling Method.

Random sampling.

Random sample sizes. Select this option button to specify that STATISTICA will randomly assign cases to subsets based on specified percentages, with the total percentage summing to no more than 100. If you do not want to split the data into subsets, simply set the values of Test (%) and Validation (%) to zero. Note, however, that the use of at least one holdout sample (the test sample) is strongly recommended to aid in training the neural network models. Also note that the sum of the sample percentages can be less than 100, which is useful when the data set contains a large number of cases. Training neural networks on large data sets can be time consuming, and randomly omitting cases from a large data set can reduce the computation time while still producing good models (models that perform well on real data), provided you include a reasonable percentage of the cases in the analysis.

Train (%). Specify the percent of valid cases to use in the training sample. Must be larger than 0 and smaller than or equal to 100.

Test (%). Use this option to randomly assign cases to the test sample. Specify here the percentage of cases to use. To select no test sample, simply enter 0 (not recommended).

Validation (%). Use this option to randomly assign cases to the validation sample. Specify here the percentage of cases to use. To select no validation sample, simply enter 0 (not recommended).

Seed for sampling. The positive integer value entered in the Seed for sampling box is used as the seed for a random number generator that produces the random samples from the data. Starting from the same seed will yield the same sample. If you want to create a different data sample, change the seed value.
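The random assignment described above can be sketched in Python as follows. This is only an illustration of the idea, not STATISTICA's implementation: the function name assign_samples, the label strings, and the rounding-down of sample sizes are assumptions, and Python's random number generator will not reproduce the samples STATISTICA draws for the same seed.

```python
import random

def assign_samples(n_cases, train_pct, test_pct, validation_pct, seed):
    """Return one label per case: 'train', 'test', 'validation', or None (omitted)."""
    if not 0 < train_pct <= 100:
        raise ValueError("Train (%) must be larger than 0 and at most 100")
    if test_pct < 0 or validation_pct < 0:
        raise ValueError("Test (%) and Validation (%) cannot be negative")
    if train_pct + test_pct + validation_pct > 100:
        raise ValueError("The percentages must sum to no more than 100")

    # The same seed always produces the same shuffle, hence the same samples.
    rng = random.Random(seed)
    order = list(range(n_cases))
    rng.shuffle(order)

    # Sample sizes are rounded down (an assumption of this sketch).
    n_train = int(n_cases * train_pct / 100)
    n_test = int(n_cases * test_pct / 100)
    n_validation = int(n_cases * validation_pct / 100)

    labels = [None] * n_cases  # cases left as None are omitted from the analysis
    for i in order[:n_train]:
        labels[i] = "train"
    for i in order[n_train:n_train + n_test]:
        labels[i] = "test"
    for i in order[n_train + n_test:n_train + n_test + n_validation]:
        labels[i] = "validation"
    return labels

labels = assign_samples(200, train_pct=70, test_pct=15, validation_pct=15, seed=1000)
repeat = assign_samples(200, train_pct=70, test_pct=15, validation_pct=15, seed=1000)
partial = assign_samples(200, train_pct=50, test_pct=10, validation_pct=10, seed=1000)
print(labels.count("train"), labels.count("test"), labels.count("validation"))
```

In this sketch, labels and repeat are identical because they share a seed, and partial leaves 30 percent of the cases unassigned because its percentages sum to 70.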

Subset variable.

Sampling variable. Select the Sampling variable option button when your data set contains variables that indicate to which sample (Training, Testing, Validation) each case belongs. You will then need to specify a spreadsheet variable and codes to identify which cases are used for the various samples.

Training sample. Click the Training sample button to display the Sampling variable dialog box, which is used to select a sample identifier variable and a code for that variable that uniquely identifies the cases to be used in the training sample. After you have specified the sample identifier and code, the variable name and code will be displayed adjacent to the button.

Testing sample. Click the Testing sample button to display the Sampling variable dialog box, which is used to select a sample identifier variable and a code for that variable that uniquely identifies the cases to be used in the testing sample. After you have specified the sample identifier and code, the variable name and code will be displayed adjacent to the button.

Validation sample. Click the Validation sample button to display the Sampling variable dialog box, which is used to select a sample identifier variable and a code for that variable that uniquely identifies the cases to be used in the validation sample. After you have specified the sample identifier and code, the variable name and code will be displayed adjacent to the button.
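Assignment by a sampling variable amounts to filtering cases on a code, which the following Python sketch illustrates. The variable name SAMPLE, the codes TR, TE, and VA, and the function split_by_sampling_variable are hypothetical; in SANN the variable and codes are chosen in the Sampling variable dialog box, and this sketch assumes a single identifier variable for all three samples.

```python
def split_by_sampling_variable(cases, variable, codes):
    """Group cases by the code found in the sample identifier variable.

    cases    -- list of dicts, one per case (spreadsheet row)
    variable -- name of the sample identifier variable
    codes    -- mapping of sample name to the code that identifies it
    """
    samples = {name: [] for name in codes}
    for case in cases:
        for name, code in codes.items():
            if case[variable] == code:
                samples[name].append(case)
                break  # each code should identify exactly one sample
    return samples

# Hypothetical data: SAMPLE holds the code for each case.
cases = [
    {"X": 1.2, "Y": 0, "SAMPLE": "TR"},
    {"X": 0.7, "Y": 1, "SAMPLE": "TR"},
    {"X": 2.4, "Y": 0, "SAMPLE": "TE"},
    {"X": 1.9, "Y": 1, "SAMPLE": "VA"},
    {"X": 3.1, "Y": 1, "SAMPLE": "XX"},  # matches no code, so it is left out
]
codes = {"Training": "TR", "Testing": "TE", "Validation": "VA"}
samples = split_by_sampling_variable(cases, "SAMPLE", codes)
print({name: len(members) for name, members in samples.items()})
```

Cases whose code matches none of the specified codes are simply not placed in any sample in this sketch.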