Random Sub-Sampling in Data Mining

When mining huge data sets with many millions of observations, it is often neither practical nor necessary to process all cases (although STATISTICA Data Miner includes efficient incremental learning algorithms for performing predictive data mining using all observations in the data set). For example, by properly drawing a random sample of only 100 observations from millions, you can compute a very reliable estimate of the mean. A rule of statistical sampling that is often not intuitive to untrained observers is that the reliability and validity of results depend, among other things, on the size of the random sample, not on the size of the population from which it is drawn. The reason is that the standard error of the mean is σ/√n, a function of the sample size n and the population standard deviation σ, not of the population size N (assuming N is much larger than n). In other words, a mean estimated from 100 randomly sampled observations is essentially as accurate (i.e., falls within the same confidence limits) whether the sample was taken from 1,000 cases or 100 billion cases. Put another way, given a reasonable required degree of accuracy, there is no need to process and include all observations in the final computations (for estimating the mean, fitting models, etc.).
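This principle is easy to verify empirically. The following sketch (illustrative Python, not STATISTICA code; the function name and synthetic data are assumptions for the example) draws repeated samples of 100 observations from populations of two very different sizes and shows that the spread of the resulting sample means is essentially the same in both cases:

    import random
    import statistics

    def sample_mean_spread(population_size, sample_size=100, trials=1000, seed=0):
        """Estimate the spread (standard deviation) of sample means drawn
        from a synthetic population of the given size."""
        rng = random.Random(seed)
        # Synthetic population: uniform values in [0, 1); its true mean is ~0.5.
        population = [rng.random() for _ in range(population_size)]
        means = [
            statistics.fmean(rng.sample(population, sample_size))
            for _ in range(trials)
        ]
        return statistics.stdev(means)

    # The spread of the sample means (the standard error) is nearly identical
    # for both population sizes, because it depends on the sample size n,
    # not on the population size N.
    print(sample_mean_spread(10_000))
    print(sample_mean_spread(1_000_000))

Both printed values come out close to σ/√n ≈ 0.289/√100 ≈ 0.029, regardless of population size.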

STATISTICA Data Miner contains nodes in the Data Cleaning and Filtering folder for drawing a random sample from the original input data (e.g., a database connection). Note that STATISTICA employs a high-quality, validated random number generator to ensure that the selection of observations is not biased.
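The STATISTICA nodes handle the sampling internally; purely as an illustration of how an unbiased random sample can be drawn in a single pass from a data source of unknown size (such as rows streamed from a database cursor), here is a sketch of the classic reservoir sampling technique in Python. This is one common approach, not a description of STATISTICA's implementation:

    import random
    from typing import Iterable, List, TypeVar

    T = TypeVar("T")

    def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
        """Draw a uniform random sample of k rows from a stream of unknown
        length in a single pass. Every row has an equal probability of
        ending up in the final sample."""
        rng = random.Random(seed)
        reservoir: List[T] = []
        for i, row in enumerate(rows):
            if i < k:
                reservoir.append(row)      # fill the reservoir first
            else:
                j = rng.randrange(i + 1)   # row i survives with probability k/(i+1)
                if j < k:
                    reservoir[j] = row
        return reservoir

    # Example: sample 100 "rows" from a simulated stream of one million.
    sample = reservoir_sample(range(1_000_000), k=100)
    print(len(sample), min(sample), max(sample))

Because each incoming row replaces a reservoir entry with probability k/(i+1), every row in the stream has the same chance of appearing in the sample, which is exactly the unbiased-selection property described above.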

See also Random Numbers, DIEHARD Test, and Data Mining.