Stratified Random Sampling

In general, random sampling is the process of randomly selecting observations from a population to create a subsample that "represents" the observations in that population (see Kish, 1965; see also Probability Sampling, Simple Random Sampling, and EPSEM Samples; see also Representative Sample for a brief exploration of this often misunderstood notion). In stratified sampling, we divide the population into groups (strata) and usually apply a specific sampling fraction to each group to draw the sample; the fractions can be identical across strata or can differ from one stratum to the next. In STATISTICA, we can draw stratified random samples by using the Random Sampling options.
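The idea of per-stratum sampling fractions can be illustrated with a minimal Python sketch. It is not the STATISTICA implementation; the function name stratified_sample, the "north"/"south" strata, and the fractions 0.5 and 0.1 are all hypothetical choices made for illustration, and rounding the per-stratum sample size to the nearest integer is an assumption.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, fractions, seed=0):
    """Draw a simple random sample within each stratum.

    records    -- iterable of observations
    stratum_of -- function mapping an observation to its stratum label
    fractions  -- dict: stratum label -> sampling fraction in [0, 1]
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    # Group the population by stratum.
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)

    # Sample without replacement inside each stratum.
    sample = []
    for label, members in strata.items():
        # Strata without a listed fraction are skipped (fraction 0).
        k = round(fractions.get(label, 0.0) * len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 100 "north" and 900 "south" observations.
population = [("north", i) for i in range(100)] + [("south", i) for i in range(900)]

# Different fractions per stratum: 50% of north, 10% of south.
sample = stratified_sample(population, lambda r: r[0], {"north": 0.5, "south": 0.1})

north_n = sum(1 for r in sample if r[0] == "north")
south_n = sum(1 for r in sample if r[0] == "south")
print(north_n, south_n)
```

With identical fractions in every stratum, each observation has the same selection probability; with different fractions, as here, the smaller stratum is deliberately over-represented relative to its share of the population.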

Over-sampling particular strata to over-represent rare events. In predictive data mining applications, it is often necessary to use stratified sampling to systematically over-sample particular "rare events" of interest, that is, to apply a greater sampling fraction to the stratum that contains them. For example, in catalog retailing, the response rate to particular catalog offers can be below 1%. When analyzing historical data (from prior campaigns) to build a model for targeting potential customers more successfully, it is desirable to over-sample past respondents (i.e., the "rare" respondents who ordered from the catalog); we can then apply the various model building techniques for classification (see Data Mining) to a sample consisting of approximately 50% responders and 50% non-responders. Otherwise, if we were to draw a simple random sample for the analysis (containing about 1% responders), then practically all model building techniques would likely predict a simple "no-response" for all cases, and would be (trivially) correct in 99% of the cases.
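The catalog example can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a prescribed procedure: the customer records, the "responded" field, the population size of 100,000 with exactly 1% responders, and the helper balanced_sample are all hypothetical. The sketch keeps every rare case and draws an equally sized random sample of the common cases, which is one simple way to reach a 50/50 split.

```python
import random


def balanced_sample(records, is_rare, seed=0):
    """Return a sample with equal numbers of rare and common cases."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    rare = [r for r in records if is_rare(r)]
    common = [r for r in records if not is_rare(r)]
    k = min(len(rare), len(common))
    # Sampling fraction is k/len(rare) for the rare stratum
    # and k/len(common) for the common stratum.
    return rng.sample(rare, k) + rng.sample(common, k)


# Hypothetical campaign history: 100,000 customers, 1% responded.
customers = [{"id": i, "responded": i < 1000} for i in range(100_000)]

balanced = balanced_sample(customers, lambda c: c["responded"])
responder_share = sum(c["responded"] for c in balanced) / len(balanced)

# Accuracy of always predicting "no response" on the full population.
baseline_accuracy = sum(not c["responded"] for c in customers) / len(customers)

print(len(balanced), responder_share, baseline_accuracy)
```

In this hypothetical setup the sampling fraction is 1.0 for responders and 1,000/99,000 (about 0.01) for non-responders, and the always-"no response" rule is correct for 99% of the full population, which is the trivial result described above.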