Data Preparation Phase

Data preparation and cleaning is an often neglected but extremely important step in the data mining process. The old saying "garbage in, garbage out" is particularly applicable to typical data mining projects, where large data sets collected via some automatic method (e.g., via the Web) serve as the input to the analyses. Often, the method by which the data were gathered was not tightly controlled, and so the data may contain out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), and the like. Analyzing data that have not been carefully screened for such problems can produce highly misleading results, particularly in predictive data mining.

In data mining, the input data are often "noisy": they contain many errors and sometimes include information in unstructured form (e.g., in text mining). For example, suppose you want to analyze a large database of information collected online via the Web, based on voluntary responses of people visiting your Web site (e.g., potential customers of a Web-based retailer who filled out suggestion forms). In this instance, it is very important to first verify and "clean" the data in a data preparation phase before applying any analytic procedures. For example, some individuals might enter clearly faulty information (e.g., age = 300), either by mistake or intentionally. If such data errors are not detected before the analysis phase of the data mining project, they can greatly bias the results and lead to unjustified conclusions. Typically, during the data preparation phase, the data analyst applies "filters" to the data to verify that values fall within valid ranges and to delete impossible co-occurrences of values (e.g., Age=5; Retired=Yes).
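
The kind of "filter" described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the field names (age, income, retired), the valid ranges, the age threshold used in the consistency rule, and the helper names find_problems and screen are all hypothetical, and a real project would derive its rules from knowledge of how the data were collected.

```python
# Hypothetical range rules: field name -> (lowest, highest) acceptable value.
# The bounds are illustrative assumptions, not recommendations.
RANGE_RULES = {
    "age": (0, 120),
    "income": (0, float("inf")),
}

# Hypothetical consistency rules: rule name -> test that returns True
# when a record contains an impossible combination of values.
COMBINATION_RULES = {
    "retired_minor": lambda r: r.get("retired") is True
    and r.get("age") is not None
    and r["age"] < 18,
}


def find_problems(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, (low, high) in RANGE_RULES.items():
        value = record.get(field)
        if value is None:
            continue  # missing values are a separate issue, not handled here
        if not (low <= value <= high):
            problems.append(f"{field} out of range: {value}")
    for name, is_impossible in COMBINATION_RULES.items():
        if is_impossible(record):
            problems.append(f"impossible combination: {name}")
    return problems


def screen(records):
    """Split records into clean ones and (record, problems) rejects."""
    clean, rejected = [], []
    for record in records:
        problems = find_problems(record)
        if problems:
            rejected.append((record, problems))
        else:
            clean.append(record)
    return clean, rejected


# Made-up example records echoing the faulty values mentioned in the text.
RECORDS = [
    {"age": 34, "income": 52000, "retired": False},
    {"age": 300, "income": 41000, "retired": False},  # out-of-range age
    {"age": 5, "income": 0, "retired": True},         # impossible combination
    {"age": 67, "income": -100, "retired": True},     # out-of-range income
]

clean, rejected = screen(RECORDS)
for record, problems in rejected:
    print(record, "->", "; ".join(problems))
```

In this sketch the flagged records are set aside together with the reason for rejection rather than silently deleted, so that the analyst can review them and decide whether to correct or discard each one. That is one possible design choice, not the only reasonable one.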