Association Rules - Introductory Overview

The goal of association rules techniques is to detect relationships or associations between specific values of categorical variables in large data sets. This is a common task in many data mining projects and in text mining, a subcategory of data mining. These exploratory techniques have a wide range of applications in business practice and in research, from the analysis of consumer preferences or human resource management to the history of language. The techniques make it possible for analysts and researchers to uncover hidden patterns in large data sets, such as "customers who order product A often also order product B or C" or "employees who said positive things about initiative X also frequently complain about issue Y but are happy with issue Z." The implementation of the so-called a priori algorithm (see Agrawal and Swami, 1993; Agrawal and Srikant, 1994; Han and Lakshmanan, 2001; see also Witten and Frank, 2000) in Statistica enables you to process huge data sets rapidly for such associations, based on predefined "threshold" values for detection.

How association rules work. The usefulness of this technique for addressing unique data mining problems is best illustrated with a simple example. Suppose you are collecting data at the check-out cash registers of a large book store. Each customer transaction is logged in a database, and consists of the titles of the books purchased by the respective customer, perhaps additional magazine titles and other gift items that were purchased, and so on. Hence, each record in the database represents one customer (transaction), and may consist of a single book purchased by that customer or of many (perhaps hundreds of) different items, arranged in an arbitrary order that depends on the order in which the items (books, magazines, and so on) came down the conveyor belt at the cash register. The purpose of the analysis is to find associations between the items that were purchased, i.e., to derive association rules that identify the items and co-occurrences of different items that appear with the greatest (co-)frequencies. For example, you want to learn which books are likely to be purchased by a customer whom you know has already purchased (or is about to purchase) a particular book. This type of information could then quickly be used to suggest those additional titles to the customer. You may already be familiar with the results of these types of analyses if you are a customer of on-line (Web-based) retail businesses: when you make a purchase on-line, the vendor will often suggest items similar to the ones you purchased at the time of "check-out," based on rules such as "customers who buy book title A are also likely to purchase book title B," and so on.
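The bookstore example can be sketched in a few lines of Python. This is an illustration only, not the Statistica implementation. The transactions are invented, and the two measures used as cut-offs are the conventional ones: support (the share of all transactions that contain both items) and confidence (the share of transactions containing the first item that also contain the second). It is an assumption that these correspond to the "threshold" values mentioned earlier; the function name pair_rules and the default cut-offs are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each set is one customer transaction.
transactions = [
    {"Book A", "Book B", "Magazine M"},
    {"Book A", "Book B"},
    {"Book A", "Book C"},
    {"Book B", "Magazine M"},
    {"Book A", "Book B", "Book C"},
]


def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return two-item rules (body, head, support, confidence) above both cut-offs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(basket)  # sorting makes the pair keys order-independent
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair co-occurs too rarely to be of interest
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for body, head in ((a, b), (b, a)):
            confidence = count / item_counts[body]
            if confidence >= min_confidence:
                rules.append((body, head, support, confidence))
    return sorted(rules)


rules = pair_rules(transactions)
for body, head, support, confidence in rules:
    print(f"if {body} then {head}: support={support:.2f}, confidence={confidence:.2f}")
```

With this toy data, four rules pass both cut-offs, for example "if Book A then Book B" with support 0.60 and confidence 0.75. Note that the sketch only looks at pairs of items; it cannot find rules involving three or more items.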

Unique data analysis requirements. In principle, Statistica already contains all the tools necessary to analyze data as described above, and to compute the results (tables) of interest. For example, Crosstabulation tables, and in particular Multiple Response tables in Basic Statistics, can be used to analyze data of this kind (but see the Technical Note on Coding of Multiple Response Variables). However, when the number of different items (categories) in the data is very large (and not known ahead of time), and when the "factorial degree" of important association rules is not known ahead of time, the Basic Statistics tabulation facilities may be too cumbersome to use, or simply not applicable. Consider once more the simple bookstore example discussed earlier. First, the number of book titles is practically unlimited. In other words, if we made a table in which each book title represented one dimension, and the purchase of that book (yes/no) formed the classes or categories of each dimension, then the complete crosstabulation table would be huge and sparse (consisting mostly of empty cells). Alternatively, we could construct all possible two-way tables from all items available in the store; this would allow us to detect two-way associations (association rules) between items. However, the number of tables that would have to be constructed would again be huge, most of the two-way tables would be sparse, and worse, if there were any three-way association rules "hiding" in the data, we would miss them completely. The a priori algorithm implemented in Statistica Association Rules not only automatically detects the relationships ("crosstabulation tables") that are important (i.e., crosstabulation tables that are not sparse and do not contain mostly zeros), but also determines the factorial degree of the tables that contain the important association rules.
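The way a level-wise search avoids building every possible table can be sketched as follows. This is a minimal Apriori-style sketch under stated assumptions, not the code used in Statistica: the function name frequent_itemsets, the min_support parameter, and the toy transactions are all hypothetical. Item sets of size k are only considered if all of their subsets of size k-1 were already frequent, so the search stops by itself at the highest degree that the data support.

```python
from itertools import combinations

# Hypothetical toy data: single letters stand in for book titles.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "D"},
    {"B", "E"},
    {"C", "E"},
]


def frequent_itemsets(transactions, min_support):
    """Level-wise search; returns {itemset: support} for all frequent item sets."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Share of transactions that contain every item of the item set.
        return sum(itemset <= basket for basket in baskets) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: single items that occur often enough.
    current = keep_frequent({frozenset([i]) for b in baskets for i in b})
    found = {}
    k = 1
    while current:
        found.update(current)
        k += 1
        # Join step: combine frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
    return found


itemsets = frequent_itemsets(transactions, min_support=0.5)
for itemset, s in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"support={s:.2f}")
```

In this toy data the three-way combination of A, B, and C is found without the degree being specified in advance, while the rare items D and E are discarded at the first level and never enter any larger candidate.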

To summarize, you can use the Association Rules module of Statistica to find rules of the kind "If X then (likely) Y," where X and Y can be single values, items, words, etc., or conjunctions of values, items, words, etc. (e.g., if (Car=Porsche and Gender=Male and Age<20) then (Risk=High and Insurance=High)). The program can be used to analyze simple categorical variables, dichotomous variables, and/or multiple response variables. The algorithm will determine association rules without requiring you to specify the number of distinct categories present in the data, or to have any prior knowledge of the maximum factorial degree or complexity of the important associations. In a sense, the algorithm will construct crosstabulation tables without the need to specify the number of dimensions for the tables or the number of categories for each dimension. Hence, this technique is particularly well suited for data and text mining of huge databases.
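A rule whose parts are conjunctions of values can be evaluated in the same way as a rule on single items, once each variable/value pair is treated as one item. The sketch below illustrates this with invented records; the variable names, the AgeGroup coding (a pre-binned stand-in for Age<20), and the functions to_items and evaluate_rule are assumptions made for the example and do not describe how Statistica stores or codes the data.

```python
# Hypothetical records: one dictionary of categorical variables per case.
records = [
    {"Car": "Porsche", "Gender": "Male", "AgeGroup": "<20", "Risk": "High"},
    {"Car": "Porsche", "Gender": "Male", "AgeGroup": "<20", "Risk": "High"},
    {"Car": "Porsche", "Gender": "Male", "AgeGroup": "<20", "Risk": "Low"},
    {"Car": "Porsche", "Gender": "Female", "AgeGroup": "20+", "Risk": "Low"},
    {"Car": "Sedan", "Gender": "Male", "AgeGroup": "<20", "Risk": "Low"},
]


def to_items(record):
    """Encode each variable/value pair as a single item such as 'Car=Porsche'."""
    return frozenset(f"{name}={value}" for name, value in record.items())


def evaluate_rule(records, body, head):
    """Return (support, confidence) of the rule 'if body then head'.

    body and head are collections of items; all items in each are
    required together (a conjunction).
    """
    baskets = [to_items(r) for r in records]
    body, head = frozenset(body), frozenset(head)
    n_body = sum(body <= b for b in baskets)           # cases matching X
    n_both = sum((body | head) <= b for b in baskets)  # cases matching X and Y
    support = n_both / len(baskets)
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence


support, confidence = evaluate_rule(
    records,
    body={"Car=Porsche", "Gender=Male", "AgeGroup=<20"},
    head={"Risk=High"},
)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Here three of the five records match the "if" part and two of those also match the "then" part, so the rule has a support of 0.40 and a confidence of about 0.67.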