The goal of association rules techniques is to detect relationships
or associations between specific values of categorical variables in large
data sets. This is a common task in many data mining projects, including
text mining, a subcategory of data mining. These powerful exploratory
techniques have a wide range of applications in many areas of business
practice and research, from the analysis of consumer preferences or human
resource management to the history of language. The techniques make it
possible for analysts
and researchers to uncover hidden patterns in large data sets, such as
"customers who order product A often also order product B or C" or
"employees who said positive things about initiative X also frequently
complain about issue Y but are happy with issue Z."

How association rules work. The usefulness of this technique to address
unique data mining problems is best illustrated in a simple example.
Suppose you are collecting
data at the check-out cash registers at a large book store. Each customer
transaction is logged in a database, and consists of the titles of the
books purchased by the respective customer, perhaps additional magazine
titles and other gift items that were purchased, and so on. Hence, each
record in the database will represent one customer (transaction), and
may consist of a single book purchased by that customer or may consist
of many (perhaps hundreds of) different items that were purchased, arranged
in an arbitrary order depending on the order in which the different items
(books, magazines, and so on) came down the conveyor belt at the cash
register. The purpose of the analysis is to find associations between
the items that were purchased, i.e., to derive association rules that
identify the items and co-occurrences of different items that appear with
the greatest (

Unique data analysis requirements. In principle, Statistica already contains all the tools necessary to analyze data as described above and to compute the results (tables) of interest. For example, Crosstabulation tables, and in particular Multiple Response tables in Basic Statistics, can be used to analyze data of this kind (but see the Technical Note on Coding of Multiple Response Variables). However, when the number of different items (categories) in the data is very large and not known ahead of time, and when the "factorial degree" of the important association rules is also not known ahead of time, the Basic Statistics tabulation facilities may be too cumbersome to use, or simply not applicable.

Consider once more the simple bookstore example discussed earlier. First, the number of book titles is practically unlimited. If we built a table in which each book title represented one dimension, and the purchase of that book (yes/no) defined the classes or categories for each dimension, the complete crosstabulation table would be huge and sparse (consisting mostly of empty cells). Alternatively, we could construct all possible two-way tables from all items available in the store; this would allow us to detect two-way associations (association rules) between items. However, the number of tables that would have to be constructed would again be huge, most of the two-way tables would be sparse, and, worse, if there were any three-way association rules "hiding" in the data, we would miss them completely. The Apriori algorithm implemented in Statistica Association Rules not only automatically detects the relationships ("crosstabulation tables") that are important (i.e., crosstabulation tables that are not sparse, not containing mostly zeros), but also determines the factorial degree of the tables that contain the important association rules.
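The level-wise search described above can be sketched in a few lines of Python. This is a minimal illustration of the general Apriori idea, not Statistica's implementation: the function name `frequent_itemsets`, the `min_count` threshold, and the toy bookstore transactions are all assumptions made for this example. The sketch grows itemsets one item at a time and keeps only those that occur in at least `min_count` transactions, so the largest surviving itemset size (the "factorial degree") is found from the data rather than specified in advance.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Return {itemset: count} for every itemset found in >= min_count baskets.

    Hypothetical helper for illustration; a level-wise (Apriori-style) search.
    """
    baskets = [frozenset(t) for t in transactions]

    def count(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for b in baskets if itemset <= b)

    # Level 1: every single item that is frequent enough.
    singles = {frozenset([item]) for b in baskets for item in b}
    current = {s: count(s) for s in singles}
    current = {s: c for s, c in current.items() if c >= min_count}

    frequent = {}
    size = 1
    while current:
        frequent.update(current)
        size += 1
        # Join step: merge frequent itemsets of the previous size into candidates.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        }
        current = {c: count(c) for c in candidates}
        current = {c: n for c, n in current.items() if n >= min_count}
    return frequent


# Made-up transactions; item order within a basket is irrelevant.
transactions = [
    {"novel", "magazine", "bookmark"},
    {"novel", "magazine"},
    {"novel", "bookmark"},
    {"magazine", "bookmark", "novel"},
    {"cookbook"},
]

frequent = frequent_itemsets(transactions, min_count=2)
max_degree = max(len(itemset) for itemset in frequent)

for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
print("largest frequent itemset size:", max_degree)
```

Because a candidate is counted only when all of its smaller subsets are already frequent, the sparse combinations that would dominate a full crosstabulation are never built.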

To summarize, you can use the Association Rules module of Statistica to find rules of the kind "If X then (likely) Y," where X and Y can be single values, items, words, etc., or conjunctions of values, items, words, etc. (e.g., if (Car=Porsche and Gender=Male and Age<20) then (Risk=High and Insurance=High)). The program can be used to analyze simple categorical variables, dichotomous variables, and/or multiple response variables. The algorithm will determine association rules without requiring the user to specify the number of distinct categories present in the data or to have any prior knowledge of the maximum factorial degree or complexity of the important associations. In a sense, the algorithm will construct crosstabulation tables without the need to specify the number of dimensions for the tables or the number of categories for each dimension. Hence, this technique is particularly well suited for data and text mining of huge databases.
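As a rough illustration of what "If X then (likely) Y" means for conjunctions of values, the following Python sketch scores one such rule on a handful of records. Everything here is an assumption made for the example: the `rule_stats` helper, the made-up records, and the choice of support and confidence as measures, which are the conventional ones for association rules but are not named in the text above. Each record is encoded as a set of "Variable=Value" strings, so simple categorical, dichotomous, and multiple response variables can all be treated the same way.

```python
def rule_stats(records, body, head):
    """Score the rule 'if body then head' on a list of records.

    Hypothetical helper for illustration. Returns (support, confidence):
    support    = share of all records that contain body and head together
    confidence = share of records containing body that also contain head
    """
    body, head = frozenset(body), frozenset(head)
    n_body = sum(1 for r in records if body <= r)
    n_both = sum(1 for r in records if (body | head) <= r)
    support = n_both / len(records)
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence


# Made-up records, each a set of "Variable=Value" items.
records = [
    frozenset({"Car=Porsche", "Gender=Male", "Age<20", "Risk=High", "Insurance=High"}),
    frozenset({"Car=Porsche", "Gender=Male", "Age<20", "Risk=High", "Insurance=High"}),
    frozenset({"Car=Porsche", "Gender=Male", "Age<20", "Risk=Low", "Insurance=Low"}),
    frozenset({"Car=Sedan", "Gender=Female", "Age>=20", "Risk=Low", "Insurance=Low"}),
]

support, confidence = rule_stats(
    records,
    body={"Car=Porsche", "Gender=Male", "Age<20"},
    head={"Risk=High", "Insurance=High"},
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

The "(likely)" in the rule corresponds to the confidence value: the rule holds for some, but not necessarily all, of the records that match its "if" part.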