Multiple Correspondence Analysis (MCA)
Multiple correspondence analysis (MCA) can be considered to be an extension
of simple correspondence analysis to more than two variables. For an introductory
overview of simple correspondence analysis, refer to the Introductory
Overview. Multiple correspondence analysis is a simple correspondence
analysis carried out on an indicator (or design) matrix with cases as
rows and categories of variables as columns. In practice, one usually analyzes
the inner product of such a matrix with itself, called the Burt table in an MCA; this will be discussed
later. However, to clarify the interpretation of the results from a multiple
correspondence analysis, it is easier to discuss the simple correspondence
analysis of an indicator or design matrix.
Indicator or design matrix. Consider again the simple two-way table presented in the
Introductory Overview:

                          Smoking Category
Staff Group               (1) None   (2) Light   (3) Medium   (4) Heavy   Row Totals
(1) Senior Managers              4           2            3           2           11
(2) Junior Managers              4           3            7           4           18
(3) Senior Employees            25          10           12           4           51
(4) Junior Employees            18          24           33          13           88
(5) Secretaries                 10           6            7           2           25
Column Totals                   61          45           62          25          193
Suppose you had entered the data for this table in the following manner,
as an indicator or design matrix:

         Staff Group                                            Smoking
Case     Senior    Junior    Senior     Junior
Number   Manager   Manager   Employee   Employee   Secretary    None   Light   Medium   Heavy
1        1         0         0          0          0            1      0       0        0
2        1         0         0          0          0            1      0       0        0
3        1         0         0          0          0            1      0       0        0
4        1         0         0          0          0            1      0       0        0
5        1         0         0          0          0            0      1       0        0
...      .         .         .          .          .            .      .       .        .
...      .         .         .          .          .            .      .       .        .
...      .         .         .          .          .            .      .       .        .
191      0         0         0          0          1            0      0       1        0
192      0         0         0          0          1            0      0       0        1
193      0         0         0          0          1            0      0       0        1
Each one of the 193 total cases in the table is represented by one case
in this data file. For each case, a 1 is entered into the category where
the respective case "belongs," and a 0 otherwise. For example, case 1
represents a Senior Manager in the smoking category None.
As can be seen in the two-way table above, there are a total of 4
such cases, and thus there will be four cases
like this in the indicator matrix. In all, there will be 193
cases in the indicator or design matrix.
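The expansion from the two-way table to the indicator matrix can be sketched in a few lines. This is only an illustration, assuming Python with NumPy (not part of STATISTICA); the function name `expand_to_indicator` is made up for this example. The cell counts are the ones from the table above, and the cases are generated in row-major order so that they line up with the case numbers shown.

```python
import numpy as np

# Two-way table from the text: rows = staff groups, columns = smoking categories.
table = np.array([
    [ 4,  2,  3,  2],   # Senior Managers
    [ 4,  3,  7,  4],   # Junior Managers
    [25, 10, 12,  4],   # Senior Employees
    [18, 24, 33, 13],   # Junior Employees
    [10,  6,  7,  2],   # Secretaries
])

def expand_to_indicator(table):
    """Return one 0/1 row per case: staff-group columns first, then smoking columns."""
    n_rows, n_cols = table.shape
    i_idx, j_idx = np.indices(table.shape)
    # Repeat each cell's (row, column) index as many times as its frequency.
    i_rep = np.repeat(i_idx.ravel(), table.ravel())
    j_rep = np.repeat(j_idx.ravel(), table.ravel())
    Z = np.zeros((table.sum(), n_rows + n_cols), dtype=int)
    cases = np.arange(Z.shape[0])
    Z[cases, i_rep] = 1             # staff group of each case
    Z[cases, n_rows + j_rep] = 1    # smoking category of each case
    return Z

Z = expand_to_indicator(table)
print(Z.shape)   # (193, 9)
print(Z[:5])     # cases 1 to 5, as in the listing above
```

Every row of `Z` contains exactly two 1's (one per variable), and the column sums reproduce the row and column totals of the two-way table.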
Analyzing the design matrix. If you now analyzed this data file (design
or indicator matrix) shown above as if it were a two-way frequency table,
the results of the correspondence analysis would provide column coordinates
that would allow you to relate the different categories to each other,
based on the distances between the row points, i.e., between the individual
cases. In fact, the two-dimensional display you would obtain for the column
coordinates would look very similar to the combined display for row and
column coordinates, if you had performed the simple correspondence analysis
on the two-way frequency table (note that the metric will be different,
but the relative positions of the points will be very similar).
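The relationship between the two analyses can be checked numerically. The sketch below assumes Python with NumPy and computes a simple correspondence analysis via a singular value decomposition of the standardized residuals; this is one standard way to compute the solution, not necessarily the algorithm used by any particular program, and the name `correspondence_analysis` is illustrative. For two variables, the standard coordinates of the categories agree between the two analyses (up to the sign of each axis), while the principal inertias differ: each inertia of the indicator analysis equals (1 + s)/2, where s is the corresponding singular value from the two-way table. That rescaling is the "different metric."

```python
import numpy as np

def correspondence_analysis(F):
    """Simple CA of a nonnegative matrix F.

    Returns the singular values and the column standard coordinates."""
    P = F / F.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    _, sv, Vt = np.linalg.svd(S, full_matrices=False)
    col_std = Vt.T / np.sqrt(c)[:, None]
    return sv, col_std

table = np.array([[4, 2, 3, 2], [4, 3, 7, 4], [25, 10, 12, 4],
                  [18, 24, 33, 13], [10, 6, 7, 2]])

# Expand the table into the 193 x 9 indicator matrix (staff columns, then smoking).
i_idx, j_idx = np.indices(table.shape)
n_cases = table.sum()
Z = np.zeros((n_cases, 9))
Z[np.arange(n_cases), np.repeat(i_idx.ravel(), table.ravel())] = 1
Z[np.arange(n_cases), 5 + np.repeat(j_idx.ravel(), table.ravel())] = 1

sv_table, std_table = correspondence_analysis(table.astype(float))
sv_Z, std_Z = correspondence_analysis(Z)

# Principal inertias of the indicator analysis versus (1 + s) / 2.
print(sv_Z[:3] ** 2)
print((1 + sv_table[:3]) / 2)

# Smoking categories on the first two axes: same positions up to axis sign.
print(np.abs(std_Z[5:, :2]))
print(np.abs(std_table[:, :2]))
```

The comparison uses absolute values because the sign of each axis is arbitrary in a singular value decomposition.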
More than two variables. The approach to analyzing categorical data outlined above can easily
be extended to more than two categorical variables. For example, the indicator
or design matrix could contain two additional variables Male
and Female, again coded 0 and 1,
to indicate the subjects' gender; and three variables could be added to
indicate to which one of three age groups a case belongs. Thus, in the
final display, one could represent the relationships (similarities) between
Gender, Age, Smoking habits, and Occupation (Staff Groups).
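As a small illustration of such an extended design matrix, the following sketch (Python with NumPy assumed) stacks one 0/1 block per variable side by side. The four cases and the age-group labels are entirely hypothetical and serve only to show the layout; the helper `one_hot` is likewise made up for this example.

```python
import numpy as np

def one_hot(labels, categories):
    """0/1 block with one column per category, in the given category order."""
    labels = np.asarray(labels)
    return (labels[:, None] == np.asarray(categories)[None, :]).astype(int)

# Category lists; the age-group labels are hypothetical placeholders.
staff_cats   = ["Senior Manager", "Junior Manager", "Senior Employee",
                "Junior Employee", "Secretary"]
smoking_cats = ["None", "Light", "Medium", "Heavy"]
gender_cats  = ["Male", "Female"]
age_cats     = ["Age group 1", "Age group 2", "Age group 3"]

# Four hypothetical cases (not taken from any real data set).
staff   = ["Senior Manager", "Secretary", "Junior Employee", "Junior Employee"]
smoking = ["None", "Heavy", "Light", "Medium"]
gender  = ["Female", "Female", "Male", "Male"]
age     = ["Age group 2", "Age group 3", "Age group 1", "Age group 1"]

# One block of indicator columns per variable, placed side by side.
Z = np.hstack([
    one_hot(staff, staff_cats),
    one_hot(smoking, smoking_cats),
    one_hot(gender, gender_cats),
    one_hot(age, age_cats),
])
print(Z.shape)          # (4, 14): 5 + 4 + 2 + 3 category columns
print(Z.sum(axis=1))    # each case has one 1 per variable, so every row sums to 4
```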
Fuzzy coding. It is not necessary that each case is assigned exclusively
to only one category of each categorical variable. Rather than the 0-or-1
coding scheme, one could enter probabilities for membership in
a category, or some other measure that represents a fuzzy rule for group
membership. Greenacre (1984) discusses different types of coding schemes
of this kind. For example, suppose in the example design matrix shown
earlier, you had missing data for a few cases regarding their smoking
habits. Instead of discarding those cases entirely from the analysis (or
creating a new category Missing data),
you could assign to the different smoking categories proportions (which
should add to 1.0) to represent the probabilities that the respective
case belongs to the respective category (e.g., you could enter proportions
based on your knowledge about estimates for the national averages for
the different categories).
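A fuzzy-coded row might be built as in the sketch below (Python with NumPy assumed; `coded_row` is an illustrative name). Because no national averages are given here, the sketch uses the observed column proportions of the two-way table as a stand-in; that choice is an assumption made only for the example.

```python
import numpy as np

# Stand-in membership proportions for a case with unknown smoking habits:
# the observed column totals of the two-way table (an assumption for this sketch;
# the text suggests external estimates such as national averages instead).
smoking_totals = np.array([61, 45, 62, 25], dtype=float)
proportions = smoking_totals / smoking_totals.sum()

def coded_row(staff_index, smoking_index=None, n_staff=5):
    """One design-matrix row; smoking is fuzzy-coded when smoking_index is None."""
    staff_part = np.zeros(n_staff)
    staff_part[staff_index] = 1.0
    if smoking_index is None:
        smoking_part = proportions.copy()        # fuzzy membership, sums to 1.0
    else:
        smoking_part = np.zeros(proportions.size)
        smoking_part[smoking_index] = 1.0        # ordinary 0-or-1 coding
    return np.concatenate([staff_part, smoking_part])

known = coded_row(staff_index=0, smoking_index=0)   # Senior Manager, None
missing = coded_row(staff_index=4)                  # Secretary, smoking unknown

print(known)
print(missing)
print(missing[5:].sum())   # the smoking proportions add to 1.0
```

Both rows carry the same total weight (one unit per variable), so a fuzzy-coded case counts as much in the analysis as an ordinarily coded one.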
Interpretation of coordinates and other results. To reiterate, the results
of a multiple correspondence analysis are identical to the results you
would obtain for the column coordinates from a simple correspondence analysis
of the design or indicator matrix. Therefore, the coordinate
values, quality values, squared cosines (cosine²), and other statistics reported
as the results from a multiple correspondence analysis can be interpreted
in the same manner as described in the context of the simple correspondence
analysis (see Introductory Overview). Note, however, that these statistics pertain
to the total inertia associated
with the entire design matrix.
Supplementary column points and "multiple regression" for categorical variables.
Another application of the analysis of design matrices via correspondence
analysis techniques is that it allows you to perform the equivalent of
a Multiple Regression for categorical variables, by adding supplementary
columns to the design matrix. For example, suppose you added to the design
matrix shown earlier two columns to indicate whether or not the respective
subject had been ill over the past year (i.e., you could add
one column Ill and another column
Not ill, and again enter 0's and 1's
to indicate each subject's health status). If, in a simple correspondence
analysis of the design matrix, you added those columns as supplementary
columns to the analysis, then (1) the summary statistics for the quality of
representation (see the Introductory Overview) for those columns would give
you an indication of how well
you can "explain" illness as a function of the other variables
in the design matrix, and (2) the display of the column points in the
final coordinate system would provide an indication of the nature (e.g.,
direction) of the relationships between the columns in the design matrix
and the column points indicating illness. This technique (adding supplementary
points to an MCA) is also sometimes called predictive
mapping.
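The sketch below shows one way supplementary columns could be placed into an existing solution (Python with NumPy assumed; the function name `project_supplementary` is illustrative). Each supplementary column is located at the weighted average of the row standard coordinates, with weights given by its column profile, and its quality is taken as the share of its squared chi-square distance from the average profile that is reproduced in the retained dimensions. This is a common textbook definition and is assumed here, not taken from the STATISTICA documentation. The Ill / Not ill indicator is generated at random purely as a placeholder, so the printed values carry no substantive meaning.

```python
import numpy as np

# Active data: the two-way table from the text, expanded to the 193 x 9 indicator matrix.
table = np.array([[4, 2, 3, 2], [4, 3, 7, 4], [25, 10, 12, 4],
                  [18, 24, 33, 13], [10, 6, 7, 2]])
i_idx, j_idx = np.indices(table.shape)
n_cases = table.sum()
Z = np.zeros((n_cases, 9))
Z[np.arange(n_cases), np.repeat(i_idx.ravel(), table.ravel())] = 1
Z[np.arange(n_cases), 5 + np.repeat(j_idx.ravel(), table.ravel())] = 1

# Simple CA of the active columns via SVD of the standardized residuals.
P = Z / Z.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
row_std = U / np.sqrt(r)[:, None]                    # row standard coordinates
col_principal = (Vt.T / np.sqrt(c)[:, None]) * sv    # active column principal coordinates

def project_supplementary(extra, n_dims=2):
    """Coordinates and quality of supplementary columns in the first n_dims axes."""
    profiles = extra / extra.sum(axis=0)             # each column profile sums to 1
    coords = profiles.T @ row_std[:, :n_dims]
    # Squared chi-square distance of each profile from the average profile (row masses).
    dist2 = (((profiles - r[:, None]) ** 2) / r[:, None]).sum(axis=0)
    quality = (coords ** 2).sum(axis=1) / dist2
    return coords, quality

# HYPOTHETICAL placeholder data: random health status, for illustration only.
rng = np.random.default_rng(0)
ill = rng.integers(0, 2, size=n_cases)
extra = np.column_stack([ill, 1 - ill]).astype(float)   # columns: Ill, Not ill

coords, quality = project_supplementary(extra)
print(coords)    # positions of Ill and Not ill in the two-dimensional display
print(quality)   # share of each point's inertia reproduced in two dimensions
```

As a sanity check on the projection rule, passing an active column (for example the Heavy column, `Z[:, [8]]`) through `project_supplementary` returns that column's own principal coordinates.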
The Burt table. The actual computations in multiple correspondence analysis are
not performed on a design or indicator matrix (which, potentially, may
be very large if there are many cases), but on the inner product of this
matrix with itself; this matrix is also called the Burt
matrix. With frequency tables, this amounts to tabulating the stacked
categories against each other; for example, the Burt table for the two-way frequency
table presented earlier would look like this:

                        Employee                      Smoking
                        (1)   (2)   (3)   (4)   (5)   (1)   (2)   (3)   (4)
(1) Senior Managers      11     0     0     0     0     4     2     3     2
(2) Junior Managers       0    18     0     0     0     4     3     7     4
(3) Senior Employees      0     0    51     0     0    25    10    12     4
(4) Junior Employees      0     0     0    88     0    18    24    33    13
(5) Secretaries           0     0     0     0    25    10     6     7     2
(1) Smoking:None          4     4    25    18    10    61     0     0     0
(2) Smoking:Light         2     3    10    24     6     0    45     0     0
(3) Smoking:Medium        3     7    12    33     7     0     0    62     0
(4) Smoking:Heavy         2     4     4    13     2     0     0     0    25
The Burt table has a clearly defined structure. In the case of two categorical
variables (shown above), it consists of 4 partitions: (1) the crosstabulation
of variable Employee against itself, (2) the crosstabulation of variable Employee
against variable Smoking, (3) the crosstabulation of variable Smoking
against variable Employee, and (4) the crosstabulation of variable Smoking
against itself. Note that the matrix is symmetrical, and that the
sum of the diagonal elements in each partition representing the crosstabulation
of a variable against itself must be the same (e.g., there were a total
of 193 observations in the present example, and hence the diagonal elements
in the crosstabulation of variable Employee
against itself, and of Smoking against
itself, must each sum to 193).
Note that the off-diagonal elements in the partitions representing the
crosstabulations of a variable against itself are equal to 0
in the table shown above. However, this is not necessarily always
the case; for example, those elements can be nonzero when the Burt
table was derived from a design or indicator matrix that included
fuzzy coding of category membership (see above).
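The structure just described can be verified directly by forming the inner product of the indicator matrix with itself. The following sketch assumes Python with NumPy; the fuzzy-coded row at the end uses arbitrary equal proportions of 0.25 chosen only to show the effect on the off-diagonal elements.

```python
import numpy as np

table = np.array([[4, 2, 3, 2], [4, 3, 7, 4], [25, 10, 12, 4],
                  [18, 24, 33, 13], [10, 6, 7, 2]])

# Indicator matrix: 5 staff-group columns followed by 4 smoking columns.
i_idx, j_idx = np.indices(table.shape)
n_cases = table.sum()
Z = np.zeros((n_cases, 9), dtype=int)
Z[np.arange(n_cases), np.repeat(i_idx.ravel(), table.ravel())] = 1
Z[np.arange(n_cases), 5 + np.repeat(j_idx.ravel(), table.ravel())] = 1

burt = Z.T @ Z                      # 9 x 9 Burt table
employee_block = burt[:5, :5]       # Employee against itself
cross_block = burt[:5, 5:]          # Employee against Smoking
smoking_block = burt[5:, 5:]        # Smoking against itself

print(burt)
print(np.array_equal(cross_block, table))                  # the original two-way table
print(np.array_equal(burt, burt.T))                        # symmetric
print(np.trace(employee_block), np.trace(smoking_block))   # both diagonals sum to 193

# Fuzzy coding for one case (arbitrary equal proportions, for illustration only):
Z_fuzzy = Z.astype(float)
Z_fuzzy[-1, 5:] = 0.25
smoking_fuzzy = (Z_fuzzy.T @ Z_fuzzy)[5:, 5:]
print(smoking_fuzzy - np.diag(np.diag(smoking_fuzzy)))     # off-diagonals are no longer 0
```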
Creating a Burt table in STATISTICA. The Correspondence
Analysis module will allow you to use a Burt
table directly for input into the analysis. The module can also automatically
create a Burt table from variables coded
in the standard manner, that is, if you included in your data file grouping
variables to indicate the group membership of each case (e.g., you included
a variable Gender, with the two
possible values Male and Female). Thus, in most cases there
is no need to recode your data in any special way (e.g., into a design
or indicator matrix), and you can analyze categorical variables coded
in a manner that also allows you to use, for example, the LogLinear
module, or Basic Statistics
module. Please refer to the Correspondence
Analysis: Table Specifications dialog for additional details
on the different ways in which data can be formatted for use with the
Correspondence
Analysis module.
Creating a customized Burt table. If your analysis
requires you to employ some customized fuzzy coding scheme for several
categorical variables, it is very easy to create a Burt
table via STATISTICA Visual BASIC;
that table can then be displayed in a spreadsheet and saved as a data
file for subsequent analysis with the Correspondence
Analysis module. Remember
that the Burt table is simply the inner product of the design or indicator matrix
with itself: if matrix X is the design
or indicator matrix, then the matrix product X'X
is a Burt table.
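To see why X'X can stand in for the much larger matrix X, the sketch below (Python with NumPy assumed, used here instead of STATISTICA Visual BASIC; the function name `ca_columns` is illustrative) analyzes both matrices with the same simple correspondence analysis routine. The category standard coordinates coincide up to axis sign, and the singular values obtained from the Burt table equal the principal inertias (squared singular values) obtained from the indicator matrix.

```python
import numpy as np

def ca_columns(F):
    """Singular values and column standard coordinates of a simple CA of F."""
    P = F / F.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    _, sv, Vt = np.linalg.svd(S, full_matrices=False)
    return sv, Vt.T / np.sqrt(c)[:, None]

# Design matrix X for the two-way table used throughout (193 cases, 9 categories).
table = np.array([[4, 2, 3, 2], [4, 3, 7, 4], [25, 10, 12, 4],
                  [18, 24, 33, 13], [10, 6, 7, 2]])
i_idx, j_idx = np.indices(table.shape)
n_cases = table.sum()
X = np.zeros((n_cases, 9))
X[np.arange(n_cases), np.repeat(i_idx.ravel(), table.ravel())] = 1
X[np.arange(n_cases), 5 + np.repeat(j_idx.ravel(), table.ravel())] = 1

burt = X.T @ X                       # 9 x 9, regardless of the number of cases

sv_X, std_X = ca_columns(X)
sv_burt, std_burt = ca_columns(burt)

print(sv_X[:3] ** 2)                 # principal inertias from the design matrix
print(sv_burt[:3])                   # singular values from the Burt table
print(np.abs(std_X[:, :2]))          # category positions from the design matrix
print(np.abs(std_burt[:, :2]))       # category positions from the Burt table
```

The same comparison works for a fuzzy-coded X, since nothing in the routine requires the entries to be 0 or 1.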
See also, Exploratory
Data Analysis and Data Mining Techniques.