Multiple Correspondence Analysis (MCA)

Multiple correspondence analysis (MCA) can be considered an extension of simple correspondence analysis to more than two variables. For an introduction to simple correspondence analysis, refer to the Introductory Overview. Multiple correspondence analysis is a simple correspondence analysis carried out on an indicator (or design) matrix with cases as rows and categories of variables as columns. In practice, one usually analyzes the inner product of such a matrix, called the Burt table; this is discussed later. However, the results of a multiple correspondence analysis are easier to interpret in terms of the simple correspondence analysis of an indicator or design matrix, so that case is described first.

Indicator or design matrix. Consider again the simple two-way table presented in the Introductory Overview:

 

                       Smoking Category
Staff Group            (1) None  (2) Light  (3) Medium  (4) Heavy  Row Totals
(1) Senior Managers           4          2           3          2          11
(2) Junior Managers           4          3           7          4          18
(3) Senior Employees         25         10          12          4          51
(4) Junior Employees         18         24          33         13          88
(5) Secretaries              10          6           7          2          25
Column Totals                61         45          62         25         193

Suppose you had entered the data for this table in the following manner, as an indicator or design matrix:

 

        Staff Group                                        Smoking
Case    Senior   Junior   Senior    Junior
Number  Manager  Manager  Employee  Employee  Secretary    None  Light  Medium  Heavy
1       1        0        0         0         0            1     0      0       0
2       1        0        0         0         0            1     0      0       0
3       1        0        0         0         0            1     0      0       0
4       1        0        0         0         0            1     0      0       0
5       1        0        0         0         0            0     1      0       0
...     ...      ...      ...       ...       ...          ...   ...    ...     ...
191     0        0        0         0         1            0     0      1       0
192     0        0        0         0         1            0     0      0       1
193     0        0        0         0         1            0     0      0       1

Each of the 193 cases in the table is represented by one case (row) in this data file. For each case, a 1 is entered in the category to which the case "belongs," and a 0 otherwise. For example, case 1 represents a Senior Manager who is a non-smoker (smoking category None). As can be seen in the two-way table above, there are 4 such cases, and thus there are four rows like this in the indicator matrix. In all, the indicator or design matrix contains 193 cases.
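The expansion from the two-way table to the indicator matrix can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the variable names are hypothetical, and the case order (row by row through the table) is one possible choice that happens to match the listing above.

```python
import numpy as np

# Two-way frequency table from the text
# (rows: staff groups, columns: smoking categories None, Light, Medium, Heavy).
counts = np.array([
    [4, 2, 3, 2],      # Senior Managers
    [4, 3, 7, 4],      # Junior Managers
    [25, 10, 12, 4],   # Senior Employees
    [18, 24, 33, 13],  # Junior Employees
    [10, 6, 7, 2],     # Secretaries
])

# For every cell of the table, repeat its (staff, smoking) index pair
# as many times as the cell count, giving one pair per case.
staff, smoke = np.indices(counts.shape)
staff = np.repeat(staff.ravel(), counts.ravel())
smoke = np.repeat(smoke.ravel(), counts.ravel())

# One 0/1 block per variable, placed side by side: 5 staff columns, 4 smoking columns.
X = np.hstack([np.eye(5, dtype=int)[staff], np.eye(4, dtype=int)[smoke]])

print(X.shape)   # (193, 9)
print(X[0])      # case 1:   [1 0 0 0 0 1 0 0 0]
print(X[4])      # case 5:   [1 0 0 0 0 0 1 0 0]
print(X[190])    # case 191: [0 0 0 0 1 0 0 1 0]
```

Each row contains exactly two 1's (one per variable), and the column sums reproduce the row and column totals of the two-way table.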

Analyzing the design matrix. If you now analyzed this data file (design or indicator matrix) shown above as if it were a two-way frequency table, the results of the correspondence analysis would provide column coordinates that would allow you to relate the different categories to each other, based on the distances between the row points, i.e., between the individual cases. In fact, the two-dimensional display you would obtain for the column coordinates would look very similar to the combined display for row and column coordinates, if you had performed the simple correspondence analysis on the two-way frequency table (note that the metric will be different, but the relative positions of the points will be very similar).
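To make this concrete, the following Python/NumPy sketch treats the indicator matrix as if it were a two-way frequency table and computes principal column coordinates. It uses the common formulation of correspondence analysis via a singular value decomposition of the standardized residuals; that formulation, the function name ca_columns, and the other variable names are assumptions made for illustration, not a description of how STATISTICA performs the computation.

```python
import numpy as np

# Two-way table from the text (rows: staff groups, columns: smoking categories).
counts = np.array([
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
])

# Expand the table into the 193 x 9 indicator matrix (one row per case).
staff, smoke = np.indices(counts.shape)
staff = np.repeat(staff.ravel(), counts.ravel())
smoke = np.repeat(smoke.ravel(), counts.ravel())
X = np.hstack([np.eye(5)[staff], np.eye(4)[smoke]])

def ca_columns(N):
    """Simple CA of table N: principal column coordinates and principal inertias."""
    P = N / N.sum()                                      # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)                  # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    _, s, Vt = np.linalg.svd(S, full_matrices=False)
    coords = (Vt.T * s) / np.sqrt(c)[:, None]            # scale axes, undo mass weighting
    return coords, s ** 2

coords, inertias = ca_columns(X)

labels = ["Senior Manager", "Junior Manager", "Senior Employee",
          "Junior Employee", "Secretary", "None", "Light", "Medium", "Heavy"]
for name, xy in zip(labels, coords[:, :2]):
    print(f"{name:16s} {xy[0]: .3f} {xy[1]: .3f}")

# For an indicator matrix with J = 9 categories and Q = 2 variables,
# the total inertia is J/Q - 1 = 3.5.
print("total inertia:", inertias.sum())
```

The first two columns of coords give the two-dimensional display of all nine categories. The signs of the axes are arbitrary, so a plot may appear mirrored relative to other software.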

More than two variables. The approach to analyzing categorical data outlined above can easily be extended to more than two categorical variables. For example, the indicator or design matrix could contain two additional variables Male and Female, again coded 0 and 1, to indicate the subjects' gender; and three variables could be added to indicate to which one of three age groups a case belongs. Thus, in the final display, one could represent the relationships (similarities) between Gender, Age, Smoking habits, and Occupation (Staff Groups).
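Extending the indicator matrix to further variables amounts to appending one block of 0/1 columns per variable. A minimal Python/NumPy sketch follows; the three cases, the age-group labels, and the helper name one_hot_block are hypothetical and serve only to show the layout.

```python
import numpy as np

def one_hot_block(values, categories):
    """Return a 0/1 block with one column per category and one row per case."""
    block = np.zeros((len(values), len(categories)), dtype=int)
    for i, value in enumerate(values):
        block[i, categories.index(value)] = 1
    return block

# HYPOTHETICAL cases, for illustration only.
gender = ["Male", "Female", "Female"]
age = ["under 30", "30 to 50", "over 50"]
smoking = ["None", "Heavy", "Light"]

X = np.hstack([
    one_hot_block(gender, ["Male", "Female"]),
    one_hot_block(age, ["under 30", "30 to 50", "over 50"]),
    one_hot_block(smoking, ["None", "Light", "Medium", "Heavy"]),
])

print(X.shape)  # (3, 9): 2 gender + 3 age + 4 smoking columns
print(X)
```

With Q variables, every row of such a matrix sums to Q, because each case falls into exactly one category per variable.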

Fuzzy coding. It is not necessary that each case be assigned exclusively to one category of each categorical variable. Rather than the 0-or-1 coding scheme, one could enter probabilities for membership in a category, or some other measure that represents a fuzzy rule for group membership. Greenacre (1984) discusses different types of coding schemes of this kind. For example, suppose that in the example design matrix shown earlier, the smoking habits of a few cases were missing. Instead of discarding those cases from the analysis entirely (or creating a new category Missing data), you could assign proportions (which should add to 1.0) to the different smoking categories to represent the probabilities that the respective case belongs to each category (e.g., you could enter proportions based on estimates of the national averages for the different categories).
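A fuzzy-coded row might be built as in the following Python/NumPy sketch. As an assumption made purely for illustration, the proportions used for a missing smoking value are the sample proportions from the column totals of the two-way table (61, 45, 62, 25); in practice they would come from whatever external estimate you consider appropriate. The function name fuzzy_row is hypothetical.

```python
import numpy as np

# ASSUMPTION: use the sample column totals as stand-in membership proportions.
smoking_totals = np.array([61, 45, 62, 25])          # None, Light, Medium, Heavy
proportions = smoking_totals / smoking_totals.sum()  # adds to 1.0

def fuzzy_row(staff_index, smoking_index=None, n_staff=5, fallback=proportions):
    """Build one design-matrix row; a missing smoking value gets proportions."""
    row = np.zeros(n_staff + len(fallback))
    row[staff_index] = 1.0
    if smoking_index is None:
        row[n_staff:] = fallback            # fuzzy membership across all categories
    else:
        row[n_staff + smoking_index] = 1.0  # ordinary 0-or-1 coding
    return row

print(fuzzy_row(0, 1))  # Senior Manager, Light smoker (crisp coding)
print(fuzzy_row(4))     # Secretary with unknown smoking habits (fuzzy coding)
```

Either way, the smoking columns of a row add to 1.0, so a fuzzy-coded case carries the same total weight as any other case.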

Interpretation of coordinates and other results. To reiterate, the results of a multiple correspondence analysis are identical to the results you would obtain for the column coordinates from a simple correspondence analysis of the design or indicator matrix. Therefore, the coordinate values, quality values, squared cosines (cosine²), and other statistics reported by a multiple correspondence analysis can be interpreted in the same manner as described for simple correspondence analysis (see Introductory Overview). Note, however, that these statistics pertain to the total inertia associated with the entire design matrix.

Supplementary column points and "multiple regression" for categorical variables. Another application of the analysis of design matrices via correspondence analysis techniques is that it allows you to perform the equivalent of a Multiple Regression for categorical variables, by adding supplementary columns to the design matrix. For example, suppose you added to the design matrix shown earlier two columns to indicate whether or not the respective subject had been ill over the past year (i.e., you could add one column Ill and another column Not ill, and again enter 0's and 1's to indicate each subject's health status). If, in a simple correspondence analysis of the design matrix, you added those columns as supplementary columns, then (1) the summary statistics for the quality of representation (see the Introductory Overview) for those columns would give you an indication of how well you can "explain" illness as a function of the other variables in the design matrix, and (2) the display of the column points in the final coordinate system would provide an indication of the nature (e.g., direction) of the relationships between the columns in the design matrix and the column points indicating illness. This technique (adding supplementary points to an MCA) is also sometimes called predictive mapping.
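The idea can be sketched in Python/NumPy as follows. Several things here are assumptions and should be read as such: the health status values are synthetic (randomly generated, not real data), the names are hypothetical, and the projection uses the standard transition formula in which a supplementary column's principal coordinates are its profile averaged over the standard row coordinates. Quality is computed as the share of the point's squared chi-square distance from the centroid that is reproduced in the retained dimensions.

```python
import numpy as np

counts = np.array([[4, 2, 3, 2], [4, 3, 7, 4], [25, 10, 12, 4],
                   [18, 24, 33, 13], [10, 6, 7, 2]])
staff, smoke = np.indices(counts.shape)
staff = np.repeat(staff.ravel(), counts.ravel())
smoke = np.repeat(smoke.ravel(), counts.ravel())
X = np.hstack([np.eye(5)[staff], np.eye(4)[smoke]])  # active columns, 193 x 9
n = X.shape[0]

# HYPOTHETICAL health status: synthetic 0/1 values, for illustration only.
rng = np.random.default_rng(0)
ill = (rng.random(n) < 0.3).astype(float)
supp = np.column_stack([ill, 1.0 - ill])             # columns: Ill, Not ill

# Simple CA of the active columns only.
P = X / X.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)
k = 2                                                # retained dimensions
row_std = U[:, :k] / np.sqrt(r)[:, None]             # standard row coordinates
active = (Vt.T[:, :k] * s[:k]) / np.sqrt(c)[:, None] # active column coordinates

def project_columns(cols):
    """Project columns into the existing space without changing the solution."""
    profiles = cols / cols.sum(axis=0)               # each column sums to 1
    coords = profiles.T @ row_std
    dist2 = (((profiles - r[:, None]) ** 2) / r[:, None]).sum(axis=0)
    quality = (coords ** 2).sum(axis=1) / dist2
    return coords, quality

supp_coords, supp_quality = project_columns(supp)
print("coordinates (Ill, Not ill):")
print(supp_coords)
print("quality:", supp_quality)
```

Because the supplementary columns do not enter the decomposition, the coordinates of the active columns are unchanged. With randomly generated health status, as here, the quality values should be low, since illness is by construction unrelated to the other variables.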

The Burt table. The actual computations in multiple correspondence analysis are not performed on a design or indicator matrix (which, potentially, may be very large if there are many cases), but on the inner product of this matrix; this matrix is also called the Burt matrix. With frequency tables, this amounts to tabulating the stacked categories against each other; for example the Burt table for the two-way frequency table presented earlier would look like this.

 

                        Employee                      Smoking
                        (1)   (2)   (3)   (4)   (5)   (1)   (2)   (3)   (4)
(1) Senior Managers      11     0     0     0     0     4     2     3     2
(2) Junior Managers       0    18     0     0     0     4     3     7     4
(3) Senior Employees      0     0    51     0     0    25    10    12     4
(4) Junior Employees      0     0     0    88     0    18    24    33    13
(5) Secretaries           0     0     0     0    25    10     6     7     2
(1) Smoking:None          4     4    25    18    10    61     0     0     0
(2) Smoking:Light         2     3    10    24     6     0    45     0     0
(3) Smoking:Medium        3     7    12    33     7     0     0    62     0
(4) Smoking:Heavy         2     4     4    13     2     0     0     0    25

The Burt table has a clearly defined structure. In the case of two categorical variables (shown above), it consists of 4 partitions: (1) the crosstabulation of variable Employee against itself, (2) the crosstabulation of variable Employee against variable Smoking, (3) the crosstabulation of variable Smoking against variable Employee, and (4) the crosstabulation of variable Smoking against itself. Note that the matrix is symmetrical, and that the diagonal elements in each partition representing the crosstabulation of a variable against itself must sum to the same value (e.g., there were a total of 193 observations in the present example; hence, the diagonal elements in the crosstabulation of variable Employee against itself, and of Smoking against itself, must each sum to 193).
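These properties can be checked numerically. The Python/NumPy sketch below (variable names are hypothetical) forms the Burt table as the inner product X'X of the indicator matrix and compares it with the same table assembled directly from the four partitions.

```python
import numpy as np

counts = np.array([[4, 2, 3, 2], [4, 3, 7, 4], [25, 10, 12, 4],
                   [18, 24, 33, 13], [10, 6, 7, 2]])

# Indicator matrix: one row per case, 5 staff columns followed by 4 smoking columns.
staff, smoke = np.indices(counts.shape)
staff = np.repeat(staff.ravel(), counts.ravel())
smoke = np.repeat(smoke.ravel(), counts.ravel())
X = np.hstack([np.eye(5, dtype=int)[staff], np.eye(4, dtype=int)[smoke]])

# Burt table as the inner product of the indicator matrix.
burt = X.T @ X

# The same table assembled from its four partitions.
by_blocks = np.block([
    [np.diag(counts.sum(axis=1)), counts],
    [counts.T, np.diag(counts.sum(axis=0))],
])

print(np.array_equal(burt, by_blocks))                 # True
print(np.array_equal(burt, burt.T))                    # True (symmetrical)
print(np.trace(burt[:5, :5]), np.trace(burt[5:, 5:]))  # 193 193
```

The second construction shows why the indicator matrix is not needed when only 0-or-1 coding is used: the crosstabulations alone determine the Burt table.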

Note that the off-diagonal elements in the partitions representing the crosstabulations of a variable against itself are equal to 0 in the table shown above. However, this is not always the case; for example, these elements can be nonzero when the Burt table was derived from a design or indicator matrix that included fuzzy coding of category membership (see above).

Creating a Burt table in STATISTICA. The Correspondence Analysis module will allow you to use a Burt table directly for input into the analysis. The module can also automatically create a Burt table from variables coded in the standard manner, that is, if you included in your data file grouping variables to indicate the group membership of each case (e.g., you included a variable Gender, with the two possible values Male and Female). Thus, in most cases there is no need to recode your data in any special way (e.g., into a design or indicator matrix), and you can analyze categorical variables coded in a manner that also allows you to use, for example, the Log-Linear module, or Basic Statistics module. Please refer to the Correspondence Analysis: Table Specifications dialog for additional details on the different ways in which data can be formatted for use with the Correspondence Analysis module.

Creating a customized Burt table. If your analysis requires a customized fuzzy coding scheme for several categorical variables, you can create a Burt table via STATISTICA Visual BASIC; that table can then be displayed in a spreadsheet and saved as a data file for subsequent analysis with the Correspondence Analysis module. Remember that the Burt table is simply the inner product of the design or indicator matrix: if matrix X is the design or indicator matrix, then the matrix product X'X is a Burt table.
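The X'X computation itself is independent of the tool used. The following sketch uses Python with NumPy rather than STATISTICA Visual BASIC, on a tiny design matrix of three hypothetical cases, one of which has fuzzy-coded smoking proportions (the values 0.3, 0.2, 0.3, 0.2 are made up for illustration).

```python
import numpy as np

# HYPOTHETICAL design matrix: 5 staff columns followed by 4 smoking columns.
X = np.array([
    [1, 0, 0, 0, 0, 1.0, 0.0, 0.0, 0.0],  # Senior Manager, None
    [0, 0, 0, 1, 0, 0.0, 0.0, 1.0, 0.0],  # Junior Employee, Medium
    [0, 0, 0, 0, 1, 0.3, 0.2, 0.3, 0.2],  # Secretary, smoking unknown (fuzzy)
])

# Customized Burt table: inner product of the design matrix.
burt = X.T @ X

np.set_printoptions(precision=2, suppress=True)
print(burt[5:, 5:])        # Smoking against itself: off-diagonal entries are nonzero
print(burt[5:, 5:].sum())  # still adds up to the number of cases
```

The fuzzy-coded case contributes products of its proportions to the off-diagonal cells of the Smoking partition, which is why that partition is no longer diagonal. The resulting array could then be written to a file and used as a Burt table for input.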

See also Exploratory Data Analysis and Data Mining Techniques.