Correspondence Analysis - Supplementary Points

The Introductory Overview explains how to interpret the coordinates and related statistics computed in a correspondence analysis. An important aid in interpreting the results is to include supplementary row or column points that were not used in the original analysis. The Correspondence Analysis module allows you to add both row and column points, and to plot their coordinates together with the regular row and column points in a combined graph (see also Computational Details).

For example, consider the following results for the example given in the Introductory Overview (based on Greenacre, 1984).

Row Name                  Dim. 1      Dim. 2
(1) Senior Managers     -.065768     .193737
(2) Junior Managers      .258958     .243305
(3) Senior Employees    -.380595     .010660
(4) Junior Employees     .232952    -.057744
(5) Secretaries         -.201089    -.078911
National Average        -.258368    -.117648

The table above shows the coordinate values (for two dimensions) computed for a frequency table of different types of employees by type of smoking habit. The row labeled National Average contains the coordinate values for the supplementary point, which is the national average (in percentages) for the different smoking categories that make up the columns of the table. The fictitious percentages reported in Greenacre (1984) are: nonsmokers: 42%; light smokers: 29%; medium smokers: 20%; heavy smokers: 9%. If you plotted these coordinates in a two-dimensional scatterplot, along with the column coordinates, it would be apparent that the National Average supplementary row point is plotted close to the point representing the Secretaries group, and on the same side of the origin along the horizontal axis (first dimension) as the Nonsmokers column point. If you refer back to the original two-way table shown in the Introductory Overview, this finding is consistent with the entries in the table of row frequencies: there are relatively more nonsmokers among the Secretaries and in the National Average. Put another way, the sample represented in the original frequency table contains more smokers than the national average.

While this type of information could have been easily gleaned from the original frequency table (that was used as the input to the analysis), in the case of very large tables, such conclusions may not be as obvious.
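To make the projection of a supplementary row concrete, the following is a minimal sketch in Python with NumPy, not the module's own implementation. It assumes the usual SVD formulation of correspondence analysis and places the supplementary row by multiplying its profile by the standard column coordinates (the transition formula). The frequency table is not reproduced in this section; the counts below are the Greenacre (1984) smoking data as commonly tabulated and should be checked against the Introductory Overview. All variable names are illustrative.

```python
import numpy as np

# ASSUMPTION: counts taken from the Greenacre (1984) smoking example;
# rows are the five staff groups, columns are none/light/medium/heavy smokers.
N = np.array([
    [4, 2, 3, 2],      # (1) Senior Managers
    [4, 3, 7, 4],      # (2) Junior Managers
    [25, 10, 12, 4],   # (3) Senior Employees
    [18, 24, 33, 13],  # (4) Junior Employees
    [10, 6, 7, 2],     # (5) Secretaries
], dtype=float)

P = N / N.sum()        # correspondence matrix (relative frequencies)
r = P.sum(axis=1)      # row masses
c = P.sum(axis=0)      # column masses

# Standardized residuals from independence; their SVD yields the dimensions.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

k = 2  # number of dimensions retained

# The sign of each SVD axis is arbitrary. Orient the axes so that the
# first row (Senior Managers) has signs (-, +), as in the table above.
signs = np.sign(U[0, :k]) * np.array([-1.0, 1.0])
U_k = U[:, :k] * signs
V_k = Vt[:k].T * signs

row_coords = (U_k / np.sqrt(r)[:, None]) * sv[:k]  # principal row coordinates
col_std = V_k / np.sqrt(c)[:, None]                # standard column coordinates

# Supplementary row: convert the percentages to a profile, then project it.
national = np.array([42, 29, 20, 9], dtype=float)
profile = national / national.sum()
sup_coords = profile @ col_std

print(np.round(row_coords, 6))
print(np.round(sup_coords, 6))
```

Under these assumptions the printed values should agree with the table above up to rounding. Because the supplementary profile is only projected onto the axes, it has no influence on the solution computed from the five regular rows.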

Quality of representation of supplementary points. Another interesting result for supplementary points concerns the quality of their representation in the chosen number of dimensions (see the Introductory Overview for a more detailed discussion of the concept of quality of representation). To reiterate, the goal of the correspondence analysis is to reproduce the distances between the row or column coordinates (patterns of relative frequencies across the columns or rows, respectively) in a low-dimensional solution. Given such a solution, you may ask whether particular supplementary points of interest can be represented equally well in the final space, that is, whether or not their distances from the other points in the table can also be represented in the chosen number of dimensions. Shown below are the summary statistics for the original points, and the supplementary row point National Average, for the two-dimensional solution.

 

Staff Group              Quality    Cosine² Dim.1    Cosine² Dim.2
(1) Senior Managers      .892568      .092232          .800336
(2) Junior Managers      .991082      .526400          .464682
(3) Senior Employees     .999817      .999033          .000784
(4) Junior Employees     .999810      .941934          .057876
(5) Secretaries          .998603      .865346          .133257
National Average         .761324      .630578          .130746

The statistics reported in the table above are discussed in the Introductory Overview. In short, the Quality of a row or column point is defined as the ratio of the squared distance of the point from the origin in the chosen number of dimensions, over the squared distance from the origin in the space defined by the maximum number of dimensions (remember that the metric here is Chi-square, as described in the Introductory Overview). In a sense, the overall quality is the "proportion of squared distance-from-the-overall-centroid accounted for." The supplementary row point National Average has a Quality of .76, indicating that it is reasonably well represented in the two-dimensional solution. The Cosine² statistic is the part of a point's Quality that is "accounted for" by the respective dimension (the sum of the Cosine² values over the chosen number of dimensions is equal to the total Quality; see also the Introductory Overview).
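The definition of Quality can be checked by hand for the National Average point. The sketch below is illustrative only: it assumes the column totals of the Greenacre (1984) smoking table (61, 45, 62, 25 out of 193, which are not listed in this section), takes the two coordinates from the first table, and uses a hypothetical helper name, quality_of_representation.

```python
import numpy as np

def quality_of_representation(profile, col_masses, coords):
    """Return (quality, cos2) for one row point.

    profile    : row profile (relative frequencies summing to 1)
    col_masses : column masses of the analyzed table (the centroid profile)
    coords     : principal coordinates of the point in the retained dimensions
    """
    profile = np.asarray(profile, dtype=float)
    col_masses = np.asarray(col_masses, dtype=float)
    coords = np.asarray(coords, dtype=float)
    # Squared Chi-square distance from the centroid in the full space.
    full_sq_dist = np.sum((profile - col_masses) ** 2 / col_masses)
    # Share of that squared distance carried by each retained dimension.
    cos2 = coords ** 2 / full_sq_dist
    return cos2.sum(), cos2

# ASSUMPTION: column totals of the smoking table (none, light, medium, heavy).
col_masses = np.array([61, 45, 62, 25], dtype=float) / 193

national_profile = np.array([0.42, 0.29, 0.20, 0.09])
national_coords = np.array([-0.258368, -0.117648])  # from the coordinate table

quality, cos2 = quality_of_representation(
    national_profile, col_masses, national_coords
)
print(round(quality, 4), np.round(cos2, 4))
```

With these inputs the result should be close to the National Average row of the table (Quality about .76, split into roughly .63 and .13 across the two dimensions). The same function applies to a regular row if its own profile and coordinates are supplied.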

See also Exploratory Data Analysis and Data Mining Techniques.