Example 1: Correspondence Analysis and Supplementary Points
This example is based on a fictitious data set presented in Greenacre
(1984, p. 55) to illustrate how to interpret the results of a correspondence
analysis. This data set is also discussed in the Introductory
Overview. In this example, the different formats of data files accepted
by the Correspondence
Analysis module will be illustrated, and the typical results of correspondence
analysis will be explained (see also Computational
Details). Also, the use of supplementary
points for aiding in the interpretation of results will be demonstrated.
Open the Smoking.sta data file:
Ribbon bar. Select the Home tab. In the File group, click the Open arrow
and select Open Examples to display the Open a STATISTICA Data File
dialog box. Smoking.sta is located in the Datasets folder.
Classic menus. From the File menu, select Open Examples to display
the Open a STATISTICA Data File dialog box. The data file is located
in the Datasets folder.
This file contains the frequency table, as presented in Greenacre (1984,
p. 55).
Formats of data files.
The Correspondence
Analysis module provides great flexibility with regard to the
permissible formats of input data. For example, in addition to the raw
frequency table contained in the file Smoking.sta,
you could also specify the two-way table by including in the data file
two grouping variables (one for the Employee
group, another for the Smoking
category). This format for the table is illustrated in the example data
file Smoking2.sta.
Finally, you can analyze raw data that are not pretabulated. The data
in the example file Smoking3.sta
are organized in this manner; that is, the file contains only two variables
(Employee and Smoking)
with codes that indicate the group to which each case belongs. There are
a total of 193 cases in that file.
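The three layouts carry the same information, which the following minimal Python sketch illustrates (this is not STATISTICA code). The cell counts are the smoking frequencies as published in Greenacre (1984), entered here as an assumption because this topic does not list them; check them against Smoking.sta. All variable names are hypothetical.

```python
from collections import Counter

# Assumed cell counts (Greenacre, 1984); verify against Smoking.sta.
GROUPS = ["Senior Managers", "Junior Managers", "Senior Employees",
          "Junior Employees", "Secretaries"]
CATEGORIES = ["None", "Light", "Medium", "Heavy"]
TABLE = [
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
]

# Layout 2 (in the spirit of Smoking2.sta): one record per cell,
# with two grouping codes and a frequency.
grouped = [(g, c, TABLE[i][j])
           for i, g in enumerate(GROUPS)
           for j, c in enumerate(CATEGORIES)]

# Layout 3 (in the spirit of Smoking3.sta): one record per case, no frequencies.
raw = [(g, c) for g, c, n in grouped for _ in range(n)]

# Tabulating the raw cases recovers the frequency table (layout 1).
counts = Counter(raw)
rebuilt = [[counts[(g, c)] for c in CATEGORIES] for g in GROUPS]

print(len(grouped), "cells,", len(raw), "cases")
print("table recovered:", rebuilt == TABLE)
```

Whichever layout is used as input, the analysis itself operates on the same two-way table.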
Specifying the analysis.
Start Correspondence Analysis:
Ribbon bar. Select the Statistics tab. In the Advanced/Multivariate
group, click Mult/Exploratory
and from the menu, select Correspondence
to display the Correspondence Analysis (CA): Table Specifications
Startup Panel.
Classic menus. From the Statistics - Multivariate Exploratory Techniques
submenu, select Correspondence Analysis to display the Correspondence
Analysis (CA): Table Specifications Startup Panel.
In this example, the data file contains frequencies without grouping
variables; therefore, select the Frequencies
w/out grouping vars option button under Input
on the Correspondence Analysis (CA) tab.
[If you want to use the file Smoking2.sta,
select the Frequencies with grouping
variables option button; to use the file Smoking3.sta,
select the Raw data (requires
tabulation) option button.]
Next select the variables. Click the Variables
with frequencies button to display the standard variable
selection dialog box. Select all variables, and then click the OK button.
Note that when you use this data file format (i.e., the input is a tabulated
frequency table), STATISTICA
will interpret the selected variables as the columns of the table to be
analyzed, and the cases as the rows of the table. Since the data in file
Smoking.sta are arranged in that
manner, click the OK button in
the Startup Panel to perform the correspondence analysis. The Correspondence Analysis Results dialog
box is displayed.
Reviewing the results.
Eigenvalues. If you are not
familiar with the correspondence analysis technique and the most important
statistics that are customarily computed, you may want to review the Introductory
Overview at this point. To reiterate, if you considered the relative
row frequencies as coordinates in a space consisting of as many dimensions
as there are columns, and the relative column frequencies as coordinates
in a space consisting of as many dimensions as there are rows, then the
main goal of the analysis is to reconstruct the distances between the
row points and to reconstruct the distances between the column points,
in a space defined by as few dimensions as possible.
First, click the Eigenvalues
button on the Advanced tab to produce the spreadsheet
that contains information about the number of dimensions that are necessary
to reconstruct the information in the table.
The first column shows the Number
of dimensions; a maximum of three dimensions can be extracted,
in which case the (relative) frequency table can be reconstructed exactly.
The Singular Values are computed
by the so-called generalized singular value decomposition of the table
of relative frequencies (see Computational
Details). The Eigenvalues
are the squared Singular Values,
and they will sum to the Total Inertia,
which is listed in the header of the spreadsheet as .08519.
The total inertia is defined as the Chi-square
value (16.442) divided by the
total number of cases (193). Thus, as discussed in the Introductory
Overview, the correspondence analysis can also be considered to be
a decomposition of the total Chi-square
value, in much the same way that principal components analysis (see Factor
Analysis) decomposes the total variance/covariance matrix of
continuous variables.
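The relation between the Chi-square value and the total inertia can be checked by hand. The following Python sketch is an illustration only; it assumes the cell counts published in Greenacre (1984), which are not listed in this topic and should be verified against Smoking.sta.

```python
# Assumed cell counts (Greenacre, 1984); verify against Smoking.sta.
# Rows: Senior Managers, Junior Managers, Senior Employees,
#       Junior Employees, Secretaries.
# Columns: None, Light, Medium, Heavy.
TABLE = [
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
]

n = sum(map(sum, TABLE))
row_totals = [sum(row) for row in TABLE]
col_totals = [sum(col) for col in zip(*TABLE)]

# Pearson Chi-square: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / n.
chi_square = sum(
    (obs - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i, row in enumerate(TABLE)
    for j, obs in enumerate(row)
)

# Total inertia is the Chi-square value divided by the number of cases.
total_inertia = chi_square / n
print(f"n = {n}, Chi-square = {chi_square:.3f}, total inertia = {total_inertia:.5f}")
```

Under the stated assumption about the counts, the printed values should agree with the Chi-square value and the Total Inertia quoted above.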
As you can see, the dimensions are computed so that the first dimension
extracts the most information (i.e., has the highest eigenvalue), the
next dimension extracts the second most information, and so on (see also
Computational
details). The first dimension in this case extracts 87.76%
of the total inertia. The inclusion of the second dimension increases
the "explained" inertia to 99.51%.
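As a rough cross-check outside STATISTICA, the singular values, eigenvalues, and percentages of inertia can be sketched with NumPy. The sketch assumes the Greenacre (1984) cell counts (not listed in this topic) and uses the ordinary SVD of the standardized residuals, which is one common way of writing the generalized singular value decomposition; it is not necessarily the exact routine the module uses.

```python
import numpy as np

# Assumed cell counts (Greenacre, 1984); verify against Smoking.sta.
N = np.array([
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
], dtype=float)

P = N / N.sum()          # table of relative frequencies
r = P.sum(axis=1)        # row masses
c = P.sum(axis=0)        # column masses

# Standardized residuals from the independence model.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# A 5 x 4 table has at most min(5, 4) - 1 = 3 non-trivial dimensions.
singular_values = np.linalg.svd(S, compute_uv=False)[:3]
eigenvalues = singular_values ** 2
percent = 100 * eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(percent)

for k in range(3):
    print(f"dim {k + 1}: singular value = {singular_values[k]:.5f}, "
          f"eigenvalue = {eigenvalues[k]:.5f}, "
          f"% inertia = {percent[k]:.2f}, cumulative = {cumulative[k]:.2f}")
print(f"total inertia = {eigenvalues.sum():.5f}")
```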
Note that on the Quick tab and Options tab, there are options
under Number of dimensions for
selecting the number of dimensions to retain in the analysis. You can
either directly request a certain Number
of dimensions, or allow STATISTICA
to determine the number of dimensions based on the respective user-defined
value for the Cumulative contribution
to inertia. As described in the Introductory Overview, correspondence
analysis is mostly a descriptive method, rather than a method for hypothesis
testing. Therefore, there are no fixed guidelines as to how to decide
on the number of dimensions to interpret. In this case, it is clear that
the first two dimensions will explain practically the total inertia for
the table.
Thus, accept the default 2
dimensions, and click the Row and column
coordinates button on the Advanced tab.
Reviewing the quality and inertias
of row and column points. Two spreadsheets will be displayed: one
for the row coordinates and one for the column coordinates.
The statistics reported in these spreadsheets are discussed in the Introductory
Overview. First look at the Quality
of the points. The Quality of a point is defined as the ratio of the squared
distance of the point from the origin in the chosen number of dimensions,
over the squared distance from the origin in the space defined by the
maximum number of dimensions (remember that the metric here is Chi-square,
as described in the Introductory Overview). By analogy to Factor Analysis,
the quality of a point is similar in its interpretation to the communality
for a variable in factor analysis. As you can see, both the row and column
points are represented quite well in the two-dimensional solution; the
quality for all points is .89
or higher.
The Relative inertia
values pertain to the proportion of the total inertia "accounted
for" by the respective point. Note that a point may be well represented
in a particular solution, but not contribute much to the total inertia.
From the spreadsheets, you can see that the row that contributes
most to the overall inertia is the one representing the Senior
Employees, and the column that contributes most is the one representing
the None smokers.
The part of the quality of each point that is due to each dimension can be
found in the columns labeled Cosine².
The Cosine² values, summed across the two dimensions, are equal to the
total Quality value. The relative contribution of each point to the inertia
for each dimension (remember that the Eigenvalues
represent the inertias associated with each dimension) is also shown in
the spreadsheets.
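The definitions of Quality, Relative inertia, and the squared cosines can be illustrated for the row points with a short NumPy sketch. It assumes the Greenacre (1984) cell counts (not listed in this topic) and coordinates scaled so that squared distances from the origin are Chi-square distances; variable names are hypothetical.

```python
import numpy as np

GROUPS = ["Senior Managers", "Junior Managers", "Senior Employees",
          "Junior Employees", "Secretaries"]
# Assumed cell counts (Greenacre, 1984); verify against Smoking.sta.
N = np.array([
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Row coordinates on all three non-trivial dimensions.
F = (U[:, :3] * sv[:3]) / np.sqrt(r)[:, None]

sq_dist = (F ** 2).sum(axis=1)             # squared distance from the origin, full space
cos2 = F[:, :2] ** 2 / sq_dist[:, None]    # share of that distance due to each dimension
quality = cos2.sum(axis=1)                 # quality in the two-dimensional solution
point_inertia = r * sq_dist                # mass times squared distance
rel_inertia = point_inertia / point_inertia.sum()

for name, q, ri, cs in zip(GROUPS, quality, rel_inertia, cos2):
    print(f"{name:17s} quality = {q:.3f}  relative inertia = {ri:.3f}  "
          f"cos2 = {cs[0]:.3f}, {cs[1]:.3f}")
```

The same computation applies to the column points if the roles of rows and columns are exchanged (use `Vt.T` and `c` in place of `U` and `r`).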
Standardization of row and column coordinates.
There are several options available on the Options
tab for standardizing the row and column coordinates. Note
that the interpretation of the row and column coordinates depends on the
method of standardization that is chosen (see also the Introductory
Overview); however, the quality of representation and relative inertia
values shown in the spreadsheets above are not affected by the chosen
method of standardization.
The coordinates can be computed based either on the matrix of relative
row frequencies (Row profiles
standardization; the analysis is based on the so-called row profile matrix,
where the relative frequencies within each row, across the
columns, sum to 1.0), or on the relative column frequencies (Column
profiles standardization; the analysis is based on the so-called
column profile matrix, where the relative frequencies within
each column, across the rows, sum to 1.0). In most cases, the Row
& column profiles standardization (the default) is most appropriate.
In that case, the Euclidean distances between the row points
and the distances between the column points can be interpreted in a meaningful
manner (i.e., the distances between the points are Chi-square
distances; see the Introductory
Overview). However, note that the distances between the row and column
points have no meaningful interpretation, regardless of standardization.
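To make the distance interpretation concrete, the following NumPy sketch compares the Chi-square distance between two row profiles with the Euclidean distance between the corresponding row points. It assumes the Greenacre (1984) cell counts (not listed in this topic) and assumes that the Row & column profiles standardization corresponds to what is often called principal coordinates; treat it as an illustration, not as the module's implementation.

```python
import numpy as np

# Assumed cell counts (Greenacre, 1984); verify against Smoking.sta.
N = np.array([
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

F = (U[:, :3] * sv[:3]) / np.sqrt(r)[:, None]   # row points, three dimensions
profiles = P / r[:, None]                       # row profiles (each sums to 1.0)


def chi2_distance(a, b):
    """Chi-square distance between two row profiles (weights 1 / column mass)."""
    return np.sqrt(((a - b) ** 2 / c).sum())


i, j = 2, 4  # Senior Employees and Secretaries
d_profiles = chi2_distance(profiles[i], profiles[j])
d_full = np.linalg.norm(F[i] - F[j])            # all three dimensions
d_2d = np.linalg.norm(F[i, :2] - F[j, :2])      # two-dimensional solution

print(f"Chi-square distance between profiles: {d_profiles:.5f}")
print(f"Euclidean distance, 3 dimensions:     {d_full:.5f}")
print(f"Euclidean distance, 2 dimensions:     {d_2d:.5f}")
```

In the full three-dimensional space the two distances coincide; the two-dimensional distance is an approximation from below.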
Reviewing the row and column coordinates.
The best way to quickly review the row and column coordinates is to plot
them. On the Advanced tab, click the Row
& col.  2D button under
Plots of coordinates. A 2D
scatterplot will be displayed, simultaneously showing the row and
column points in the two dimensions (see also Greenacre, 1984, p. 66).
To reiterate, direct comparisons between row and column points are not
meaningful. However, you can make meaningful interpretations of the general
locations of row and column points, and their relations within each type
of point. For example, if you review the 2D graph of the row and column
points, you can see that the first (horizontal) dimension "accounts
for" most of the inertia. It is, therefore, the most "important"
dimension, explaining most of the differences between the patterns of
relative frequencies in the rows of the table, and in the columns of the
table. This dimension is characterized by None
smokers on the left, and Light,
Medium, and Heavy
smokers to the right; the row points that are farthest to the left on
this axis are the Senior Employees
and Secretaries. This would suggest
that much of the total inertia is due to the difference between nonsmokers
and smokers, and that there are relatively more nonsmokers among
Senior Employees and Secretaries.
Reviewing tables of relative frequencies.
You can easily verify this interpretation by reviewing the tables of relative
frequencies. On the Review tab, click the Row
percentages button and then the Column
percentages button.
The relative row and column frequencies shown in these spreadsheets
support the interpretation of the first dimension: There is a relatively
large percentage of None smokers
among Senior Employees and Secretaries. This makes the respective
row profiles in the table of relative row frequencies, and the respective
column profile (None) in the
table of relative column frequencies, different from all the others.
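The row percentages can also be reproduced with a few lines of Python. The sketch assumes the Greenacre (1984) cell counts, which are not listed in this topic and should be verified against Smoking.sta.

```python
# Assumed cell counts (Greenacre, 1984); verify against Smoking.sta.
GROUPS = ["Senior Managers", "Junior Managers", "Senior Employees",
          "Junior Employees", "Secretaries"]
CATEGORIES = ["None", "Light", "Medium", "Heavy"]
TABLE = [
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
]

# Row percentages: each cell as a percentage of its row total.
row_pct = {g: [100 * x / sum(row) for x in row] for g, row in zip(GROUPS, TABLE)}

print(f"{'':17s}" + "".join(f"{c:>8s}" for c in CATEGORIES))
for g in GROUPS:
    print(f"{g:17s}" + "".join(f"{p:8.1f}" for p in row_pct[g]))

# Staff groups ordered by their percentage of None smokers, highest first.
ranked = sorted(GROUPS, key=lambda g: row_pct[g][0], reverse=True)
print("Highest share of None smokers:", ranked[:2])
```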
Supplementary
points. An important aspect of correspondence analysis is the ability to
represent row and/or column points that were not part of the original analysis
in the same coordinate system as the regular points (see also the Introductory
Overview). Greenacre (1984, Table 3.5) provides an example of this
procedure, in the context of this data set. Specifically, suppose you
had available information about the national averages concerning the different
categories of smoking, and information about the number of employees in
each staff group that did or did not consume alcohol.

                    Smoking Category
                    None    Light   Medium  Heavy
National Average    42%     29%     20%     9%

                    Alcohol
Staff Group         Yes     No
Senior Managers     0       11
Junior Managers     1       17
Senior Employees    5       46
Junior Employees    10      78
Secretaries         7       18
Specifying a supplementary row.
On the Supplementary points tab, click
the Add row points button. The
Supplementary
Row Points dialog box is displayed where you can specify the supplementary
row points. Remember that in row profile standardization, the analysis
is performed on the relative row frequencies, which will sum to 1.0. Thus,
it does not matter whether you enter 42
or .42 (i.e., percentages or
proportions); the results will be the same either way.
To enter a supplementary row, first type a name or label for the row
into the first column of the spreadsheet (e.g., type Average).
Next type in the values 42, 29, 20,
and 9 under the respective column
headers None, Light,
Medium, and Heavy.
To accept these values, exit the dialog box by clicking the OK
button; if you exit the dialog box by closing it or clicking the Cancel button, your entries will be
discarded.
Specifying supplementary columns.
Next, click the Add column points
button, and enter the supplementary column frequencies for Alcohol
(Yes and No) from the table above. Click the OK button.
Reviewing statistics for supplementary
points. After you specify the supplementary rows and columns, whenever
you select any of the plots of the coordinates, or the spreadsheets
of row and column coordinates, the resulting displays will incorporate
the results for the supplementary rows and columns. For example, when
you click the Row and column coordinates button on the Advanced tab,
the spreadsheets show the coordinate values and related statistics for
the supplementary points, along with the statistics for the standard
row and column points reviewed earlier.
The interpretation of these statistics is the same as that for the points
that were used to perform the analysis (see also the Introductory
Overview). It appears that the two-dimensional solution represents
the new row point Average (i.e.,
national average) very well (the Quality
is .7613). The new column points
are not quite as well represented; however, over 40% of the total
squared (weighted) distance of these points from the origin in the space
defined by the maximum number of dimensions is still "accounted for"
by the two-factor solution (the Quality
is equal to .4386 for both supplementary
column points).
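One way to see where these numbers come from is to project the supplementary profiles onto the axes of the original solution. The NumPy sketch below assumes the Greenacre (1984) cell counts (not listed in this topic) and the usual transition formula, in which a supplementary profile is multiplied by the standard coordinates of the other set of points; the helper `supplementary` is hypothetical. The sign of each axis is arbitrary in an SVD, so coordinates may differ in sign from the spreadsheets.

```python
import numpy as np

# Assumed cell counts (Greenacre, 1984); verify against Smoking.sta.
N = np.array([
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

row_std = U[:, :3] / np.sqrt(r)[:, None]      # standard row coordinates
col_std = Vt.T[:, :3] / np.sqrt(c)[:, None]   # standard column coordinates


def supplementary(values, std_coords, masses):
    """Return (coordinates, two-dimensional quality) for a supplementary point."""
    profile = np.asarray(values, dtype=float)
    profile = profile / profile.sum()          # 42 or .42 gives the same profile
    coords = profile @ std_coords
    # Squared Chi-square distance of the profile from the average profile.
    sq_dist = ((profile - masses) ** 2 / masses).sum()
    return coords, (coords[:2] ** 2).sum() / sq_dist


avg_coords, avg_quality = supplementary([42, 29, 20, 9], col_std, c)
yes_coords, yes_quality = supplementary([0, 1, 5, 10, 7], row_std, r)
no_coords, no_quality = supplementary([11, 17, 46, 78, 18], row_std, r)

print("Average:    ", np.round(avg_coords[:2], 4), f"quality = {avg_quality:.4f}")
print("Alcohol Yes:", np.round(yes_coords[:2], 4), f"quality = {yes_quality:.4f}")
print("Alcohol No: ", np.round(no_coords[:2], 4), f"quality = {no_quality:.4f}")
```

The two Alcohol columns add up to the row totals, so their profiles deviate from the average profile in exactly opposite directions; this is why both points have the same Quality.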
At this point, you may want to try to enter as supplementary row and
column points the respective column and row totals for the entire table.
You will see that those points will be represented by coordinates that
are equal to 0 for all dimensions. This illustrates that the space defined
by the two dimensions is weighted by the respective column and row totals,
which define the origin of the coordinate system. Thus, you could interpret
the distances of the points from the origin as (Chi-square)
distances from the respective column and row totals.
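The claim that the totals fall on the origin can be checked numerically. The NumPy sketch below assumes the Greenacre (1984) cell counts (not listed in this topic) and projects the column totals as a supplementary row, and the row totals as a supplementary column, using the standard coordinates of the other set of points.

```python
import numpy as np

# Assumed cell counts (Greenacre, 1984); verify against Smoking.sta.
N = np.array([
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

row_std = U[:, :3] / np.sqrt(r)[:, None]      # standard row coordinates
col_std = Vt.T[:, :3] / np.sqrt(c)[:, None]   # standard column coordinates

totals_as_row = N.sum(axis=0)   # column totals, entered as a supplementary row
totals_as_col = N.sum(axis=1)   # row totals, entered as a supplementary column

row_point = (totals_as_row / totals_as_row.sum()) @ col_std
col_point = (totals_as_col / totals_as_col.sum()) @ row_std

# Both points coincide with the origin, up to floating-point rounding.
print(np.abs(row_point).max(), np.abs(col_point).max())
```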
Plots with supplementary
points. Now produce the combined 2D scatterplot again, for
both the row and column points. Click the Row
& col.  2D button under
Plots of coordinates on the Advanced tab.
The supplementary row point for the national Average
will be plotted on the left side of the origin for the horizontal axis
(the coordinate value is .2584;
see the first table shown above). Thus, one may infer that there are relatively
more None smokers on average
in the nation than there are in the current sample.
The supplementary column points Alcohol
Yes and Alcohol No approximately
line up along the second axis, which also appears to distinguish between
different degrees of smoking, i.e., Light,
Medium, and Heavy
(as mentioned above, the first axis appears to distinguish between None smokers and smokers). Thus, there
is some indication that Heavy
smokers are also more likely to consume alcohol (specifically, the pattern
of frequencies across the staff groups for Alcohol
is more similar to the pattern of frequencies for the Heavy
and Medium smokers). However,
remember that correspondence analysis is primarily a descriptive and/or
exploratory
technique to represent categorical data in graphical displays, and
no claims of statistical significance are implied (see the Introductory
Overview; see also Elementary
Concepts in Statistics).
See also Correspondence Analysis - Index.