Canonical Analysis - Assumptions

The following discussion provides a list of the most important assumptions of canonical correlation analysis and the major threats to the reliability and validity of results.

Distributions. The tests of significance of the canonical correlations are based on the assumption that the distributions of the variables in the population (from which the sample was drawn) are multivariate normal. STATISTICA allows you to graph the variables in the analysis, that is, to produce frequency histograms with the normal curve superimposed, and scatterplots. Little is known about the effects of violations of the multivariate normality assumption. However, with a sufficiently large sample size (see below) the results from canonical correlation analysis are usually quite robust.
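The histogram check described above can also be sketched outside of STATISTICA. The following is only an illustration using NumPy, not the program's own procedure; the function name histogram_vs_normal, the bin count, and the simulated samples are assumptions made for this example. It compares the density of a frequency histogram with a normal curve fitted to the sample mean and standard deviation. Keep in mind that normal-looking individual variables are necessary but not sufficient for multivariate normality.

```python
import numpy as np


def histogram_vs_normal(values, bins=20):
    """Return bin centers, histogram density, and the fitted normal density.

    Hypothetical helper for illustration; not part of any statistics package.
    """
    values = np.asarray(values, dtype=float)
    density, edges = np.histogram(values, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mean = values.mean()
    std = values.std(ddof=1)
    # Normal curve with the sample mean and standard deviation,
    # evaluated at the center of each histogram bin.
    normal_pdf = np.exp(-0.5 * ((centers - mean) / std) ** 2) / (
        std * np.sqrt(2.0 * np.pi)
    )
    return centers, density, normal_pdf


# Simulated data (assumption): one roughly normal and one clearly skewed variable.
rng = np.random.default_rng(0)
normal_sample = rng.normal(size=5000)
skewed_sample = rng.exponential(size=5000)

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    _, density, normal_pdf = histogram_vs_normal(sample)
    gap = np.max(np.abs(density - normal_pdf))
    print(f"{name}: largest gap between histogram and normal curve = {gap:.3f}")
```

A large gap for a variable suggests that a closer look at its histogram is warranted before relying on the significance tests.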

Sample sizes. Stevens (1986) provides a very thorough discussion of the sample sizes that should be used in order to obtain reliable results. As mentioned earlier, if there are strong canonical correlations in the data (e.g., R > .7), then even relatively small samples (e.g., n = 50) will detect them most of the time. However, to arrive at reliable estimates of the canonical factor loadings (for interpretation), Stevens recommends at least 20 times as many cases as variables in the analysis if one wants to interpret only the most significant canonical root. To arrive at reliable estimates for two canonical roots, Barcikowski and Stevens (1975) recommend, based on a Monte Carlo study, including 40 to 60 times as many cases as variables.
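The rules of thumb above amount to simple arithmetic, which can be sketched as follows. The function name recommended_cases is hypothetical, and the sketch assumes that "variables in the analysis" means the total number of variables across both sets. It covers only the two situations for which the text gives guidance.

```python
def recommended_cases(n_variables, n_roots=1):
    """Return (minimum, upper guideline) for the number of cases.

    Hypothetical helper: encodes the 20x rule for one canonical root and
    the 40x to 60x range for two roots, as quoted in the text.
    Assumes n_variables counts the variables in both sets together.
    """
    if n_variables < 1:
        raise ValueError("n_variables must be at least 1")
    if n_roots == 1:
        # At least 20 cases per variable; no upper guideline is given.
        return 20 * n_variables, None
    if n_roots == 2:
        # 40 to 60 cases per variable.
        return 40 * n_variables, 60 * n_variables
    raise ValueError("the text gives guidance for one or two roots only")


print(recommended_cases(10, n_roots=1))  # (200, None)
print(recommended_cases(10, n_roots=2))  # (400, 600)
```

For example, an analysis with 10 variables in total would call for at least 200 cases to interpret the first root, and roughly 400 to 600 cases to interpret two roots.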

Outliers. Outliers can greatly affect the magnitudes of correlation coefficients (see Basic Statistics and Tables). Since canonical correlation analysis is based on (computed from) correlation coefficients, outliers can also seriously affect the canonical correlations. Of course, the larger the sample size, the smaller the impact of one or two outliers. However, it is a good idea to examine the various scatterplots available in the Canonical Correlation module to detect possible outliers. Note that scatterplots can be produced not only for variables, but also for canonical variates.
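The effect of a single outlier on a correlation coefficient, and how it shrinks with sample size, can be illustrated with a small simulation. This is a sketch using NumPy with simulated, uncorrelated data; the function name correlation_shift, the outlier position (10, 10), and the sample sizes are assumptions chosen for the example.

```python
import numpy as np


def correlation_shift(n, outlier=(10.0, 10.0), seed=1):
    """Return the correlation without and with one added outlier.

    Hypothetical helper: draws n uncorrelated standard normal pairs,
    then appends a single extreme point and recomputes Pearson's r.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = rng.normal(size=n)
    r_clean = np.corrcoef(x, y)[0, 1]
    x_out = np.append(x, outlier[0])
    y_out = np.append(y, outlier[1])
    r_outlier = np.corrcoef(x_out, y_out)[0, 1]
    return r_clean, r_outlier


for n in (30, 3000):
    r_clean, r_outlier = correlation_shift(n)
    print(f"n = {n}: r without outlier = {r_clean:.3f}, with outlier = {r_outlier:.3f}")
```

With 30 cases the single extreme point produces a substantial correlation between otherwise unrelated variables, while with 3000 cases the same point changes the coefficient only slightly.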

Matrix ill-conditioning. One assumption is that the variables in the two sets should not be completely redundant. For example, if you included the same variable twice in one of the sets, then it is not clear how to assign different weights to each of them. Computationally, such complete redundancies will "upset" the canonical correlation analysis. When there are perfect correlations in the correlation matrix, or if any of the multiple correlations between one variable and the others is perfect (R = 1.0), then the correlation matrix cannot be inverted, and the computations for the canonical analysis cannot be performed. Such correlation matrices are said to be ill-conditioned.

Once again, this assumption appears trivial on the surface; however, it is often "almost" violated when the analysis includes very many highly redundant measures, as is often the case when analyzing questionnaire responses. In extreme cases, the program will "refuse" to perform the analysis and will issue a corresponding error message ("matrix ill-conditioned...").
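One way to screen for complete or near-complete redundancy before running the analysis is to inspect the correlation matrix directly. The sketch below uses NumPy and simulated data; it is not the check that the program itself performs, and the function name correlation_condition and the example data sets are assumptions. A very large condition number, or a smallest eigenvalue close to zero, indicates that the matrix is singular or nearly so.

```python
import numpy as np


def correlation_condition(data):
    """Return the condition number and smallest eigenvalue of the
    correlation matrix of `data` (cases in rows, variables in columns).

    Hypothetical helper for illustration only.
    """
    corr = np.corrcoef(data, rowvar=False)
    smallest_eigenvalue = np.linalg.eigvalsh(corr).min()
    return np.linalg.cond(corr), smallest_eigenvalue


rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))

# Same variable included twice: complete redundancy.
duplicated = np.column_stack([base, base[:, 0]])
# Almost the same variable: the assumption is "almost" violated.
near_duplicate = np.column_stack(
    [base, base[:, 0] + 1e-3 * rng.normal(size=200)]
)

for name, data in [
    ("independent", base),
    ("near-duplicate", near_duplicate),
    ("duplicated", duplicated),
]:
    cond, smallest = correlation_condition(data)
    print(f"{name}: condition number = {cond:.3g}, smallest eigenvalue = {smallest:.3g}")
```

The independent variables give a small condition number, the near-duplicate gives a very large but finite one, and the exact duplicate gives a matrix that is singular to within floating-point precision.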

Note that for large (multi-variable) problems, when there are few cases, the General Partial Least Squares Models (PLS) module can also be used to build a model for the data.