Example 8: Multiple Regression
This example is based on the data file Poverty.sta
that is included with your STATISTICA
program. Open this data file via the File - Open Examples
menu; it is in the Datasets folder. Refer to Example
7, which demonstrates simple regression analysis, for a description of the
data file. See also the Introductory
Overview of Multiple
Regression for a discussion of these methods.
For this example, several possible correlates of poverty will be analyzed,
and the relative degree to which each predicts the percent of families
below the poverty line in a county will be determined. Thus, you will
treat variable 3 (Pt_Poor) as
the dependent or criterion variable, and the remaining variables will
be treated as continuous predictor variables.
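Conceptually, the analysis specified here is an ordinary least squares fit of Pt_Poor on the six continuous predictors. The following sketch shows what that fit computes; the variable names follow the Poverty.sta file, but the data below are synthetic stand-ins, not the actual file values, and this is an illustration rather than STATISTICA's implementation:

```python
import numpy as np

# Synthetic stand-ins for the Poverty.sta variables (made-up numbers).
rng = np.random.default_rng(0)
n = 30
X = rng.normal(size=(n, 6))   # POP_CHNG, N_EMPLD, TAX_RATE, PT_PHONE, PT_RURAL, AGE
y = X @ np.array([-0.5, 0.3, 0.1, -0.2, 0.6, 0.05]) + rng.normal(scale=0.5, size=n)  # PT_POOR

# Ordinary least squares: add an intercept column and solve for the B coefficients.
X1 = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("intercept:", b[0])
print("B coefficients:", b[1:])
```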
Specifying the analysis.
To perform the analysis, select General Linear Models from the Statistics - Advanced
Linear/Nonlinear Models menu to display the General Linear Models (GLM) Startup Panel.
Select Multiple regression as
the Type of analysis, Quick
specs dialog as the Specification
method, and then click the OK
button to display the GLM Multiple Regression Quick Specs dialog.
Click the Variables button to
display the standard variable
selection dialog. Here, select Pt_Poor
as the Dependent variable
and the remaining variables as the Predictor
variables, and then click the OK
button to return to the GLM Multiple
Regression Quick Specs dialog.
To view the syntax
program automatically generated from the dialog specifications, click
the Syntax editor button on the GLM Multiple Regression
Quick Specs dialog to display the
Analysis Syntax Editor dialog.
    = POP_CHNG N_EMPLD TAX_RATE PT_PHONE PT_RURAL AGE;
    = POP_CHNG + N_EMPLD + TAX_RATE + PT_PHONE + PT_RURAL + AGE;
The remainder of the specifications for this analysis can use the default
specifications, so click the OK (Run)
button on the GLM Analysis Syntax Editor dialog or the
OK button on the GLM Multiple Regression Quick Specs dialog
to perform the analysis.
Regression coefficients. When
the Results dialog is displayed, click the Coefficients
button on the Summary tab. In order to learn
which of the independent
variables contribute most to the prediction of poverty, examine the
unstandardized regression (or B) coefficients and the standardized regression
(or Beta) coefficients.
The Beta coefficients are the
coefficients you would have obtained had you first standardized all of
your variables to a mean of 0 and a standard deviation of 1. Thus, the
magnitude of these Beta coefficients
allows you to compare the relative contribution of each independent variable
in the prediction of the dependent variable. As is evident in the spreadsheet
shown above, variables Pop_Chng,
Pt_Rural, and N_Empld
are the most important predictors of poverty; of those, only the first
two variables are statistically significant (their 95% confidence interval
limits do not include 0). The regression coefficient for Pop_Chng
is negative; the less the population increased, the greater the number
of families who lived below the poverty level in the respective county.
The regression weight for Pt_Rural
is positive; the greater the percent of rural population, the greater
the poverty level.
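The relationship between the B and Beta coefficients can be sketched directly: Beta is B rescaled by the ratio of the predictor's standard deviation to the dependent variable's. This snippet (synthetic data, not the Poverty.sta values) computes Beta both ways and confirms they agree:

```python
import numpy as np

# Beta coefficients equal the B coefficients after standardizing all variables
# to mean 0, SD 1: beta_j = b_j * sd(x_j) / sd(y).  Synthetic data for illustration.
rng = np.random.default_rng(1)
n = 50
X = rng.normal(loc=[0, 10], scale=[1, 5], size=(n, 2))   # two predictors on different scales
y = 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), X])
b = np.linalg.lstsq(X1, y, rcond=None)[0][1:]            # raw B coefficients
beta = b * X.std(axis=0, ddof=1) / y.std(ddof=1)         # rescale to Beta

# Fitting the z-scored variables directly gives the same Beta values.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
beta_direct = np.linalg.lstsq(np.column_stack([np.ones(n), Z]), zy, rcond=None)[0][1:]
print("Beta (rescaled):", beta)
print("Beta (direct fit):", beta_direct)
```

Because the predictors sit on very different scales here, the raw B values are not comparable, but the Beta values are.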
Significance of regressor effects.
Click the Univariate results
button to display a spreadsheet containing tests of significance.
As this spreadsheet shows, only the Pop_Chng
and Pt_Rural effects are statistically
significant, p's < .05.
Residual analysis. After fitting a regression equation, one should always examine the predicted
and residual scores. For example, extreme outliers
may seriously bias results and lead to erroneous conclusions. From the
Results dialog, click the More results
button and then the Residuals
1 tab to reveal the options for analysis of residuals.
Casewise plot of residuals.
Usually, one should examine the pattern of the raw or standardized residuals
to identify any extreme outliers. For this example, select Standardized
under Resids for default plots.
Then click the Case no. & res.
button to display the following graph with a casewise plot of residuals.
The scale used for the vertical axis of the casewise plot is in terms
of sigma, that is, the standard
deviation of residuals. If one or several cases fall outside the ±3
sigma limits, you should
probably exclude the respective cases (which is easily accomplished via
selection conditions) and rerun the analysis to make sure that key
results were not biased by these outliers.
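The ±3 sigma screening rule can be expressed in a few lines. This is a hedged sketch of the rule on synthetic data with one deliberately contaminated case, not STATISTICA's implementation:

```python
import numpy as np

# Flag cases whose standardized residual falls outside +/- 3 sigma.
rng = np.random.default_rng(2)
n = 40
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)
y[5] += 5.0                                   # plant an outlier at case 5

X1 = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X1, y, rcond=None)[0]
resid = y - X1 @ b
std_resid = resid / resid.std(ddof=2)         # residual SD with 2 fitted parameters
outliers = np.flatnonzero(np.abs(std_resid) > 3)
print("cases beyond ±3 sigma:", outliers)
```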
Mahalanobis distances. Most
statistics textbooks devote some discussion to the issue of outliers
and residuals concerning the dependent
variable. However, the role of outliers in the predictor variables
is often overlooked. On the predictor variable side, you have a list of
variables that participate with different weights (the regression coefficients)
in the prediction of the dependent variable. One can think of the independent
variables as defining a multidimensional space in which each observation
can be located. For example, if you had two independent variables with
equal regression coefficients, then you could construct a scatterplot
of those two variables, and place each observation in that plot. You could
then plot one point for the mean on both variables and compute the distances
of each observation from this mean (now called the centroid) in the two-dimensional
space; this is the conceptual idea behind the computation of the Mahalanobis
distance. Now, look at those distances to identify extreme cases on
the predictor variable side. Click on the Residuals 2 tab and then select
Mah. Dis. from the
X (var/pred/res) box. Next, click the Histogram
of selected X (variable, predicted, or residual value) button to
display a histogram of the distribution of Mahalanobis distances.
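The computation described above can be written compactly: each case's squared distance from the centroid is measured in the metric of the predictors' covariance matrix. A sketch with synthetic data and one planted predictor-space outlier (an assumption for illustration):

```python
import numpy as np

# Mahalanobis distance of each case from the predictor centroid:
# d_i^2 = (x_i - xbar)' S^{-1} (x_i - xbar), with S the predictor covariance matrix.
rng = np.random.default_rng(3)
n = 50
X = rng.normal(size=(n, 3))
X[0] = [6.0, -6.0, 6.0]                       # plant a predictor-space outlier

centroid = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - centroid
d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)   # squared distances
d = np.sqrt(d2)
print("largest Mahalanobis distance at case", int(np.argmax(d)))
```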
It appears that there is one outlier case on Mahalanobis distances.
To identify this case, click on the Residuals 1 tab, select Mahalanobis
distance in the Sort obs by
drop-down box, then click the Predicted
and residuals button to display the Observed,
Predicted, and Residual Values spreadsheet.
Note that Shelby county (in
the first line) appears somewhat extreme as compared to the other counties
in the spreadsheet. If you look at the raw data you will find that, indeed,
Shelby county is by far the largest
county in the data file with many more persons employed in agriculture
(variable N_Empld). Probably,
it would have been wise to express those numbers in percentages rather
than in absolute numbers, and in that case, the Mahalanobis distance of
Shelby county from the other
counties in the sample would probably not have been as large. As it stands,
however, Shelby county is clearly an outlier.
Deleted residuals. Another very
important statistic that allows one to evaluate the seriousness of the
outlier problem is the deleted
residual. This is the standardized residual for the respective case
that one would obtain if the case were excluded from the analysis. Remember
that the multiple regression procedure fits a regression surface to express
the relationship between the dependent and predictor variables. If one
case is clearly an outlier (as is Shelby
county in this data), then there is a tendency for the regression surface
to be "pulled" by this outlier so as to account for it as much
as possible. As a result, if the respective case were excluded, a completely
different surface (and B coefficients) would emerge. Therefore, if the
deleted residual is grossly different from the standardized residual,
you have reason to believe that the regression analysis is seriously biased
by the respective case. In this example, the deleted residual for Shelby
county indicates that this case seriously
affects the analysis. You can plot the residuals against the deleted residuals
by first selecting the Raw option
button under Resids for default plots
and then clicking the Res. & del. res.
button, which produces a scatterplot
of these values. The scatterplot below clearly shows the outlier; note
that in order to label the outlier (Shelby),
click the toolbar button to display the Brushing 2D dialog, and then use the Label option to mark the respective point.
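The deleted residual need not be computed by literally refitting the model n times: one common formulation divides the ordinary residual by one minus the case's leverage. The sketch below shows the raw (unstandardized) version of this quantity on synthetic data; STATISTICA's exact standardization may differ:

```python
import numpy as np

# Deleted residual for case i without refitting: e_(i) = e_i / (1 - h_ii),
# where h_ii is the leverage from the hat matrix H = X (X'X)^{-1} X'.
rng = np.random.default_rng(4)
n = 30
x = rng.normal(size=n)
x[0] = 4.0                                    # high-leverage case...
y = 1.0 + 0.5 * x + rng.normal(scale=0.4, size=n)
y[0] += 3.0                                   # ...that is also an outlier

X1 = np.column_stack([np.ones(n), x])
H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
e = y - H @ y                                 # ordinary residuals
e_del = e / (1.0 - np.diag(H))                # deleted residuals

# Check against a literal leave-one-out refit for case 0.
b0 = np.linalg.lstsq(X1[1:], y[1:], rcond=None)[0]
print("deleted residual:", e_del[0], "leave-one-out:", y[0] - X1[0] @ b0)
```

A deleted residual that is grossly larger than the ordinary residual, as for case 0 here, signals a case that is pulling the fitted surface toward itself.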
Normal probability plots. There
are many additional graphs available from the Residuals
tabs. Most of them are more or less straightforward in their interpretation;
however, the normal
probability plots will be commented on here.
As previously mentioned, multiple linear regression assumes linear relationships
between the variables in the equation, and the normal
distribution of residuals. If these assumptions are violated, your
final conclusion may not be accurate. The normal probability plot of residuals
will give you an indication of whether or not gross violations of the
assumptions have occurred. Click the Normal
button under Probab. plot of resids
to produce this plot.
This plot is constructed as follows. First the standardized residuals
are rank ordered. From these ranks, z values can be computed (i.e., standard
values of the normal
distribution) based on the assumption that the data come from a normal
distribution. These z values are plotted on the y-axis in the plot.
If the observed residuals (plotted on the x-axis) are normally distributed,
then all values should fall onto a straight line in the plot; in this
plot, all points follow the line very closely. If the residuals are not
normally distributed, then they will deviate from the line. Outliers
may also become evident in this plot.
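The construction described above can be reproduced in a few lines: rank the residuals, convert each rank to an expected normal quantile, and pair it with the observed value. The Blom plotting position used below is an assumption for illustration, since packages differ in their choice:

```python
import numpy as np
from statistics import NormalDist

# Coordinates of a normal probability plot for a set of residuals.
rng = np.random.default_rng(5)
resid = rng.normal(size=100)                  # synthetic residuals

order = np.argsort(resid)
ranks = np.empty(len(resid))
ranks[order] = np.arange(1, len(resid) + 1)

# Blom's plotting position (an assumed convention) maps ranks to probabilities,
# then the inverse normal CDF gives the expected z values for the y-axis.
p = (ranks - 0.375) / (len(resid) + 0.25)
z = np.array([NormalDist().inv_cdf(pi) for pi in p])

# Normally distributed residuals fall close to a straight line, so the
# correlation between observed values and expected quantiles is near 1.
r = np.corrcoef(resid, z)[0, 1]
print("correlation with expected normal quantiles:", round(r, 3))
```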
If there is a general lack
of fit, and the data seem to form a clear pattern (e.g., an S shape)
around the line, then the dependent
variable may have to be transformed in some way (e.g., a log transformation
to "pull in" the tail of the distribution, etc.). A discussion
of such techniques is beyond the scope of this example (Neter, Wasserman,
and Kutner, 1985, pp. 134-141, present an excellent discussion of transformations
as remedies for non-normality and non-linearity); however, too often researchers
simply accept their data at face value without ever checking for the appropriateness
of their assumptions, leading to erroneous conclusions. For that reason,
one design goal of the GLM module was to make residual
(graphical) analysis as easy and accessible as possible.
See also GLM - Index.