Example 8: Multiple Regression Analysis

This example is based on the data file Poverty.sta that is included with your STATISTICA program. Open this data file via the File - Open Examples menu; it is in the Datasets folder. Refer to Example 7, which demonstrates simple regression analysis, for a description of the data file. See also the Introductory Overview of Multiple Regression for a discussion of these methods.

For this example, several possible correlates of poverty will be analyzed to determine the relative degree to which each predicts the percent of families below the poverty line in a county. Thus, you will treat variable 3 (Pt_Poor) as the dependent or criterion variable, and the remaining variables will be treated as continuous predictor variables.

Specifying the analysis. To perform the analysis, select General Linear Models from the Statistics - Advanced Linear/Nonlinear Models menu to display the General Linear Models (GLM) Startup Panel. Select Multiple regression as the Type of analysis, Quick specs dialog as the Specification method, and then click the OK button to display the GLM Multiple Regression Quick Specs dialog. Click the Variables button to display the standard variable selection dialog. Here, select Pt_Poor as the dependent variable (in the Dependent variable list) and the remaining variables as the Predictor variables, and then click the OK button to return to the GLM Multiple Regression Quick Specs dialog.

To view the syntax program automatically generated from the dialog specifications, click the Syntax editor button on the GLM Multiple Regression Quick Specs dialog to display the GLM Analysis Syntax Editor dialog.

GLM;

The remainder of the specifications for this analysis can use the default specifications, so click the OK (Run) button on the GLM Analysis Syntax Editor dialog or the OK button on the GLM Multiple Regression Quick Specs dialog to perform the analysis.

Reviewing Results.

Regression coefficients. When the GLM Results dialog comes up, click the Coefficients button on the Summary tab. In order to learn which of the independent variables contribute most to the prediction of poverty, examine the unstandardized regression (or B) coefficients and the standardized regression (or Beta) coefficients.

The Beta coefficients are the coefficients you would have obtained had you first standardized all of your variables to a mean of 0 and a standard deviation of 1. Thus, the magnitude of these Beta coefficients allows you to compare the relative contribution of each independent variable in the prediction of the dependent variable. As is evident in the resulting spreadsheet, variables Pop_Chng, Pt_Rural, and N_Empld are the most important predictors of poverty; of those, only the first two variables are statistically significant (their 95% confidence interval limits do not include 0). The regression coefficient for Pop_Chng is negative; the less the population increased, the greater the percent of families who lived below the poverty level in the respective county. The regression weight for Pt_Rural is positive; the greater the percent of rural population, the greater the poverty level.
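The relationship between B and Beta coefficients can be checked outside of STATISTICA. The following is a minimal Python sketch, assuming NumPy is available; it uses synthetic data (not Poverty.sta), and the helper names ols and zscore are hypothetical. It illustrates that refitting on standardized variables gives the same Beta values as rescaling each B coefficient by the ratio of the predictor's standard deviation to that of the dependent variable.

```python
import numpy as np

def ols(X, y):
    """Least-squares fit of y on an intercept plus the columns of X.
    Returns (intercept, slopes)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]

def zscore(a):
    """Standardize to mean 0 and standard deviation 1 (column-wise for 2D input)."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

# Synthetic data (an assumption for illustration): three predictors on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3)) * np.array([1.0, 10.0, 100.0])
y = 2.0 - 0.5 * X[:, 0] + 0.03 * X[:, 1] + 0.001 * X[:, 2] + rng.normal(size=30)

# Unstandardized (B) coefficients.
_, b = ols(X, y)

# Beta coefficients obtained by standardizing all variables first.
_, beta_from_z = ols(zscore(X), zscore(y))

# The same Beta coefficients obtained by rescaling B.
beta_from_b = b * X.std(axis=0, ddof=1) / y.std(ddof=1)

print("B:   ", b)
print("Beta:", beta_from_z)
```

Because the B coefficients depend on the measurement units of each predictor, only the Beta values are comparable across predictors in this sketch.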

Significance of regressor effects. Click the Univariate results button to display a spreadsheet containing tests of significance.

As this spreadsheet shows, only the Pop_Chng and Pt_Rural effects are statistically significant, p's < .05.

Residual Analysis. After fitting a regression equation, one should always examine the predicted and residual scores. For example, extreme outliers may seriously bias results and lead to erroneous conclusions. From the GLM Results dialog, click the More results button and then the Residuals 1 tab to reveal the options for analysis of residuals.

Casewise plot of residuals. Usually, one should examine the pattern of the raw or standardized residuals to identify any extreme outliers. For this example, select Standardized under Resids for default plots. Then click the Case no. & res. button to display the following graph with a casewise plot of residuals.

The scale used for the vertical axis of the casewise plot is in terms of sigma, that is, the standard deviation of residuals. If one or several cases fall outside of the ± 3 times sigma limits, one should probably exclude the respective cases (which is easily accomplished via selection conditions) and run the analysis over to make sure that key results were not biased by these outliers.
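The ± 3 sigma screening described above can be sketched in Python as follows, assuming NumPy is available. The data are synthetic with one artificially injected outlier, the function name standardized_residuals is hypothetical, and the residual standard deviation is assumed to be the square root of the residual mean square; STATISTICA's exact computation may differ.

```python
import numpy as np

def standardized_residuals(X, y):
    """Raw residuals divided by the residual standard deviation (sigma).
    Assumption: sigma = sqrt(SS_residual / (n - number of fitted coefficients))."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = len(y) - design.shape[1]
    sigma = np.sqrt(resid @ resid / dof)
    return resid / sigma

# Synthetic data (an assumption for illustration) with one injected outlier.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(scale=0.2, size=40)
y[7] += 5.0  # case 7 is made extreme on purpose

z = standardized_residuals(X, y)
flagged = np.flatnonzero(np.abs(z) > 3)  # cases outside the +/- 3 sigma limits
print("Flagged cases:", flagged)

# Rerun the analysis without the flagged cases to check whether key results change.
keep = np.abs(z) <= 3
z_refit = standardized_residuals(X[keep], y[keep])
```

The boolean mask keep plays the role of a selection condition here: it excludes the flagged cases before the model is fitted again.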

Mahalanobis distances. Most statistics textbooks devote some discussion to the issue of outliers and residuals concerning the dependent variable. However, the role of outliers in the predictor variables is often overlooked. On the predictor variable side, you have a list of variables that participate with different weights (the regression coefficients) in the prediction of the dependent variable. One can think of the independent variables as defining a multidimensional space in which each observation can be located. For example, if you had two independent variables with equal regression coefficients, then you could construct a scatterplot of those two variables, and place each observation in that plot. You could then plot one point for the mean on both variables and compute the distances of each observation from this mean (now called the centroid) in the two-dimensional space; this is the conceptual idea behind the computation of the Mahalanobis distance. Now, look at those distances to identify extreme cases on the predictor variable side. Click on the Residuals 2 tab and then select Mah. Dis. from the X (var/pred/res) box. Next, click the Histogram of selected X (variable, predicted, or residual value) button to display a histogram of the distribution of Mahalanobis distances.
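The distance-from-centroid idea can be illustrated with a short Python sketch, assuming NumPy is available. It computes squared Mahalanobis distances on synthetic predictor data with one injected extreme case; the function name mahalanobis_sq is hypothetical, and whether STATISTICA reports the squared or unsquared distance is not assumed here.

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the centroid,
    using the sample covariance matrix of the predictors."""
    centered = X - X.mean(axis=0)              # deviations from the centroid
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    # Row-wise quadratic form: d2[i] = centered[i] @ inv_cov @ centered[i]
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)

# Synthetic predictors (an assumption for illustration); case 0 is made extreme.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
X[0] = [8.0, -8.0]

d2 = mahalanobis_sq(X)
print("Most extreme case:", int(np.argmax(d2)))
```

Unlike a plain Euclidean distance from the centroid, this measure accounts for the variances of the predictors and the correlations among them, so a case is judged relative to the overall shape of the point cloud.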

It appears that there is one outlier case on Mahalanobis distances. To identify this case, click on the Residuals 1 tab, select Mahalanobis distance in the Sort obs by drop-down box, then click the Predicted and residuals button to display the Observed, Predicted, and Residual Values spreadsheet.

Note that Shelby county (in the first line) appears somewhat extreme as compared to the other counties in the spreadsheet. If you look at the raw data you will find that, indeed, Shelby county is by far the largest county in the data file with many more persons employed in agriculture (variable N_Empld). Probably, it would have been wise to express those numbers in percentages rather than in absolute numbers, and in that case, the Mahalanobis distance of Shelby county from the other counties in the sample would probably not have been as large. As it stands, however, Shelby county is clearly an outlier.

Deleted residuals. Another very important statistic that allows one to evaluate the seriousness of the outlier problem is the deleted residual. This is the standardized residual for the respective case that one would obtain if the case were excluded from the analysis. Remember that the multiple regression procedure fits a regression surface to express the relationship between the dependent and predictor variables. If one case is clearly an outlier (as is Shelby county in this data), then there is a tendency for the regression surface to be "pulled" by this outlier so as to account for it as much as possible. As a result, if the respective case were excluded, a completely different surface (and B coefficients) would emerge. Therefore, if the deleted residual is grossly different from the standardized residual, you have reason to believe that the regression analysis is seriously biased by the respective case. In this example, the deleted residual for Shelby county is an outlier that seriously affects the analysis. You can plot the residuals against the deleted residuals by first selecting the Raw option button under Resids for default plots and then clicking the Res. & del. res. button, which will produce a scatterplot of these values. The scatterplot clearly shows the outlier; to label the outlier (Shelby), click the brushing toolbar button to display the Brushing 2D dialog, and then Label the respective point.
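The "pull" that an outlier exerts on the regression surface can be demonstrated with a small Python sketch, assuming NumPy is available. It uses synthetic data with one injected high-leverage case, and the function names are hypothetical. Note that the sketch computes raw (unstandardized) deleted residuals by literally refitting the model without each case; any rescaling that STATISTICA applies is not reproduced here.

```python
import numpy as np

def raw_and_deleted_residuals(X, y):
    """Return (raw, deleted) residuals. The deleted residual for case i is the
    prediction error for case i from a model fitted without case i."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    raw = y - design @ coef
    deleted = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i               # leave case i out
        coef_i, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        deleted[i] = y[i] - design[i] @ coef_i
    return raw, deleted

def leverages(X):
    """Diagonal of the hat matrix for an intercept-plus-X design."""
    design = np.column_stack([np.ones(len(X)), X])
    return np.diag(design @ np.linalg.pinv(design))

# Synthetic data (an assumption for illustration); case 0 is extreme on both sides.
rng = np.random.default_rng(3)
X = rng.normal(size=(25, 2))
X[0] = [6.0, 6.0]
y = 1.0 + X @ np.array([0.8, -0.5]) + rng.normal(scale=0.3, size=25)
y[0] += 4.0

raw, deleted = raw_and_deleted_residuals(X, y)
h = leverages(X)
print("Case 0 raw vs. deleted residual:", raw[0], deleted[0])
```

In ordinary least squares the refitting loop is not strictly necessary, because the deleted residual equals the raw residual divided by one minus the case's leverage (raw / (1 - h)); the loop is kept here because it mirrors the definition given in the text.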

Normal probability plots. There are many additional graphs available from the Residuals tabs. Most of them are more or less straightforward in their interpretation; however, the normal probability plots will be commented on here.

As previously mentioned, multiple linear regression assumes linear relationships between the variables in the equation, and the normal distribution of residuals. If these assumptions are violated, your final conclusion may not be accurate. The normal probability plot of residuals will give you an indication of whether or not gross violations of the assumptions have occurred. Click the Normal button under Probab. plot of resids to produce this plot.

This plot is constructed as follows. First the standardized residuals are rank ordered. From these ranks, z values can be computed (i.e., standard values of the normal distribution) based on the assumption that the data come from a normal distribution. These z values are plotted on the y-axis in the plot.

If the observed residuals (plotted on the x-axis) are normally distributed, then all values should fall onto a straight line in the plot; in this plot, all points follow the line very closely. If the residuals are not normally distributed, then they will deviate from the line. Outliers may also become evident in this plot.
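The construction of the plot can be sketched in Python using only the standard library. The function name normal_scores and the example residuals are hypothetical, and the plotting position used to turn ranks into probabilities, (rank - 0.375) / (n + 0.25), is one common convention assumed here; STATISTICA may use a different formula.

```python
from statistics import NormalDist

def normal_scores(residuals):
    """Expected normal z value for each residual, based on its rank.
    Assumption: plotting position (rank - 0.375) / (n + 0.25)."""
    n = len(residuals)
    order = sorted(range(n), key=lambda i: residuals[i])  # indices in rank order
    scores = [0.0] * n
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    for rank, i in enumerate(order, start=1):
        p = (rank - 0.375) / (n + 0.25)
        scores[i] = standard_normal.inv_cdf(p)
    return scores

# Hypothetical standardized residuals, for illustration only.
residuals = [0.3, -1.2, 0.0, 2.1, -0.4]
scores = normal_scores(residuals)

# Each pair is one point of the plot: (observed residual on x, expected z on y).
points = sorted(zip(residuals, scores))
for x, z in points:
    print(f"{x:6.2f}  {z:6.3f}")
```

If the residuals were normally distributed, the printed pairs would lie close to a straight line when plotted.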

If there is a general lack of fit, and the data seem to form a clear pattern (e.g., an S shape) around the line, then the dependent variable may have to be transformed in some way (e.g., a log transformation to "pull in" the tail of the distribution, etc.). A discussion of such techniques is beyond the scope of this example (Neter, Wasserman, and Kutner, 1985, pp. 134-141, present an excellent discussion of transformations as remedies for non-normality and non-linearity); however, too often researchers simply accept their data at face value without ever checking for the appropriateness of their assumptions, leading to erroneous conclusions. For that reason, one design goal of the GLM module was to make residual (graphical) analysis as easy and accessible as possible.

See also GLM - Index.