Example 2. Log-Linear Analysis of Frequency Tables

Overview. This example is based on a "classic" data set reported by Morrison et al. (1973) and discussed by Bishop, Fienberg, and Holland (1975). The data is contained in the data file Center.sta that is included with your STATISTICA program. The data file contains a frequency table of breast cancer patients, classified by whether or not they survived three years or longer after the diagnosis (obviously, this data is not representative of the chances of surviving breast cancer today).

The frequencies are reported separately for four different types of inflammation and appearance (MIN_MAL, MIN_BEN, GRT_MAL, GRT_BEN), three age groups (under 50, 50-69, over 69), and separately for three diagnostic centers (Tokyo, Boston, and Glamorgan). The complete table was entered into the spreadsheet, which is shown in the following image.

Note that the case name column is used to denote the levels of three factors, that is, the Location of the diagnostic center, the Age, and Survival (to the right in the case name column).

Goal of the analysis. In general, the goal of log-linear analysis of a frequency table is to uncover relationships between the categorical variables (factors) that make up the table. The Introductory Overview introduces the distinction between design variables and response variables, a distinction that basically corresponds to that between independent and dependent variables, respectively. The major response variable of interest in this table is Survival. All other factors are treated as design factors. Thus, you will not be concerned with any interactions between, for example, the location of the diagnostic center and the age of the patients or the appearance of the cancer.

Data Files. Before the actual analysis of the table begins, the different ways in which data files can be specified in the Log-Linear module will first be demonstrated. Note that the data file Center.sta (shown in the image above) contains only frequencies as values for the variables; there are no coding variables with text codes that identify the levels of the factors.

Open the Center.sta data file and start Log-Linear Analysis:

Ribbon bar. Select the Home tab. In the File group, click the Open arrow and select Open Examples to display the Open a STATISTICA Data File dialog box. Open the Center.sta data file, which is located in the Datasets folder. Then, select the Statistics tab. In the Advanced/Multivariate group, click Advanced Models and from the menu, select Log-Linear to display the Log-Linear Analysis Startup Panel.

Classic menus. From the File menu, select Open Examples to display the Open a STATISTICA Data File dialog box. Open the Center.sta data file, which is located in the Datasets folder. Then, from the Statistics - Advanced Linear/Nonlinear Models submenu, select Log-Linear Analysis of Frequency Tables to display the Log-Linear Analysis Startup Panel.

From the Input file drop-down list, select Frequencies w/out coding variables.

Then, click the Variables button to display the standard variable selection dialog box. Select all variables in the file, and click the OK button.

Specifying the table. In order to ensure that STATISTICA will understand how to interpret the data, that is, how to organize the numbers into the table, click the Specify table button to display the Specify the dimensions of the table dialog box (see the next image). Internally, STATISTICA will simply read the frequencies in the four variables as one long string of numbers, reading row by row, starting from the left-most variable. The information you supply via the Specify the dimensions of the table dialog box allows STATISTICA to "understand" the structure of the table.

There are four factors that need to be entered in the Specify the dimensions of the table dialog box: Appearnc (and inflammation) of the cancer, Survival, Age, and Location of the diagnostic center. The number of levels in each factor also needs to be specified.

STATISTICA will interpret the first factor entered in this dialog box to be the one with the "fastest-changing subscript," the second factor entered as the one with the second-fastest changing subscript, and so on. Because STATISTICA reads the frequencies across rows, the factor with the fastest-changing subscript in this example is the factor whose four levels are listed in the column headings of the spreadsheet - Appearnc. Therefore, enter Appearnc as the Factor Name and 4 as the No. of levels for that factor in the first line. The next-fastest changing factor is Survival; the levels of this factor change from line to line in the spreadsheet. Therefore, enter the Factor Name Survival and 2 levels in line 2. Specify the remaining two factors as follows: 3 levels for factor Age, and 3 levels for factor Location.

Click the OK button to return to the Startup Panel, where the table to be analyzed will be displayed in the Summary box (the area at the top of the dialog box).

At this point, the table is ready to be analyzed.
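The "fastest-changing subscript" rule described above can be sketched in a few lines of Python. This is only an illustration of the reading order, not STATISTICA's internal code; the function name cell_for_position is hypothetical. Given the position of a number in the long string of frequencies, it returns the level of each factor for that cell, with the factors listed in the order they were entered in the dialog box.

```python
def cell_for_position(position, levels):
    """Map a 0-based reading position to 1-based factor levels.

    `levels` lists the number of levels per factor, fastest-changing first.
    """
    cell = []
    for n_levels in levels:
        # The remainder is the level of the current factor; the quotient
        # carries over to the next (slower-changing) factor.
        position, index = divmod(position, n_levels)
        cell.append(index + 1)
    return tuple(cell)


# Factor order and level counts as entered in the dialog box:
# Appearnc (4), Survival (2), Age (3), Location (3).
LEVELS = (4, 2, 3, 3)

n_cells = 1
for n_levels in LEVELS:
    n_cells *= n_levels

# The first four numbers read form the first spreadsheet row:
# only Appearnc changes.
first_row = [cell_for_position(p, LEVELS) for p in range(4)]

# The fifth number starts the next row: Survival moves to its second level.
fifth_value = cell_for_position(4, LEVELS)

# The last number read belongs to the last level of every factor.
last_value = cell_for_position(n_cells - 1, LEVELS)

print(n_cells, first_row, fifth_value, last_value)
```

The product of the level counts (4 x 2 x 3 x 3) must equal the number of frequencies in the file; a mismatch is a quick sign that the factors were entered with the wrong number of levels.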

Saving a table in an alternative format. In order to make the results more readable with meaningful text values denoting the levels of the factors, a different data file setup is preferred. When you save tables from within the Log-Linear module, they will automatically be saved in this preferred setup.

You could now click the OK button as if you were to begin the analysis, and on the subsequent Log-Linear Model Specification dialog box - Review/Save tab, click the Save the table button to save the table. The resulting file would contain one line for each cell of the table. In addition to a variable containing the respective cell frequencies, that file will also include one variable for each factor in the table, with integer codes to denote the respective levels.

Such a file was previously created, and then appropriate text labels were added to it. This alternative way of representing the table was used in the data file Center2.sta. The following image shows a portion of this data file.

This file will be used in subsequent analyses, which will make the output more readable.

Specifying the Analysis. Close all analyses, data files, and workbooks currently open. Then, open the Center2.sta data file, and start a new Log-Linear analysis.

In the Log-Linear Analysis Startup Panel, from the Input file drop-down list, select Frequencies with coding variables (which is the way that the table is represented in the data file).

Now, click the Variables button to display the standard variable selection dialog box. Select Frequncy as the Variable with freq. counts, and select Appearnc through Location as the Variables with codes. Click the OK button to return to the Startup Panel.

Finally, click the Select codes button to display the Select codes for factors dialog box. Specify the respective codes that were used to denote the levels of the factors. To select all codes, you can use the asterisk (*) convention in each of the codes selection fields, click each of the All buttons, or click the Select All button to select all of the codes for each of the variables. Click the OK button to return to the Startup Panel, which will look like this.

Results. You are now ready to begin the analysis; click the OK button in the Startup Panel to display the Log-Linear Model Specification dialog box.

Observed table. First, select the Review/Save tab. You can review the observed table via the Review complete observed table button. When you click this button, the Specify how to Review the Table dialog box is displayed, in which you can flexibly specify the way in which to review the table.

For example, you can specify to review the table with Age (factor 3) as the Column variable and Location (factor 4) as the Row variable (within each level of the other factors), etc.

Click the OK button to display the specified tables.

Finding a model. Finding an appropriate model for a multi-way (more than two-way) frequency table is often not an easy task. The Log-Linear module provides a number of options to facilitate the search. In particular, the Automatic selection of best model button on the Log-Linear Analysis Model Specification dialog box - Quick tab and on the Advanced tab is useful because it will automatically find the least complex model that will fit the data. This button will be used later to see whether it will arrive at the same or similar conclusions (model) as you will when "left to your own devices."

Simultaneous test of all k-factor interactions. A first step toward understanding the degree of complexity of the table is to review the table of simultaneous tests for all k-factor interactions, and the tests of all marginal and partial association models. These tests will be computed when you click the Test all marginal & partial association models button on the Advanced tab or on the Review/Save tab.

The interpretation of the tests of all k-factor interactions is discussed in the Introductory Overview. In short, the spreadsheet above shows that the improvement in fit when including all 2-way interactions in the model (K-Factor = 2) is highly significant (i.e., a model without any two-way interactions provides a very poor fit). The improvement in fit when adding all 3-way interactions to the model (K-Factor = 3) is not significant (i.e., the model with all two-way interactions already provides an adequate fit). Therefore, you can conclude that the least complex model that will fit the observed table need not contain any three-way associations, but will contain one or more two-way associations.

Tests of all marginal and partial associations. To see which of the two-way associations seems to be significant, the Tests of Marginal and Partial Association spreadsheet will be reviewed.

The interpretation of this table is also discussed in the Introductory Overview. In short, the Partial Association Chi-square test evaluates the significance of the respective effect (indicated by the digits in the Effect column) by comparing the model that includes all interactions of the same order with the model without the respective effect.

For example, look at Effect 12. This effect represents the association or interaction between factors 1-Appearnc and 2-Survival. When dropping this effect from the model with all other two-way associations, the difference in the (maximum likelihood) Chi-square values is equal to 10.18, with 3 Degrees of freedom. This value is significant at the p<.02 level. Therefore, the model fit becomes significantly worse when excluding this two-way interaction from the model; thus you would include it.

To make an analogy to Multiple Regression, the partial association test gauges the unique contribution of the respective effect (association or interaction) to the fit of the model. The logic of the test is analogous to that of partial correlations.

The Marginal Association test of Effect 12 denotes the difference between the model without any two-way interactions and the model that includes the 12 interaction (and no other two-way interactions). As you can see, the model fit improves significantly when adding the association between factors 1-Appearnc and 2-Survival (Chi-square = 9.49, df = 3, p<.03). To continue the analogy to multiple regression, this test is the equivalent to the regular (zero-order) correlation coefficient.
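The significance levels quoted for Effect 12 can be checked from the Chi-square values and degrees of freedom alone. The following Python sketch uses only the standard library; the helper chi2_sf is a hypothetical name and relies on the standard closed-form expression for the upper tail of the Chi-square distribution with integer degrees of freedom. It is not the routine STATISTICA uses.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^i / i!.
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: complementary error function plus a finite sum of
    # exp(-x/2) * (x/2)^(i - 1/2) / Gamma(i + 1/2).
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (i + 0.5)
    return total


# Values reported in the text for Effect 12 (Appearnc by Survival).
p_partial = chi2_sf(10.18, 3)   # partial association test
p_marginal = chi2_sf(9.49, 3)   # marginal association test

print(f"partial:  p = {p_partial:.4f}")
print(f"marginal: p = {p_marginal:.4f}")
```

Both probabilities fall below the bounds quoted in the text (p<.02 and p<.03, respectively).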

Choosing the effects for the model. If the Tests of Marginal and Partial Association spreadsheet is now reviewed for the significance of all two-way interactions, you will find that the associations that should be included in the model are:

(1) The association between factors 1-Appearnc and 2-Survival (Effect 12),

(2) The association between factors 1-Appearnc and 4-Location (Effect 14),

(3) The association between factors 2-Survival and 4-Location (Effect 24),

(4) The association between factors 3-Age and 4-Location (Effect 34).

The association between 2-Survival and 3-Age is not significant when evaluated with all other two-way associations (see Partial Association for Effect 23). Therefore, it will not be included in the model for now.

Rules for specifying a model. You are now ready to specify and test a particular model. However, before proceeding, please review a few important points. First, interaction or association effects automatically include lower-order effects. So, for example, if you specify the 12 association, you request to fit the marginal table of factors 1 by 2. This table will obviously contain or reflect the marginal tables of 1 and 2 alone. Therefore, it is not necessary to explicitly include the lower-order effects when specifying a model. Second, make sure that all effects are reflected in the model. For example, if you specify only 12 and make no other reference to other factors, then you basically hypothesize that the marginal frequencies for all other factors are equal. In this example, it would be silly to hypothesize that there are an equal number of patients diagnosed in the three diagnostic centers. Therefore, fitting a table that forces those numbers to be equal unnecessarily worsens the fit of the model.
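The two rules above can be illustrated with a short Python sketch. It expands a model written in the digit notation of this example into all the effects it implies, lists any factors that the model fails to reference, and counts the degrees of freedom using the usual textbook rule (number of cells minus number of independent parameters). The function names are hypothetical, and the counting rule is an assumption about how the degrees of freedom are obtained; it does, however, reproduce the values reported later in this example.

```python
from itertools import combinations

# Number of levels per factor: 1-Appearnc, 2-Survival, 3-Age, 4-Location.
LEVELS = {1: 4, 2: 2, 3: 3, 4: 3}


def implied_effects(model):
    """All effects implied by the terms, e.g. '134' -> 1, 3, 4, 13, 14, 34, 134."""
    effects = set()
    for term in model:
        factors = sorted(int(ch) for ch in term)
        for size in range(1, len(factors) + 1):
            effects.update(combinations(factors, size))
    return effects


def unreferenced_factors(model):
    """Factors not mentioned in any term (their margins would be forced equal)."""
    used = {int(ch) for term in model for ch in term}
    return sorted(set(LEVELS) - used)


def degrees_of_freedom(model):
    """Cells in the table minus independent parameters of the model."""
    n_cells = 1
    for n_levels in LEVELS.values():
        n_cells *= n_levels
    n_params = 1  # constant term
    for effect in implied_effects(model):
        size = 1
        for factor in effect:
            size *= LEVELS[factor] - 1
        n_params += size
    return n_cells - n_params


print(sorted(implied_effects(["12"])))           # 1, 2, and 12
print(unreferenced_factors(["12"]))              # Age and Location are missing
print(degrees_of_freedom(["12", "24", "134"]))
print(degrees_of_freedom(["12", "134"]))
```

For the model 12, 134 this rule gives 32 degrees of freedom, and for 12, 24, 134 it gives 30, in agreement with the hierarchical tests discussed below.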

Specifying the model. To specify the model that was previously derived, click the Specify model to be tested button on the Quick or Advanced tab. You can now type in the desired model in the Specify Model to be Tested dialog box. In this case, you know that you need to include the two-way associations 12, 14, 24, and 34. However, as discussed in the Introductory Overview, you generally include all interactions between design variables in the model so that they will not contribute to the overall lack of model fit. It was assumed in this study that you were not interested in any interactions between the inflammation/appearance of the cancer, the age, and the diagnostic center. It may well be that the distribution of age is different in different diagnostic centers, or that the appearance of the cancer is different in different age groups. However, you are mostly interested in the factors that are associated with survival. Since you are not interested in any associations between design variables in this study, fit the three-way association (134) between all design factors, in addition to the 12 and 24 effects. Hence, to specify this model, type 12, 24, 134 into the Specify Model to be Tested dialog box.

Evaluating the goodness of fit. Now click the OK button to display the Results dialog box. As you can see in the results summary, the overall model fits the observed table (the Chi-square tests are not significant). Therefore, you can conclude that the specified model is sufficient to explain the frequencies in the table.
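To make the idea of "fitting the marginal tables" and the resulting Chi-square statistics more concrete, here is a minimal Python sketch. It uses iterative proportional fitting, a standard technique for hierarchical log-linear models; whether STATISTICA uses exactly this procedure is an assumption. The small 2 x 2 x 2 table is hypothetical and is not the Center2.sta data, the factors are numbered from 0, and all function names are illustrative.

```python
import math
from collections import defaultdict


def margin(table, axes):
    """Collapse a {cell: value} table onto the given axes."""
    out = defaultdict(float)
    for cell, value in table.items():
        out[tuple(cell[a] for a in axes)] += value
    return out


def ipf(observed, terms, n_iter=50):
    """Fitted frequencies whose margins match the observed margins in `terms`."""
    fitted = {cell: 1.0 for cell in observed}
    for _ in range(n_iter):
        for axes in terms:
            obs_m = margin(observed, axes)
            fit_m = margin(fitted, axes)
            # Rescale every cell so this fitted margin equals the observed one.
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                fitted[cell] *= obs_m[key] / fit_m[key]
    return fitted


def likelihood_ratio_chi2(observed, fitted):
    """Maximum likelihood ratio Chi-square: 2 * sum(obs * ln(obs / fitted))."""
    return 2.0 * sum(o * math.log(o / fitted[c])
                     for c, o in observed.items() if o > 0)


def pearson_chi2(observed, fitted):
    """Pearson Chi-square: sum((obs - fitted)^2 / fitted)."""
    return sum((o - fitted[c]) ** 2 / fitted[c] for c, o in observed.items())


# Hypothetical 2 x 2 x 2 table of counts (NOT the breast cancer data).
OBSERVED = {
    (0, 0, 0): 10, (0, 0, 1): 20, (0, 1, 0): 30, (0, 1, 1): 40,
    (1, 0, 0): 15, (1, 0, 1): 25, (1, 1, 0): 35, (1, 1, 1): 45,
}

# Model with two two-way terms: factors 0 by 1, and factors 1 by 2.
TERMS = [(0, 1), (1, 2)]

fitted = ipf(OBSERVED, TERMS)
g2_value = likelihood_ratio_chi2(OBSERVED, fitted)
pearson_value = pearson_chi2(OBSERVED, fitted)
print(f"ML Chi-square = {g2_value:.3f}, Pearson Chi-square = {pearson_value:.3f}")
```

Small values of both statistics relative to the degrees of freedom indicate that the fitted frequencies are close to the observed ones, which is what a non-significant test expresses.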

On the Quick tab, click the Plot of observed vs. fitted button to see whether there are any major discrepancies between the observed and the fitted frequencies in the table.

Most points in the graph above fall onto a straight line. Thus it appears that there are no major outliers ("misfitted" cells) in the table.

Note that you can use the Interactive Graphics Controls at the bottom of the graph window to adjust the transparency of the markers.

Hierarchical tests of alternative models. Before interpreting the results, test the statistical significance of the 24 and 12 associations and the significance of the association between Age and Survival (23), which is not included in this model. As described in the Introductory Overview, you can evaluate the statistical significance of effects by comparing the Chi-square of the model that includes the effect with the Chi-square of the model that excludes the effect.

For example, to test the 24 association, fit the model 12, 134 (using the Specify Model to be Tested dialog box as described above), and compare this Chi-square with the previous model Chi-square (this model is the same as the current model, except that the 24 interaction was dropped).

If you fit the model 12, 134, the resulting Maximum Likelihood ratio Chi-square value will be 43.37 with 32 degrees of freedom (see the summary box of the Results dialog box). This Chi-square is significantly worse (i.e., larger) than that of the previous model (which included the 24 interaction): The Chi-square difference is equal to 43.37 - 31.74 = 11.63, and the degrees of freedom difference is 32 - 30 = 2; the resulting significance level is p<.005. Therefore, you can conclude that the 24 interaction is significant (i.e., there is a significant association between the survival rate and the diagnostic center). Following the same logic, you will find that the Appearnc (factor 1) by Survival (factor 2) association is also significant (Chi-square difference = 10.23, df difference = 3, p<.025). To assess the significance of the 23 (Age and Survival) association that is not in the current model, add it to the model and assess the significance of the improvement in model fit. As you will see, the 23 association does not significantly improve the fit of the model to the observed table.
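The arithmetic of the hierarchical test for the 24 association can be reproduced as follows. The sketch uses the Chi-square values and degrees of freedom quoted above and the fact that, for exactly 2 degrees of freedom, the upper-tail probability of the Chi-square distribution is exp(-x/2). The variable and function names are illustrative.

```python
import math


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variate with exactly 2 df."""
    return math.exp(-x / 2.0)


# Values quoted in the text.
chi2_without_24, df_without_24 = 43.37, 32   # model 12, 134
chi2_with_24, df_with_24 = 31.74, 30         # model 12, 24, 134

chi2_diff = chi2_without_24 - chi2_with_24
df_diff = df_without_24 - df_with_24

# The closed form used here is only valid for a 2 df difference.
if df_diff != 2:
    raise ValueError("chi2_sf_df2 applies only to 2 degrees of freedom")

p_value = chi2_sf_df2(chi2_diff)
print(f"Chi-square difference = {chi2_diff:.2f}, df = {df_diff}, p = {p_value:.4f}")
```

The resulting probability is below .005, in line with the significance level given for the 24 interaction.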

Interpreting the results. The analysis so far has yielded two significant effects, that is, associations between design variables and the response variable: 1) a relationship between Appearnc (factor 1) and Survival (factor 2), and 2) an association between Location (factor 4) and Survival (factor 2). Now, examine the nature of these effects. Remember that fitting a model involves computing the expected values so that they reflect the relative frequencies of the respective marginal tables. Therefore, to interpret an effect, you would examine the marginal tables. Click the Marginal tables button on the Results - Quick tab for the 12, 24, 134 model in order to display those tables in individual spreadsheets (including the three-way table of the 134 effect). First, look at the Marg. Tabl. (freq+delta): Appearnc by Survival spreadsheet.

Careful examination of this table reveals that the survival rate of patients whose cancer was diagnosed as malignant (column headers Min_Mal and Grt_Mal) is roughly 2 to 1 (Survival YES to NO); for benign cancer that rate is about 3 to 1.

Note that in order to simplify this example, the original appearance and inflammation factors were combined into the single 4-level factor Appearnc. In order to treat inflammation and appearance as separate factors in the table, you could split Appearnc into two variables and re-analyze this table.

The Marg. Tabl. (freq+delta): Survival by Location spreadsheet looks like this.

It seems that the survival rate is highest for cancer patients diagnosed in Tokyo, about 3 to 1 (Survival YES to NO). In Boston and Glamorgan, that rate stands at about 2 to 1. Of course, you cannot infer any specific cause for this effect. Obviously there are any number of differences (not measured in this study) between the patients in Tokyo and Boston or Glamorgan. However, the apparently differential survival rates would certainly warrant further study.

Note that the frequencies in the marginal table will include the Delta constant as specified on the Log-Linear Analysis Model Specification dialog box - Advanced tab. By default, STATISTICA will add 0.5 (the Delta) to each cell frequency before fitting any models. Therefore, in order to obtain accurate marginal counts when using the Log-Linear module, be sure to set the Delta constant to 0.
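The effect of the Delta constant on a marginal table is easy to quantify. The Python sketch below assumes, as the note above describes, that Delta is added to every cell of the full table before the table is collapsed. The cell frequencies are hypothetical (every cell is set to 10) and serve only to show the size of the inflation; the function name marginal is illustrative.

```python
from itertools import product

DELTA = 0.5
LEVELS = (4, 2, 3, 3)  # Appearnc, Survival, Age, Location (0-based axes 0..3)


def marginal(table, axes):
    """Sum a {cell: frequency} table over all axes not listed in `axes`."""
    out = {}
    for cell, freq in table.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0.0) + freq
    return out


# Hypothetical table: every one of the 72 cells holds a frequency of 10.
raw = {cell: 10.0 for cell in product(*(range(n) for n in LEVELS))}
with_delta = {cell: freq + DELTA for cell, freq in raw.items()}

# Appearnc by Survival: each marginal cell collapses 3 x 3 = 9 cells.
inflation_12 = marginal(with_delta, (0, 1))[(0, 0)] - marginal(raw, (0, 1))[(0, 0)]

# Survival by Location: each marginal cell collapses 4 x 3 = 12 cells.
inflation_24 = marginal(with_delta, (1, 3))[(0, 0)] - marginal(raw, (1, 3))[(0, 0)]

print(inflation_12, inflation_24)
```

Under this assumption, each cell of the Appearnc by Survival marginal table is larger than the true count by 9 x 0.5 = 4.5, and each cell of the Survival by Location table by 12 x 0.5 = 6, which is why Delta should be set to 0 when exact marginal counts are needed.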

Automatic Stepwise Model Selection. The more complex the table, the more difficult it will be to find a model that fits and at the same time includes all effects of importance (significance). In fact, the final conclusion was arrived at "the hard way;" you could have also used the Automatic selection of best model button on the Log-Linear Analysis Model Specification dialog box - Advanced tab. Click the Cancel button in the Results dialog box to return to the Log-Linear Model Specification dialog box.

After you click the Automatic selection of best model button, the Automatic Selection of Best Model dialog box is displayed.

The algorithm. The algorithm used in the Log-Linear module to find a sufficient model for the observed table basically implements the same logic that you followed when you examined the k-level interactions table and the table of marginal and partial associations.

First, STATISTICA will determine the complexity or order of the interactions that need to be included in the model in order to make it fit to the observed table. The 1 - p(1) box controls the p-value that is used at this stage of the search to decide whether or not a model fits.

Next, STATISTICA will remove associations (of the order found in step one) from the model, step by step. At this stage, if an effect is found to be more significant than is specified in the 2 - p(2) box, then it is retained in the model.

The default settings for p(1) and p(2) are reasonable, so simply click the OK button to see which model will be chosen by STATISTICA.

Results. In the Automatic Selection of Best Model dialog box, you can see that the initial model consists of all two-way associations; this was also your starting point. The final model is basically the same one that you arrived at; namely, it includes the two major association effects of interest: 12 (Appearnc and Survival) and 24 (Location and Survival).

Note that this model is automatically "transferred" into the model specification dialog box (by default, the "best" model selected by STATISTICA will be entered into the edit field in the Specify Model to be Tested dialog box); thus, simply click the Further evaluate the best model button in the Automatic Selection of Best Model dialog box, and then click OK in the resulting Specify Model to be Tested dialog box, and the Results dialog box for the final model will be displayed.

Conclusions and Final Remarks. It can be concluded from the analysis of this table that the major factors associated with the 3-year survival of patients were the diagnosed malignancy of the cancer and the location of the center where it was diagnosed. Interestingly, age did not seem to be related to long-term survival.

As mentioned earlier, there exist any number of possible explanations for why the survival rate in Tokyo was higher than in the other diagnostic centers (time of diagnosis, dietary differences, cultural differences in "healthy behaviors," differences in the environment, etc.). However, the apparent differences revealed in this study would certainly be worthy of further investigation.

See also, Log-Linear Analysis of Frequency Tables - Index.