Multidimensional Scaling - Example

Overview and Data File. This example is based on the data file Nations.sta. These data are discussed in Kruskal and Wish (1978, page 30). The data file contains the mean similarity ratings of 18 students for 12 countries. The countries are Brazil, Congo, Cuba, Egypt, France, India, Israel, Japan, Mainland China, Russia, USA, and Yugoslavia. A partial listing of this similarity matrix is shown below.

Note that you can produce a similar matrix file by entering the distances into a new spreadsheet following the matrix data format conventions (as described in the Matrix File Format topic).

Specifying the Analysis. Open the Nations.sta data file, and start the Multidimensional Scaling module:

Ribbon bar. Select the Home tab. In the File group, click the Open arrow and on the menu, select Open Examples to display the Open a STATISTICA Data File dialog box. The data file is located in the Datasets folder. Then, select the Statistics tab. In the Advanced/Multivariate group, click Mult/Exploratory and on the menu, select Multidimensional Scaling to display the Multidimensional Scaling Startup Panel.

Classic menus. On the File menu, select Open Examples to display the Open a STATISTICA Data File dialog box. The Nations.sta data file is located in the Datasets folder. Then, on the Statistics - Multivariate Exploratory Techniques submenu, select Multidimensional Scaling to display the Multidimensional Scaling Startup Panel.

In the Startup Panel on the Quick tab, click the Variables button to display a standard variable selection dialog box. Select all of the variables (i.e., objects or nations) for the analysis, and then click the OK button to close the variable selection dialog box and return to the Startup Panel.

By default, STATISTICA calculates a two-dimensional solution for this similarity matrix and estimates the initial configuration via principal components analysis. Alternatively, on the Options tab, you can specify the initial configuration by selecting a STATISTICA raw data file that contains the initial coordinates.

Click the OK button to simply accept the default settings. First, the initial (starting) configuration will be computed, and the Parameter Estimation dialog box will be displayed. [Note that you can later view these initial configurations by clicking the Start (initial) configuration button on the Results dialog box - Review & Save tab.]

Performing the Analysis. The iterative algorithm for finding an optimum configuration proceeds in two stages: First, STATISTICA will use a method known as steepest descent. The respective number of steepest descent iterations is listed in the Parameter Estimation dialog box in the first column (labeled iter. s).

After each iteration under steepest descent, STATISTICA will perform up to five additional iterations to "fine-tune" the configuration (see Technical Notes for details). The respective numbers of these iterations are listed in the Parameter Estimation dialog box in the second column (labeled iter. t).

In addition, the stress value (Kruskal, 1964) and coefficient of alienation (Guttman, 1968) are calculated and displayed at each step (see also the Introductory Overviews and Technical Notes). A detailed discussion of this iterative procedure can be found in Schiffman, Reynolds, and Young (1981, pages 366-370).
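To make the two fit measures concrete, here is a minimal sketch in Python. It assumes the commonly cited textbook definitions (Kruskal's stress formula 1 and Guttman's coefficient of alienation); the exact normalization that STATISTICA reports is described in its Technical Notes and may differ. The function names and the three-pair input are hypothetical and serve only as an illustration.

```python
import math

def kruskal_stress1(distances, d_hats):
    """Kruskal's stress formula 1: sqrt(sum((d - dhat)^2) / sum(d^2))."""
    if len(distances) != len(d_hats):
        raise ValueError("inputs must have the same length")
    raw = sum((d - h) ** 2 for d, h in zip(distances, d_hats))
    scale = sum(d ** 2 for d in distances)
    return math.sqrt(raw / scale)

def alienation(distances, d_stars):
    """Coefficient of alienation: sqrt(1 - mu^2), where
    mu = sum(d * dstar) / sqrt(sum(d^2) * sum(dstar^2))."""
    if len(distances) != len(d_stars):
        raise ValueError("inputs must have the same length")
    cross = sum(d * s for d, s in zip(distances, d_stars))
    norm = math.sqrt(sum(d ** 2 for d in distances) * sum(s ** 2 for s in d_stars))
    mu = cross / norm
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 1.0 - mu ** 2))

# Hypothetical reproduced distances and transformed data for three pairs.
d = [1.0, 2.0, 3.0]
d_hat = [1.0, 2.5, 2.5]
print(kruskal_stress1(d, d_hat))
print(alienation(d, d_hat))
```

Both measures are 0 when the reproduced distances match the transformed input data exactly, and they grow as the fit deteriorates.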

After STATISTICA has determined the best two-dimensional configuration, it will display the final stress value; click the OK button to display the Results dialog box.

Results. You can examine the results in spreadsheets or graphs via the options available in the Results dialog box.

First, examine the table of actual distances and estimated distances.

Reproduced and observed distances. To evaluate the fit of the two-dimensional solution, click the Summary statistics button on the Advanced tab.

The columns labeled D-hat and D-star contain the monotone transformations of the input data (see the Introductory Overviews): D-stars are rank images calculated according to Guttman (1968); D-hats are monotone regression estimates calculated according to Kruskal (1964).
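The two transformations can be sketched as follows. This is an illustration under stated assumptions, not STATISTICA's implementation: the input is treated as dissimilarities (for similarity ratings such as those in Nations.sta the ordering would be reversed), ties are not given any special treatment, and all function names are hypothetical. D-hat is obtained by monotone regression using the pool-adjacent-violators approach; D-star assigns the k-th smallest reproduced distance to the pair with the k-th smallest dissimilarity.

```python
def monotone_regression(values):
    """Least-squares non-decreasing fit via pool-adjacent-violators."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

def d_hat(dissimilarities, distances):
    """Monotone regression of the distances on the dissimilarity order."""
    order = sorted(range(len(dissimilarities)), key=lambda i: dissimilarities[i])
    fitted = monotone_regression([distances[i] for i in order])
    result = [0.0] * len(distances)
    for position, i in enumerate(order):
        result[i] = fitted[position]
    return result

def d_star(dissimilarities, distances):
    """Rank images: distances permuted into the dissimilarity order."""
    order = sorted(range(len(dissimilarities)), key=lambda i: dissimilarities[i])
    sorted_distances = sorted(distances)
    result = [0.0] * len(distances)
    for position, i in enumerate(order):
        result[i] = sorted_distances[position]
    return result

# Hypothetical data for four pairs; the second and third distances are out of order.
delta = [1, 2, 3, 4]
dist = [1.0, 3.0, 2.0, 4.0]
print(d_hat(delta, dist))   # the two out-of-order distances are averaged
print(d_star(delta, dist))  # the distances are re-sorted into rank order
```

In this sketch, out-of-order distances are averaged in D-hat and swapped in D-star, which is why the two columns can differ for the same pair.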

The rows in the spreadsheet, each representing one distance as specified in the similarity matrix, are sorted according to the size of D-star or D-hat. The first column of the spreadsheet references the elements of the original input matrix as D(X,Y), where X is the respective row in the input matrix, and Y is the respective column. The second column contains the reproduced Distances from the current configuration. If the fit of the current model (i.e., the current number of dimensions) is very good, then the order of reproduced distances should be approximately the same as that for the transformed input data (i.e., D-star or D-hat values). Out-of-order elements indicate lack of fit.

For example, D(2,1) would be the element in the second row and the first column of the input matrix (i.e., in our example, the comparison between Congo and Brazil). It appears that, by and large, the order of distances was approximately reproduced by the two-dimensional solution.
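The D(X,Y) notation can be decoded with a short lookup, sketched here in Python. It assumes that the rows and columns of the matrix follow the order in which the countries are listed in the overview (which is consistent with D(2,1) being Congo and Brazil) and that each pair is referenced once, below the diagonal; the function name is hypothetical.

```python
# Country order assumed to match the row/column order of the input matrix.
COUNTRIES = [
    "Brazil", "Congo", "Cuba", "Egypt", "France", "India",
    "Israel", "Japan", "Mainland China", "Russia", "USA", "Yugoslavia",
]

def pair_label(x, y):
    """Translate D(x, y), with 1-based row x and column y, into country names."""
    return f"{COUNTRIES[x - 1]} vs. {COUNTRIES[y - 1]}"

# Every element below the diagonal of the 12 x 12 matrix, one per pair of countries.
pairs = [(x, y) for x in range(1, len(COUNTRIES) + 1) for y in range(1, x)]

print(pair_label(2, 1))  # Congo vs. Brazil
print(len(pairs))        # 66 distinct pairs
```

With 12 countries there are 12 x 11 / 2 = 66 distinct pairs.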

Shepard diagram. Now examine the Shepard plot. As described in the Introductory Overviews, this plot is a scatterplot of the observed input data (similarities or dissimilarities) against the reproduced distances. The plot will also show the D-hat values, that is, the monotonically transformed input data, as a step function. To produce this plot, click the Shepard diagram button on the Quick or Advanced tab.

Most points in this plot are clustered around the step-line. Thus, you may conclude for now that this two-dimensional configuration is adequate for describing the similarities between countries.

Interpreting the configuration. In order to interpret this solution, you can display the configuration of nations in the two-dimensional space. Return to the Advanced tab and then click the Graph final configuration, 2D button. The Select two dimensions for scatterplot intermediate dialog box will be displayed, in which you can select the dimensions for the 2D scatterplot. Select Dimension 1 as the First (X), Dimension 2 as the Second (Y), and then click the OK button to produce the plot.

As described in the Introductory Overviews, the actual orientation of axes in multidimensional scaling is arbitrary (just as in Factor Analysis). Thus, you can rotate the configuration in order to achieve a more interpretable solution. Kruskal and Wish (1978) used a program called KYST (which uses a slightly different algorithm for multidimensional scaling) in order to analyze the present data, and they obtained a very similar solution. They then rotated their solution by approximately 45 degrees, and interpreted the rotated dimensions as developed vs. underdeveloped, and pro-western vs. pro-communist. Looking at the plot below (and mentally rotating it by 45 degrees), this interpretation seems to hold quite well (remember that this study was conducted in the 1970s).

(Note that in the plot above, the scaling has been adjusted via the All Options dialog box - Scaling tab.) In general, in addition to "meaningful dimensions," you should also look for clusters of points or particular patterns and configurations (such as circles, manifolds, etc.). For a detailed discussion of how to interpret final configurations, see Borg and Lingoes (1987), Borg and Shye (in press), or Guttman (1968).
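The reason a rotation such as the 45-degree rotation described above is permissible is that it leaves all distances between points unchanged. The following Python sketch illustrates this; the coordinates are hypothetical and are not the configuration computed for Nations.sta.

```python
import math

def rotate(points, degrees):
    """Rotate 2D points counterclockwise about the origin."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def distance(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Hypothetical final configuration of three objects.
config = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
rotated = rotate(config, 45)

# Every interpoint distance is the same before and after the rotation.
for i in range(len(config)):
    for j in range(i):
        before = distance(config[i], config[j])
        after = distance(rotated[i], rotated[j])
        print(i, j, math.isclose(before, after))
```

Because the fit measures depend only on the distances, the rotated configuration fits the data exactly as well as the original one; only the axes change.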

Continuing the Analysis. Now click the Cancel button in the Results dialog box to return to the Multidimensional Scaling Startup Panel, and select the Options tab.

Note that the default settings on the Options tab are now different from those shown when the program was first started. Multidimensional Scaling will remember the configuration from the previous analysis (unless you specify a new data file or select new cases). Also, the default Number of dimensions on the Quick tab is now 1. You could now click the OK button to compute the one-dimensional solution, using the configuration for the first dimension from the previous analysis as the starting configuration. In this manner, you can efficiently evaluate several consecutive solutions, starting with several dimensions (and working your way down to the one-dimensional solution).

Scree test: Plotting the stress values. This example began with the two-dimensional solution. Actually, if you are unsure about the dimensionality underlying the matrix, you should plot the stress values for consecutive numbers of dimensions. Then, find the place where the smooth decrease of stress values appears to level off to the right of the plot. To the right of this point, presumably, one finds only "factorial scree" - "scree" is the geological term referring to the debris that collects on the lower part of a rocky slope (see, for example, Kruskal and Wish, 1978, pages 53-56, for a discussion of this plot). Shown below is this scree plot. [Note that this plot was created by first creating a new spreadsheet containing the D-star: Raw stress values (that can be found in the Results dialog box summary box) for consecutive dimensions (1 through 6) for the present data, and then selecting Line Plot (Variables) on the Graphs - 2D Graphs tab or menu.]
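If you prefer to inspect the numbers behind a scree plot, the decrease in stress gained by each additional dimension can be tabulated as sketched below in Python. The stress values used here are placeholders invented purely for illustration; they are NOT the values obtained for Nations.sta, which you would read from the Results dialog box. The function name is hypothetical.

```python
def stress_drops(stress_by_dimension):
    """Return the decrease in stress gained by each added dimension."""
    return [a - b for a, b in zip(stress_by_dimension, stress_by_dimension[1:])]

# Placeholder stress values for 1 through 6 dimensions (illustration only).
stress = [0.40, 0.20, 0.15, 0.12, 0.10, 0.09]

for dims, drop in enumerate(stress_drops(stress), start=2):
    print(f"{dims - 1} -> {dims} dimensions: stress falls by {drop:.2f}")
```

The point where these decreases become small and roughly constant corresponds to the place where the plotted curve levels off.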

Based on the plot above, the two-dimensional solution would, indeed, have been chosen. Perhaps you also would have looked at the three-dimensional solution. You can be the judge of whether the three-dimensional solution is more meaningful than the two-dimensional one. Shown below is a 3D scatterplot of the solution when 3 is specified as the Number of dimensions on the Multidimensional Scaling Startup Panel - Quick tab. To produce this graph, click the Graph of final configuration, 3D button on the Results - Quick tab (this button is dimmed if 1 or 2 dimensions were specified in the Startup Panel; note that when you click this button you will be prompted to select the dimensions to plot in the 3D graph via the Select three dimensions for scatterplot dialog box).

See also the Multidimensional Scaling Index and Overviews.