Example 1: Joining (Tree Clustering)

Data File. This example is based on a sample of different automobiles. Specifically, one particular model was randomly chosen from among those offered by the respective manufacturer. The following data for each car were then recorded:

1. The approximate price of the car (variable Price),

2. The acceleration of the car (time from 0 to 60 mph, in seconds; variable Acceler),

3. The braking performance of the car (braking distance from 80 mph to complete standstill; variable Braking),

4. An index of road holding capability (variable Handling), and

5. The gas-mileage of the car (miles per gallon; variable Mileage).

Scale of Measurement. All clustering algorithms at some point need to assess the distances between clusters or objects, and when computing distances, you need to decide on a scale. Because the different measures included here use entirely different types of scales (e.g., number of seconds, thousands of dollars, etc.), the data were standardized (via the Standardize command from the Data menu) so that each variable has a mean of 0 and a standard deviation of 1. It is very important that the dimensions (variables in this example) that are used to compute the distances between objects (cars in this example) are of comparable magnitude; otherwise, the analysis will be biased and rely most heavily on the dimension that has the greatest range of values.
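As an illustration of what standardization does, the following minimal Python sketch rescales a variable to a mean of 0 and a standard deviation of 1. It is not the STATISTICA Standardize command itself; the function name standardize, the sample values, and the use of the sample standard deviation (n - 1 in the denominator) are assumptions made for the example.

```python
import statistics

def standardize(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

# Hypothetical raw values for illustration only; they are not taken from Cars.sta.
price = [12000.0, 18000.0, 25000.0, 41000.0]   # dollars
acceler = [11.2, 9.8, 8.5, 6.1]                # seconds

z_price = standardize(price)
z_acceler = standardize(acceler)

# After standardization both variables are on the same unitless scale,
# so neither dominates a distance computation because of its units.
print([round(z, 3) for z in z_price])
print([round(z, 3) for z in z_acceler])
```

Without this step, a difference of a few thousand dollars in Price would swamp a difference of a few seconds in Acceler when distances are computed.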

The standardized data for this example are contained in the file Cars.sta. Open this data file via the File - Open Examples menu; it is in the Datasets folder.

Purpose of the Analysis. Given these data, can a taxonomy be developed for the automobiles included in the study? In other words, do these automobiles form "natural" clusters that can be labeled in a meaningful manner? First, perform a joining analysis (tree clustering, hierarchical clustering) on these data.

Specifying the Analysis. Select Cluster Analysis from the Statistics - Multivariate Exploratory Techniques menu to display the Clustering Method Startup Panel. Here, select Joining (tree clustering) and then click the OK button. Next, click the Variables button on the Cluster Analysis: Joining (Tree Clustering) dialog box - Quick tab to display the standard variable selection dialog box and select all of the variables. Then click the OK button to return to the Cluster Analysis: Joining (Tree Clustering) dialog box - Quick tab.

Now, we want to cluster the automobiles (cases) based on the different performance indices (variables). However, the default setting of the Cluster box on the Advanced tab is Variables (columns), so we need to change this setting. Depending on the research question at hand, one may cluster cases in some instances and variables in others. For example, we could be interested in whether the car performance measures (variables) form natural clusters. In this instance, however, we want to know whether the cars (cases) form clusters; therefore, select Cases (rows) in the Cluster box. Also, select Complete Linkage in the Amalgamation (linkage) rule box; this will be discussed shortly.

Distance measures. Remember that the tree clustering method will successively link together objects of increasing dissimilarity or distance. There are various ways to compute distances, and they are explained in the Introductory Overview. The most straightforward way to compute a distance is to consider the k variables as dimensions that make up a k-dimensional space. If there were three variables, then they would form a three-dimensional space. The Euclidean distance in that case would be the same as if we were to measure the distance with a ruler. Accept this default measure (Euclidean distance) in the Distance measure box for this example.
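The Euclidean distance described above can be sketched in a few lines of Python. The function name euclidean and the two sample points are hypothetical and serve only to show the computation for k = 3 variables.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points in k-dimensional space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical cars described by three standardized variables.
car_a = [0.0, 0.0, 0.0]
car_b = [1.0, 2.0, 2.0]

d = euclidean(car_a, car_b)  # sqrt(1 + 4 + 4)
print(d)  # 3.0
```

The same function works unchanged for the five variables in this example, because it simply sums the squared differences over however many dimensions the points have.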

Amalgamation (Linkage) rule. The other issue of some ambiguity in tree clustering is exactly how to determine the distances between clusters. Should we use the closest neighbors in different clusters, the furthest neighbors, or some aggregate measure? As it turns out, all of these methods (and more) have been proposed. The default method (Single Linkage) is the "nearest neighbor" rule. Thus, as we proceed to form larger and larger clusters of less and less similar objects (cars), the distance between any two clusters is determined by the closest objects in those two clusters. Intuitively, this is likely to result in "stringy" clusters; that is, STATISTICA will chain together clusters based on the particular location of single elements. Alternatively, we can use the Complete Linkage rule. In this case, the distance between two clusters is determined by the distance of the furthest neighbors. This will result in more "lumpy" clusters. As it turns out for these data, the single linkage rule does in fact produce rather "stringy" and undistinguished clusters.

So, for this analysis, the Complete Linkage method was chosen in the Amalgamation (linkage) rule box.
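To make the difference between the two rules concrete, the following Python sketch computes the distance between the same pair of clusters under each rule. The function names and the two-dimensional points are hypothetical; only the nearest-neighbor and furthest-neighbor definitions come from the discussion above.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(cluster_1, cluster_2):
    """Distance between clusters = distance of their closest members."""
    return min(euclidean(a, b) for a in cluster_1 for b in cluster_2)

def complete_linkage(cluster_1, cluster_2):
    """Distance between clusters = distance of their furthest members."""
    return max(euclidean(a, b) for a in cluster_1 for b in cluster_2)

# Two hypothetical clusters lying along a line.
cluster_1 = [(0.0, 0.0), (1.0, 0.0)]
cluster_2 = [(3.0, 0.0), (6.0, 0.0)]

d_single = single_linkage(cluster_1, cluster_2)      # between (1, 0) and (3, 0)
d_complete = complete_linkage(cluster_1, cluster_2)  # between (0, 0) and (6, 0)
print(d_single, d_complete)  # 2.0 6.0
```

Under single linkage one close pair of members is enough to make two clusters look near each other, which is why chains form; complete linkage requires every member of one cluster to be reasonably close to every member of the other.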

Results. Click the OK button in the Cluster Analysis: Joining (Tree Clustering) dialog box to begin the analysis. The tree clustering method is an iterative procedure, and after all objects have been joined, the Joining Results dialog box will be displayed. For this example, select the Advanced tab.

The tree diagram. The most important result to consider in a tree clustering analysis is the hierarchical tree. The Cluster Analysis module of STATISTICA offers two types of tree diagrams with two types of branches. For the standard style of tree diagram, select the Rectangular branches check box and click the Horizontal hierarchical tree plot button to create a horizontal Tree Diagram (Complete Linkage).

We can also create the tree diagram in a vertical style by clicking the Vertical icicle plot button.

Also, the branches of both types of tree diagrams can be set to rectangular or diagonal. To produce a tree diagram with diagonal branches, clear the Rectangular branches check box. The diagonal format may increase the readability of the diagram for solutions with "balanced" joining structures.

In addition, we can choose to scale the tree plot to a standardized scale with the Scale tree to dlink/dmax*100 check box. When we select this check box, the horizontal axis (or vertical axis for vertical icicle plots) will be scaled in percentages, specifically, as dlink/dmax*100. Thus, each linkage distance (dlink) is expressed as a percentage of the maximum linkage distance (dmax) in the data. If this check box is cleared, the scale will be based on the previously selected distance measure.
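The dlink/dmax*100 rescaling amounts to a single division per linkage distance. The sketch below assumes that dmax is the largest linkage distance in the solution; the function name scale_linkage and the distances are hypothetical.

```python
def scale_linkage(distances):
    """Express each linkage distance as a percentage of the largest one."""
    dmax = max(distances)  # assumption: dmax is the maximum linkage distance
    return [d / dmax * 100 for d in distances]

# Hypothetical linkage distances from four successive joining steps.
linkage_distances = [0.5, 1.0, 2.0, 4.0]

scaled = scale_linkage(linkage_distances)
print(scaled)  # [12.5, 25.0, 50.0, 100.0]
```

On this scale the final join always falls at 100, which makes tree diagrams from different distance measures easier to compare.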

Tree diagrams may seem confusing at first, but they become easier to read once their layout is understood. The diagram begins on the left in horizontal tree diagrams (or on the bottom in vertical icicle plots) with each car in its own cluster. As we move to the right (or up in vertical icicle plots), cars that are "close together" are joined to form clusters. Each node in the diagram represents the joining of two or more clusters; the location of a node on the horizontal (or vertical) axis represents the distance at which the respective clusters were joined.

Identifying Clusters. For this discussion, consider only horizontal hierarchical tree diagrams (see the tree diagram with the standardized scale), and begin at the top of the diagram. Apparently, first there is a cluster consisting of only Acura and Olds; next there is a group (i.e., cluster) of seven cars: Chrysler, Dodge, VW, Honda, Pontiac, Mitsubishi, and Nissan. As it turns out, in this sample the entry level models (more or less) of these brands were chosen. Thus, we may want to call this cluster the "economy sedan" cluster.  

The first two cars, Acura and Olds, join this cluster at the approximate linkage distance of 32; after that (to the right), this branch of the tree extends out to 60. Thus, these two cars could also be considered as members of the economy sedan cluster. Moving down the plot, a cluster starting with Audi extends to Ford, perhaps all the way to Eagle. These cars (i.e., the particular models chosen for the sample) more or less represent high-priced, luxury sedans; thus, this cluster can be identified as the "luxury" sedan cluster.  

Finally, at the bottom of the plot there are the Corvette and Porsche that are joined at the linkage distance of approximately 30.  

Amalgamation schedule. A non-graphical presentation of these results is the amalgamation schedule (click the Amalgamation schedule button on the Joining Results dialog box - Advanced tab).

The Amalgamation Schedule results spreadsheet lists the objects (cars) that are joined together at the respective linkage distances (shown in the first column of the spreadsheet).
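The idea behind an amalgamation schedule can be illustrated with a small, deliberately naive Python implementation of tree clustering under the complete linkage rule. This is a sketch, not the algorithm STATISTICA uses internally; the function names, the object labels A through D, and their coordinates are all hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage_schedule(points):
    """Return one (linkage distance, joined members) pair per joining step."""
    clusters = [[name] for name in points]  # every object starts on its own
    schedule = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest furthest-neighbor distance.
        for i, j in combinations(range(len(clusters)), 2):
            d = max(euclidean(points[a], points[b])
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        # Replace the two joined clusters with the merged one.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        schedule.append((d, merged))
    return schedule

# Hypothetical objects in two dimensions (not the cars in Cars.sta).
points = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 0.0), "D": (5.0, 2.0)}

schedule = complete_linkage_schedule(points)
for distance, members in schedule:
    print(f"{distance:6.3f}  {members}")
```

Each printed row corresponds to one row of an amalgamation schedule: the linkage distance first, followed by the objects that belong to the cluster formed at that step.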

Graph of amalgamation schedule. Click the Graph of amalgamation schedule button to display a line graph of the linkage distances at successive clustering steps.

This graph can be very useful for suggesting a cutoff for the tree diagram. Remember that in the tree diagram, as we move to the right (increase the linkage distances), larger and larger clusters are formed of greater and greater within-cluster diversity. If this plot shows a clear plateau, it means that many clusters were formed at essentially the same linkage distance. That distance may be the optimal cutoff when deciding how many clusters to retain (and interpret).
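One way to read a cutoff from the linkage distances numerically is sketched below in Python. The "largest jump" rule used here is a common heuristic swapped in for visual inspection of the graph; it is an assumption of this sketch, not a procedure offered by STATISTICA. The function names and the distances are hypothetical, and the cluster count assumes that linkage distances never decrease from one step to the next, as is the case for complete linkage.

```python
def largest_jump(linkage_distances):
    """Return (step index, distance) of the last join before the biggest increase."""
    jumps = [later - earlier
             for earlier, later in zip(linkage_distances, linkage_distances[1:])]
    step = jumps.index(max(jumps))
    return step, linkage_distances[step]

def clusters_at_cutoff(linkage_distances, n_objects, cutoff):
    """Number of clusters left if only joins at or below the cutoff are kept."""
    joins = sum(1 for d in linkage_distances if d <= cutoff)
    return n_objects - joins

# Hypothetical schedule for 7 objects: a plateau of small distances, then a jump.
linkage_distances = [0.4, 0.5, 0.6, 0.7, 2.9, 3.4]

step, cutoff = largest_jump(linkage_distances)
n_clusters = clusters_at_cutoff(linkage_distances, 7, cutoff)
print(step, cutoff, n_clusters)  # 3 0.7 3
```

Here the first four joins occur at nearly the same distance, so cutting the tree just after them retains three clusters. In practice the suggested cutoff should still be checked against the tree diagram and against whether the resulting clusters can be labeled meaningfully.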