Example 5: Cluster Analysis in SANN
The
term cluster analysis (first used by Tryon, 1939) actually encompasses
a number of different classification algorithms that can be used to develop
taxonomies (typically as part of exploratory data analysis). For example,
biologists have to organize the different species of animals before a
meaningful description of the differences between animals is possible.
According to the modern system employed in biology, man belongs to the
primates, the mammals, the amniotes, the vertebrates, and the animals.
Note how in this classification, the higher the level of aggregation the
less similar are the members in the respective class. Man has more in
common with all other primates (e.g., apes) than he does with the more
"distant" members of the mammals (e.g., dogs), etc.
In STATISTICA
Automated Neural Networks (SANN) Cluster Analysis, Kohonen training
is used to determine the underlying clusters in the data. Kohonen training
is an algorithm that assigns cluster centers to a radial layer by iteratively
submitting training patterns to the network, and adjusting the winning
(nearest) radial unit center, and its neighbors, toward the training pattern
(Kohonen, 1982; Fausett, 1994; Haykin, 1994; Patterson, 1996). These Kohonen
networks are also known as self-organizing feature maps (SOFM).
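The training loop described above can be sketched in a few lines of Python. This is only an illustration of the general Kohonen update, not the SANN implementation: the function name train_kohonen, the learning-rate and neighborhood defaults, and the small random initial centers are all assumptions made for the sketch.

```python
import math
import random

def train_kohonen(patterns, height, width, epochs=100,
                  lr_start=0.1, lr_end=0.02,
                  nb_start=3.0, nb_end=0.0, seed=0):
    """Minimal Kohonen (SOFM) training sketch; all defaults are hypothetical."""
    rng = random.Random(seed)
    dim = len(patterns[0])
    # One center (weight vector) per unit of the height x width lattice.
    centers = [[[rng.gauss(0.0, 0.1) for _ in range(dim)]
                for _ in range(width)] for _ in range(height)]
    for epoch in range(epochs):
        frac = epoch / max(epochs - 1, 1)
        lr = lr_start + frac * (lr_end - lr_start)
        radius = round(nb_start + frac * (nb_end - nb_start))
        for x in patterns:
            # Winning unit: the unit whose center is nearest to the pattern.
            wr, wc = min(((r, c) for r in range(height) for c in range(width)),
                         key=lambda rc: math.dist(centers[rc[0]][rc[1]], x))
            # Move the winner and its (clipped) square neighborhood
            # toward the training pattern.
            for r in range(max(0, wr - radius), min(height, wr + radius + 1)):
                for c in range(max(0, wc - radius), min(width, wc + radius + 1)):
                    unit = centers[r][c]
                    for i in range(dim):
                        unit[i] += lr * (x[i] - unit[i])
    return centers
```

Calling train_kohonen(patterns, 3, 6) with a list of four-value cases returns a 3 x 6 grid of cluster centers. Because the centers start from random values, a different seed can give a different (but usually similar) map.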
Note:
The results shown in this example may be slightly different for your analysis
because the neural network algorithms use random number generators for
fixing the initial value of the weights (starting points) of the neural
networks, which often results in slightly different (local minima)
solutions each time you run the analysis. Also note that changing the
seed for the random number generator used to create the train, test, and
validation samples can change your results.
Data.
We will use the classic IRIS
data set. IrisSNN.sta contains
information about three different types of Iris flowers - Iris Versicol,
Iris Virginic, and Iris Setosa. The data set contains measurements
of four variables (sepal length and
width, and petal length and width).
The cases are arranged so that the first 50 cases belong to Setosa, while cases 51-100 belong to
Versicol, and the rest belong
to Virginic. In addition, the
data is well clustered with Setosa
being well separated from Versicol
and Virginic, while there is
a small amount of overlap between Versicol
and Virginic. This last property
makes the data set particularly suitable for cluster analysis.
Specifying
the analysis. Open the IrisSNN.sta
data file and start SANN:
Ribbon
bar. Select the Home tab.
In the File group, click the
Open arrow and select Open
Examples to display the Open
a STATISTICA Data File dialog box. Open the data file, which is
located in the Datasets folder.
Then, select the Statistics tab.
In the Advanced/Multivariate
group, click Neural Nets to display
the SANN
- New Analysis/Deployment Startup Panel.
Or, select the Data Mining tab.
In the Learning group, click
Neural Networks to display the
SANN - New Analysis/Deployment
Startup Panel.
Classic
menus. From the File
menu, select Open
Examples. In the Open
a STATISTICA Data File dialog,
double-click the Datasets folder,
and then double-click on IrisSNN.sta.
Then, from the Statistics menu
or the Data Mining menu, select
Automated Neural Networks to
display the SANN - New Analysis/Deployment Startup
Panel.
In the New
analysis list box, select Cluster
analysis, and then click the OK
button to display the SANN
- Data selection dialog box.
On the Quick tab, click the Variables
button to display a standard variable selection dialog box. In SANN,
variable selection is limited to the variable types required by the selected
analysis type. For Cluster Analysis,
two types of variables can be selected: Continuous
inputs (predictor) and Categorical
inputs (predictor).
In the Continuous
inputs (predictors) column, select variables 2-5.

Then click the OK
button to return to the SANN
- Data Selection dialog box.

Notice that for Cluster
Analysis, Custom neural networks
(CNN) is the only strategy available.
Select the Sampling (CNN and ANS) tab. The performance of a neural network is measured
by how well it generalizes to unseen data (i.e., how well it predicts
data that was not used during training). The issue of generalization is
actually one of the major concerns when training neural networks. When
the training data have been overfit, it is difficult for the network to
make accurate predictions using new data.
One way to combat this problem is to split
the data into two (or three) subsets: a training sample, a testing sample,
and a validation sample. These samples can then be used to 1) train the
network, 2) verify (or test) the performance of the training algorithms
as they run, and 3) perform a final validation test to determine how well
the network predicts "new" data.
In SANN,
the assignment of the cases to the subsets can be performed randomly or
based upon a special subset variable in the data set. For this example,
we will use the default settings, so click the OK
button to display the SANN - Custom Neural Network dialog box.
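As a rough illustration of random assignment to the three samples, consider the sketch below. The 70/15/15 proportions, the seed value, and the function name split_cases are assumptions made for this example; check the Sampling (CNN and ANS) tab for the settings actually in use.

```python
import random

def split_cases(n_cases, train=0.70, test=0.15, seed=0):
    """Randomly assign case numbers 1..n_cases to three samples.

    The proportions and the seed are hypothetical defaults; whatever is
    left after the train and test samples goes to validation.
    """
    ids = list(range(1, n_cases + 1))
    random.Random(seed).shuffle(ids)
    n_train = round(n_cases * train)
    n_test = round(n_cases * test)
    return {
        "train": sorted(ids[:n_train]),
        "test": sorted(ids[n_train:n_train + n_test]),
        "validation": sorted(ids[n_train + n_test:]),
    }

samples = split_cases(150)
for name, cases in samples.items():
    print(name, len(cases))
```

Changing the seed changes which cases land in each sample, which is one reason your results can differ from those shown in this example.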
Training
the network. For cluster analysis, there are three tabs in the
SANN - Custom Neural Network
dialog box: Quick (Kohonen), Kohonen Training, and Real
time training graph.
Select the Quick (Kohonen) tab. On this tab, you
can specify the dimensions of the topological map (output layer), which
is laid out as a rectangular lattice. The dimensions specified here will
be used in training the network and in subsequent graphs, e.g., the Kohonen
graph. For this example, set the Topological
height to 3 and the Topological width to 6.
Note that,
depending on your future analyses, you may need to change these
quantities. This is an additional decision that you need to make for cluster
analysis – determining the dimension of the topological map. For most
problems, determining the correct numbers can require a certain amount
of trial and error.

Now select the Kohonen Training tab. There are several
options on this tab, but we will only examine two of them.
Neighborhoods.
This is the "radius" of a square neighborhood centered on the
winning unit. For example, a neighborhood size of 2 specifies a 5x5 square.
If the winning node is placed near or on
the edge of the topological map, the neighborhood is clipped to the edge.
The neighborhood is scaled linearly from the Start
value to the End value given.
The neighborhood size is stored and scaled
as a real number. However, for the sake of determining neighbors, the
nearest integral value is taken. Thus, the actual neighborhood used decreases
in a number of discrete steps. It is not uncommon to observe a sudden
change in the performance of the algorithm as the neighborhood changes
size. The neighborhood is specified as a real number since this gives
you greater flexibility in determining when exactly the changes should
occur. For this example, we will leave the settings at default.
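The neighborhood rules above (linear scaling, rounding to the nearest integer, and clipping at the map edge) can be sketched as follows. The function names, the Start/End values of 3 and 0, and the choice to round halves upward are assumptions for illustration only.

```python
def neighborhood_size(start, end, epoch, n_epochs):
    """Real-valued neighborhood size, scaled linearly from start to end."""
    frac = epoch / max(n_epochs - 1, 1)
    return start + frac * (end - start)

def neighborhood_units(winner, size, height, width):
    """Units in the square neighborhood of `winner`, clipped to the map edges.

    Positions are 1-based (row, column) pairs.
    """
    radius = int(size + 0.5)  # nearest integer; rounding halves up is an assumption
    row, col = winner
    rows = range(max(1, row - radius), min(height, row + radius) + 1)
    cols = range(max(1, col - radius), min(width, col + radius) + 1)
    return [(r, c) for r in rows for c in cols]

# Size 2 around an interior unit of a 7 x 7 map gives the full 5 x 5 square.
print(len(neighborhood_units((4, 4), 2.0, 7, 7)))   # 25
# On a 3 x 6 map, the square around the edge unit (2, 6) is clipped.
print(len(neighborhood_units((2, 6), 2.0, 3, 6)))   # 9
# The integer radius falls in discrete steps as the real-valued size shrinks.
print([int(neighborhood_size(3.0, 0.0, e, 9) + 0.5) for e in range(9)])
# [3, 3, 2, 2, 2, 1, 1, 0, 0]
```

The last line shows why performance can change abruptly: the real-valued size shrinks smoothly, but the set of neighbors actually updated changes only when the rounded radius drops.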
Network
randomization.
Use the options in this group box to specify how the weights should be
initialized at the beginning of training. You can select Normal
randomization or Uniform randomization.
In addition to selecting a distribution, you must also specify the Mean\Min and Variance\Max
to use. You can change the default mean/min and variance/max settings,
but it is generally recommended that you set the mean/min to zero and
the variance/max to no more than 0.1. This will help the network to gradually
grow from its linear (small weight values) to nonlinear (large weight
values) mode for modeling the data as and when necessary during the training
process. For our example, we will leave these options at their
default.
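The two randomization options can be pictured with the sketch below. It assumes that the two fields are read as mean and variance for Normal randomization and as minimum and maximum for Uniform randomization; that reading, the function name init_weights, and the seed are assumptions, not documented SANN behavior.

```python
import random

def init_weights(n_units, n_inputs, distribution="normal",
                 mean_min=0.0, variance_max=0.1, seed=0):
    """Draw initial weights for n_units units with n_inputs weights each.

    Assumed reading of the two fields: (mean, variance) for "normal",
    (min, max) for "uniform".
    """
    rng = random.Random(seed)
    if distribution == "normal":
        sd = variance_max ** 0.5  # standard deviation from the variance
        draw = lambda: rng.gauss(mean_min, sd)
    elif distribution == "uniform":
        draw = lambda: rng.uniform(mean_min, variance_max)
    else:
        raise ValueError("distribution must be 'normal' or 'uniform'")
    return [[draw() for _ in range(n_inputs)] for _ in range(n_units)]

# 18 units (3 x 6 map), 4 inputs each, small uniform starting weights.
weights = init_weights(18, 4, distribution="uniform")
print(len(weights), len(weights[0]))   # 18 4
```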
The tab should look as shown below.

Reviewing
the results. Now click the Train
button. After training, the SANN
- Results dialog box is displayed.

The Kohonen
Results dialog box contains four tabs: Predictions
(Kohonen), Graphs, Kohonen graph, and Custom
predictions.
As with any SANN
analysis, and when applicable, you can generate results using either train,
test, validation, or all samples data. In this example we use the train
sample, but the steps equally apply to any sample. Note that you can specify
the sample type by selecting one or more check boxes in the Sample
group box of the Results dialog
box.
To start with, let's explore the options
on the Predictions (Kohonen)
tab. Via the Predictions button,
you can create a predictions spreadsheet for a specified sample. You can
also include or exclude various quantities in the spreadsheet such as
inputs, winning neuron position, and winning neuron activation.
For this example, select the Winning
neuron position check box and the Winning
neuron activation check box.
The Kohonen map has 18 (3 x 6) neurons.
When a data case is passed through a Kohonen network, the position of
the case (which lives in a k-dimensional
space, with k being the number
of the inputs of the network) is mapped onto a 2-dimensional lattice in
which the Kohonen neurons are arranged. For a particular data case, the
winning neuron is one that has the closest Euclidean distance to the data
case. Whether a winner or not, each neuron has a position and a unique
ID number. For this example (3 x 6), neuron 1 has ID (1, 1), neuron 6
is identified as (1, 6), and neuron 18 is identified as (3, 6), with 3 being
the height and 6 the width of the network lattice.
This information can be found in the Predictions spreadsheet (provided you
include the Winning neuron position
and activation). The spreadsheet
below shows, for example, that case 1 was closest to neuron (2, 6), which
has the smallest activation (shortest Euclidean distance) for
the data case.
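The numbering of neurons and the choice of a winner can be sketched as follows. The row-by-row numbering matches the IDs quoted above for neurons 1, 6, and 18; the function names and the two example centers are hypothetical and are not taken from the trained network.

```python
import math

def unit_position(unit_number, width):
    """Convert a 1-based neuron number to its 1-based (row, column) ID."""
    row, col = divmod(unit_number - 1, width)
    return (row + 1, col + 1)

def winning_unit(case, centers):
    """Return the position and activation of the nearest neuron.

    `centers` maps (row, column) to a center vector; the activation is the
    Euclidean distance between the case and that center.
    """
    activations = {pos: math.dist(case, center) for pos, center in centers.items()}
    winner = min(activations, key=activations.get)
    return winner, activations[winner]

print(unit_position(1, 6), unit_position(6, 6), unit_position(18, 6))
# (1, 1) (1, 6) (3, 6)

# Two made-up centers, for illustration only.
centers = {(1, 1): [5.0, 3.4, 1.5, 0.2], (1, 2): [6.0, 2.8, 4.5, 1.4]}
print(winning_unit([5.1, 3.5, 1.4, 0.2], centers)[0])   # (1, 1)
```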

Note that you can examine the prediction
of the Kohonen network on a case by case basis using the options in the
Single predictions group box.
Use the field to enter a data case of interest, say case 1. Then click
the Activations spreadsheet button
and the Activations histogram
button. Both give you the activation of each neuron with
respect to the data case.


You can also create a Frequencies spreadsheet
or histogram that you can use to see how many cases belong to a particular
neuron.

Interpretation
of the Kohonen graph. Perhaps the most useful examination of your
results for this type of cluster analysis can be made using the Kohonen
graph shown below (select the Kohonen
graph tab).

This Topological Map window presents various
pieces of information to help you make sense of the Kohonen (SOFM)
network. Each square on the tab represents a neuron in the topological
lattice. As you move the mouse over the Topological Map window, a ToolTip
will be displayed containing the position of the neuron and the number
of times it has been a winner (i.e., winning frequency). You can use the
frequency to observe where on the topological map clusters have formed.
The network is run on all cases in the training set (test, validation,
or all samples), and a count is made of how many times each unit wins
(i.e., is closest to the tested case). High win frequencies indicate the
centers of clusters on the topological map. Units with zero frequencies
aren't being used at all, and are generally regarded as an indication
that learning was not very successful (as the network isn't using all
the resources available to it). However, in this case there are so few
training cases that some unused units are inevitable.
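The win frequencies shown in the ToolTips amount to a simple count of winners per neuron. The sketch below assumes you already have one winning position per case; the function names and the short list of winners are made up for illustration.

```python
from collections import Counter

def win_frequencies(winners, height, width):
    """Count how often each (row, column) unit wins; returns a height x width grid."""
    counts = Counter(winners)
    return [[counts[(r, c)] for c in range(1, width + 1)]
            for r in range(1, height + 1)]

def unused_units(freq):
    """Positions of units that never won (zero frequency)."""
    return [(r + 1, c + 1)
            for r, row in enumerate(freq)
            for c, n in enumerate(row) if n == 0]

# Hypothetical winners for four cases on a 3 x 6 map.
winners = [(2, 6), (2, 6), (1, 6), (3, 3)]
freq = win_frequencies(winners, 3, 6)
for row in freq:
    print(row)
print(len(unused_units(freq)))   # 15
```

High counts mark cluster centers on the map, and the list returned by unused_units corresponds to the zero-frequency squares.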
To print the information displayed in the
Kohonen graph, click the Select all
button, and then click the Kohonen
3-d histogram button or spreadsheet button. You can also repeat this action
for any number of selected neurons (cells in the Kohonen graph). An example
for train and test samples (i.e., all samples) is shown below.


A careful examination of the spreadsheet
above indicates that most of the first 50 cases that belong to category
(class) Setosa actually belong
to a small number of neurons namely (1, 6), (2, 6) and (3, 6). This is
because Setosa is a well localized
category (i.e. its inputs are clustered in a relatively small volume of
the input space). This pattern, however, changes dramatically
beginning with case 51 and continuing through case 100, as these cases all
belong to category Versicol,
where the winning neurons are (1, 3), (1, 4), (2, 3), (2, 4), (3, 3),
(3, 4) and (3, 5). A similar examination for Virginic
shows that the winning neurons are (1, 1), (1, 2), (2, 1), (2, 2), (2,
3), (3, 1), (3, 2), (3, 3) and (3, 4).
Note that there are more winning neurons for Versicol and Virginic
than for Setosa. That is because
the inputs belonging to Setosa are more localized than
those belonging to Versicol
and Virginic. Also note that
there is some overlap between Versicol
and Virginic (they share a few
winning neurons), while none exists between them and Setosa.
This is because Versicol and
Virginic overlap somewhat
in the space of the input data.
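The overlap can be checked directly from the winning neurons listed above for each category. The sketch below uses those lists as given in this example; your own run may produce different neurons, and the function name shared_units is simply a label chosen here.

```python
from itertools import combinations

# Winning neurons per category, as listed in this example.
winning_units = {
    "Setosa": {(1, 6), (2, 6), (3, 6)},
    "Versicol": {(1, 3), (1, 4), (2, 3), (2, 4), (3, 3), (3, 4), (3, 5)},
    "Virginic": {(1, 1), (1, 2), (2, 1), (2, 2), (2, 3),
                 (3, 1), (3, 2), (3, 3), (3, 4)},
}

def shared_units(units_by_class):
    """Neurons that win for both categories, for every pair of categories."""
    return {(a, b): units_by_class[a] & units_by_class[b]
            for a, b in combinations(sorted(units_by_class), 2)}

for pair, shared in shared_units(winning_units).items():
    print(pair, sorted(shared))
# ('Setosa', 'Versicol') []
# ('Setosa', 'Virginic') []
# ('Versicol', 'Virginic') [(2, 3), (3, 3), (3, 4)]
```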