A Tutorial on Clustering Algorithms
Hierarchical Clustering Algorithms

How They Work
Given a set of N items to be clustered, and an N*N distance (or similarity)
matrix, the basic process of hierarchical clustering (defined by S.C.
Johnson in 1967) is this:

1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one cluster less.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N. (*)

Step 3 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering.
In single-linkage clustering (also called the connectedness
or minimum method), we consider the distance between one cluster and
another cluster to be equal to the shortest distance from any member
of one cluster to any member of the other cluster. If the data consist of similarities,
we consider the similarity between one cluster and another cluster to be equal
to the greatest similarity from any member of one cluster to any member
of the other cluster.
In complete-linkage clustering (also called the diameter or
maximum method), we consider the distance between one cluster and another
cluster to be equal to the greatest distance from any member of one cluster
to any member of the other cluster.
In average-linkage clustering, we consider the distance between one
cluster and another cluster to be equal to the average distance from
any member of one cluster to any member of the other cluster.
A variation on average-link clustering is the UCLUS method of R.
D'Andrade (1978), which uses the median distance; the median is much more
outlier-proof than the average distance.
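These linkage rules can be written down directly as functions of a pairwise distance d and two clusters A and B (an illustrative sketch; the function names are our own, not canonical):

```python
# The three linkage rules from the text, for clusters A and B (sets of items)
# and a pairwise distance function d.

def single_link(d, A, B):
    # shortest distance from any member of A to any member of B
    return min(d(a, b) for a in A for b in B)

def complete_link(d, A, B):
    # greatest distance from any member of A to any member of B
    return max(d(a, b) for a in A for b in B)

def average_link(d, A, B):
    # average distance over all cross-cluster pairs
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

d = lambda a, b: abs(a - b)        # toy distance on numbers
A, B = {0, 1}, {10, 20}
print(single_link(d, A, B), complete_link(d, A, B), average_link(d, A, B))
# → 9 20 14.5
```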
This kind of hierarchical clustering is called agglomerative because it merges clusters iteratively. There is also divisive hierarchical clustering, which does the reverse: it starts with all objects in one cluster and subdivides it into smaller pieces. Divisive methods are less commonly available and have rarely been applied.
(*) Of course there is no point in having all N items grouped in a single cluster but, once you have got the complete hierarchical tree, if you want k clusters you just have to cut the k-1 longest links.
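The footnote's observation can be made concrete (a small sketch under our own naming; merge_levels stands for the levels L(1), ..., L(n-1) produced by one run of the algorithm):

```python
def cut_level(merge_levels, k):
    """Level at or above which links must be cut to obtain k clusters."""
    if k <= 1:
        return float("inf")                # keep every merge: a single cluster
    # cutting the k-1 longest links = discarding the k-1 highest merge levels
    return sorted(merge_levels)[-(k - 1)]

# Merge levels from the Italian-cities example in this section:
levels = [138, 219, 255, 268, 295]
print(cut_level(levels, 3))  # → 268: cutting the merges at 268 and 295
                             # leaves the clusters BA/NA/RM, FI and MI/TO
```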
Single-Linkage Clustering: The Algorithm
Let’s now take a deeper look at how Johnson’s algorithm works in
the case of single-linkage clustering.
The algorithm is an agglomerative scheme that erases rows and columns in the
proximity matrix as old clusters are merged into new ones.
The N*N proximity matrix is D = [d(i,j)]. The clusterings are assigned sequence numbers 0, 1, ..., (n-1), and L(k) is the level of the k-th clustering. A cluster with sequence number m is denoted (m), and the proximity between clusters (r) and (s) is denoted d[(r),(s)].
The algorithm is composed of the following steps:

1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s), according to d[(r),(s)] = min d[(i),(j)], where the minimum is over all pairs of clusters in the current clustering.
3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r),(s)].
4. Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]).
5. If all objects are in one cluster, stop. Else, go to step 2.

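The steps above can be sketched in Python (a minimal, illustrative implementation; the dict-of-dicts matrix representation and the "/"-joined cluster labels are our own choices, not part of Johnson's description):

```python
# A minimal sketch of Johnson's algorithm with the single-link rule.

def single_linkage(dist):
    """dist: {label: {other_label: distance}}, symmetric, no self-entries.
    Returns one (level, merged_label) pair per merge."""
    d = {a: dict(row) for a, row in dist.items()}      # step 1: disjoint clustering, m = 0
    merges = []
    while len(d) > 1:
        # step 2: least dissimilar pair of clusters (r), (s)
        r, s = min(((a, b) for a in d for b in d[a]), key=lambda p: d[p[0]][p[1]])
        level = d[r][s]                                # step 3: L(m) = d[(r),(s)]
        rs = r + "/" + s
        # step 4: single-link update d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)])
        new_row = {k: min(d[r][k], d[s][k]) for k in d if k not in (r, s)}
        del d[r], d[s]
        for k, dk in new_row.items():
            del d[k][r], d[k][s]
            d[k][rs] = dk
        d[rs] = new_row
        merges.append((level, rs))
    return merges                                      # step 5: done when one cluster remains

# Tiny illustration with three items:
merges = single_linkage({"a": {"b": 2, "c": 9},
                         "b": {"a": 2, "c": 5},
                         "c": {"a": 9, "b": 5}})
print(merges)  # first merge at level 2, final merge at level 5
```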
An Example
Let’s now see a simple example: a hierarchical clustering of distances
in kilometers between some Italian cities. The method used is single-linkage.
Input distance matrix (L = 0 for all the clusters):
      BA   FI   MI   NA   RM   TO
BA     0  662  877  255  412  996
FI   662    0  295  468  268  400
MI   877  295    0  754  564  138
NA   255  468  754    0  219  869
RM   412  268  564  219    0  669
TO   996  400  138  869  669    0
The nearest pair
of cities is MI and TO, at distance 138. These are merged into a single cluster
called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and
the new sequence number is m = 1.
Then we compute the distance
from this new compound object to all other objects. In single-link clustering
the rule is that the distance from the compound object to another object is
equal to the shortest distance from any member of the cluster to the outside
object. So the distance from "MI/TO" to RM is chosen to be 564, which
is the distance from MI to RM, and so on.
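This update can be written out for the MI/TO merge (a small sketch; the variable names are ours, the values come from the matrix above):

```python
# Distances from MI and from TO to the remaining cities (values from the text).
d_MI = {"BA": 877, "FI": 295, "NA": 754, "RM": 564}
d_TO = {"BA": 996, "FI": 400, "NA": 869, "RM": 669}

# Single-link update: the compound cluster MI/TO is as close to an outside
# city as its closest member is.
d_MITO = {city: min(d_MI[city], d_TO[city]) for city in d_MI}
print(d_MITO)  # → {'BA': 877, 'FI': 295, 'NA': 754, 'RM': 564}
```

This reproduces the MI/TO row of the matrix shown next.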
After merging MI with TO we obtain the following matrix:
         BA     FI  MI/TO     NA     RM
BA        0    662    877    255    412
FI      662      0    295    468    268
MI/TO   877    295      0    754    564
NA      255    468    754      0    219
RM      412    268    564    219      0
min d(i,j) = d(NA,RM)
= 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219
m = 2
         BA     FI  MI/TO  NA/RM
BA        0    662    877    255
FI      662      0    295    268
MI/TO   877    295      0    564
NA/RM   255    268    564      0
min d(i,j) = d(BA,NA/RM) = 255 =>
merge BA and NA/RM into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255
m = 3
           BA/NA/RM     FI  MI/TO
BA/NA/RM          0    268    564
FI              268      0    295
MI/TO           564    295      0
min d(i,j) = d(BA/NA/RM,FI) = 268
=> merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m = 4
              BA/FI/NA/RM  MI/TO
BA/FI/NA/RM             0    295
MI/TO                 295      0
Finally, we merge the last two clusters at level 295.
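The whole sequence of merges can be replayed programmatically (an illustrative sketch; the index-based cluster representation is our own choice, and cluster indices follow the order of the cities list):

```python
# Replay the example: repeatedly merge the closest pair of clusters under
# the single-link rule and record the merge levels.
cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = [[  0, 662, 877, 255, 412, 996],
     [662,   0, 295, 468, 268, 400],
     [877, 295,   0, 754, 564, 138],
     [255, 468, 754,   0, 219, 869],
     [412, 268, 564, 219,   0, 669],
     [996, 400, 138, 869, 669,   0]]

clusters = [{i} for i in range(len(cities))]   # start: one city per cluster
levels = []
while len(clusters) > 1:
    # closest pair of clusters under the single-link rule
    a, b = min(((p, q) for p in range(len(clusters))
                       for q in range(p + 1, len(clusters))),
               key=lambda pq: min(D[i][j] for i in clusters[pq[0]]
                                          for j in clusters[pq[1]]))
    levels.append(min(D[i][j] for i in clusters[a] for j in clusters[b]))
    clusters[a] |= clusters[b]                 # merge cluster b into cluster a
    del clusters[b]

print(levels)  # → [138, 219, 255, 268, 295]
```

The printed levels match the five clusterings built step by step above.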
The process is summarized by the resulting hierarchical tree (dendrogram): MI and TO join at level 138, NA and RM at level 219, BA joins NA/RM at level 255, FI joins at level 268, and the final merge takes place at level 295.
Problems
The main weaknesses of agglomerative clustering methods are:

- they do not scale well: the time complexity is at least O(n^2), where n is the total number of objects;
- they can never undo what was done previously: once two clusters have been merged, the decision cannot be revisited at a later step.
Bibliography

- S.C. Johnson (1967): "Hierarchical Clustering Schemes", Psychometrika.
- R. D'Andrade (1978): "U-Statistic Hierarchical Clustering", Psychometrika.
- Hierarchical clustering interactive demo