CLUSTER_NUMBER Function

PV-WAVE Advantage > IMSL Statistics Reference Guide > Multivariate Analysis > CLUSTER_NUMBER Function

Computes cluster membership for a hierarchical cluster tree.

Usage

result = CLUSTER_NUMBER(clson, crson, k)

Input Parameters

clson—Vector of length npt – 1 containing the left son cluster numbers, where npt is the number of data points to be clustered. Cluster npt + i is formed by merging clusters clson(i – 1) and crson(i – 1).

crson—Vector of length npt – 1 containing the right son cluster numbers. Cluster npt + i is formed by merging clusters clson(i – 1) and crson(i – 1).

k—Desired number of clusters.

Returned Value

result—Long vector of length npt containing the cluster membership of each observation.

Output Keywords

Obs_per_cluster—Array of length K containing the number of observations in each cluster.

Discussion

Given a fixed number of clusters (K) and the cluster tree (vectors crson and clson) produced by the hierarchical clustering algorithm (see CLUSTER_HIERARCHICAL Procedure) CLUSTER_NUMBER determines the cluster membership of each observation. It first determines the root nodes for the K distinct subtrees forming the K clusters and then traverses each subtree to determine the cluster membership of each observation. CLUSTER_NUMBER also returns the number of observations found in each cluster.

Example 1

In the following example, cluster membership for K = 2 clusters is found for the displayed cluster tree. The output vector iclus contains the cluster numbers for each observation.

; Set up the input data

k = 2

iclson = [5, 6, 4, 7]

icrson = [3, 1, 2, 8]

; Call the CLUSTER_NUMBER routine

iclus = CLUSTER_NUMBER(iclson, icrson, k, Obs_per_cluster=nclus)

PRINT,""

PRINT,"         OUTPUT"

PRINT,"       ----------"

PRINT,""

PRINT,"          iclus"

PRINT,iclus, Format="(5I5)"

Output

         OUTPUT

       ----------

          iclus

   1   2   1   2   1

Example 2

This example illustrates the typical usage of CLUSTER_NUMBER. The Fisher iris data is clustered. First the distance between the irises are computed using DISSIMILARITIES. The resulting distance matrix is then clustered using CLUSTER_HIERARCHICAL. The cluster membership for 5 clusters is then obtained via CLUSTER_NUMBER using the output from CLUSTER_HIERARCHICAL. The need for 5 clusters can be obtained either by theoretical means or by examining a cluster tree. The cluster membership for each of the iris observations is printed.

data = STATDATA(3)

k = 5

dist = DISSIMILARITIES(data,Index=[1,2,3,4])

RANDOMOPT,Set=4

FOR i=0L, 149 DO  BEGIN & $

  FOR j=i+1, 149 DO BEGIN & $

    r = RANDOM(1, /Uniform) & $

    dist(i,j) = MAX([0.0, dist(i,j) + .001 * r]) & $

    dist(j,i) = dist(i,j) & $

  ENDFOR & $

  dist(i,i) = 0. & $

ENDFOR

CLUSTER_HIERARCHICAL, dist, Clevel=clevel, Clson=clson, $

   Crson=crson

iclus = CLUSTER_NUMBER(clson, crson, k, Obs_per_cluster=nclus)

PRINT,""

PRINT,"     OUTPUT"

PRINT,"   ----------"

PRINT,""

PRINT,"      ICLUS"

PRINT,iclus,Format='(5I3)'

PRINT,""

PRINT,"          NCLUS"

PRINT,nclus,Format='(5I4)'

Output

     OUTPUT

   ----------

      ICLUS

  5  5  5  5  5

  5  5  5  5  5

  5  5  5  5  5

  5  5  5  5  5

  5  5  5  5  5

  5  5  5  5  5

  5  5  5  5  5

  5  5  5  5  5

  5  5  5  5  5

  5  5  5  5  5

  2  2  2  2  2

  2  2  1  2  2

  1  2  2  2  2

  2  2  2  2  2

  2  2  2  2  2

  2  2  2  2  2

  2  2  2  2  2

  2  2  2  2  2

  2  2  2  1  2

  2  2  2  1  2

  2  2  2  2  2

  2  3  2  2  2

  2  2  2  2  2

  2  2  4  2  2

  2  2  2  2  2

  2  2  2  2  2

  2  4  2  2  2

  2  2  2  2  2

  2  2  2  2  2

  2  2  2  2  2

          NCLUS

   4  93   1   2  50