DISCR_ANALYSIS Procedure
Performs a linear or a quadratic discriminant function analysis among several known groups.
Usage
DISCR_ANALYSIS, x, n_groups
Input Parameters
x—Two-dimensional array of size n_rows by n_variables + 1 containing the data where n_rows = N_ELEMENTS(x(*,0)), the number of rows to be processed and n_variables = number of variables to be used in the discrimination. The first n_variables columns correspond to the variables, and the last column contains the group numbers. The groups must be numbered 1, 2, ..., n_groups.
n_groups—Number of groups in the data.
Input Keywords
Double—If present and nonzero, double precision is used.
Idx_Cols—One-dimensional array containing the indices of the variables to be used in the analysis.
Idx_Vars—Three-element array indicating the column numbers of x in which particular types of data are stored. Columns are numbered 0 ... N_ELEMENTS(Idx_Cols) - 1.
Idx_Vars(0) contains the index for the column of x in which the group numbers are stored.
Idx_Vars(1) and Idx_Vars(2) contain the column numbers of x in which the frequencies and weights, respectively, are stored. Set Idx_Vars(1) = -1 if there will be no column for frequencies. Set Idx_Vars(2) = -1 if there will be no column for weights. Weights are rounded to the nearest integer. Negative weights are not allowed.
Defaults: Idx_Cols = 0, 1, ..., n_variables – 1,
Idx_Vars(0) = n_variables,
Idx_Vars(1) = -1, and
Idx_Vars(2) = -1
Method—Method of discrimination. The method chosen determines whether linear or quadratic discrimination is used, whether the group covariance matrices are computed (the pooled covariance matrix is always computed), and whether the leaving-out-one or the reclassification method is used to classify each observation. The Method values are shown in Table 10-2: Method Values.
 
Method Values
Method  Discrimination method  Covariances computed  Classification method
1       linear                 pooled, group         reclassification
2       quadratic              pooled, group         reclassification
3       linear                 pooled                reclassification
4       linear                 pooled, group         leaving-out-one
5       quadratic              pooled, group         leaving-out-one
6       linear                 pooled                leaving-out-one
In the leaving-out-one method of classification, the posterior probabilities are adjusted so as to eliminate the effect of the observation from the sample statistics prior to its classification. In the reclassification method, the effect of the observation is not eliminated from the classification function. Default: Method = 1
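The six Method values encode three independent choices. As a rough illustration, the mapping in the Method Values table can be written as a small lookup. This is a Python sketch for checking the logic only; `describe_method` is a hypothetical helper and not part of PV-WAVE.

```python
# Hypothetical helper (not a PV-WAVE routine): decode a Method value into
# the three properties listed in the Method Values table.
def describe_method(method):
    if method not in (1, 2, 3, 4, 5, 6):
        raise ValueError("Method must be an integer from 1 to 6")
    return {
        # Methods 2 and 5 are quadratic; all others are linear.
        "discrimination": "quadratic" if method in (2, 5) else "linear",
        # Methods 3 and 6 compute only the pooled covariance matrix.
        "covariances": "pooled" if method in (3, 6) else "pooled, group",
        # Methods 4, 5, and 6 use leaving-out-one classification.
        "classification": "leaving-out-one" if method >= 4 else "reclassification",
    }


for m in range(1, 7):
    print(m, describe_method(m))
```

Read this way, Method = 6 is the cheapest leaving-out-one choice, because it is linear and skips the group covariance matrices.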
Prior_Equal—By default (or if Prior_Equal is used), equal prior probabilities are calculated as 1.0/n_groups. Keywords Prior_Equal, Prior_Prop, and Prior_Input must not be used together.
Prior_Prop—If present, prior probabilities are calculated to be proportional to the sample size in each group. Keywords Prior_Prop, Prior_Equal, and Prior_Input must not be used together.
Prior_Input—If present, an array of length n_groups containing the prior probabilities for each group, such that the sum of all prior probabilities is equal to 1.0. Keywords Prior_Input, Prior_Equal, and Prior_Prop must not be used together.
Output Keywords
Prior_Output—Named variable into which a one-dimensional array of length n_groups containing the most recently calculated or input prior probabilities is stored.
Group_Counts—Named variable into which a one-dimensional integer array of length n_groups containing the number of observations in each group is stored.
Means—Named variable into which a two-dimensional array of size n_groups by n_variables containing the variable means is stored. The ith row of means contains the group i variable means.
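Covariances—Named variable into which a three-dimensional array of size g by n_variables by n_variables containing covariance results is stored. The within-group covariance matrices (Methods 1, 2, 4, and 5 only) are the first g - 1 matrices, and the pooled covariance matrix is the g-th matrix. Because there is one within-group matrix per group, g = n_groups + 1 for Methods 1, 2, 4, and 5; for Methods 3 and 6 only the pooled matrix is stored, so g = 1.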
Coefficients—Named variable into which a two-dimensional array of size n_groups by (n_variables + 1) containing the linear discriminant coefficients is stored. The first column of Coefficients contains the constant term, and the remaining columns contain the variable coefficients. Row i – 1 of Coefficients corresponds to group i, for i = 1, 2, ..., n_groups. Coefficients is always computed as the linear discriminant function coefficients, even when quadratic discrimination is specified.
Class_Member—Named variable into which a one-dimensional integer array of length n_rows containing the group to which each observation was classified is stored.
If an observation has an invalid group number, frequency, or weight when the leaving-out-one method has been specified, then the observation is not classified and the corresponding elements of Class_Member (and Prob, see Prob below) are set to zero.
Class_Table—Named variable into which a two-dimensional array of size n_groups by n_groups containing the classification table is stored. Each observation that is classified and has a group number 1.0, 2.0, ..., n_groups is entered into the table. The rows of the table correspond to the known group membership. The columns refer to the group to which the observation was classified.
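To make the row and column convention concrete, the following Python sketch tallies a classification table from known and assigned group numbers and derives a misclassification rate from it. It is an illustration only, not PV-WAVE code; the function names and the six-row sample are hypothetical, while the 3 by 3 table is the one printed in the Example section.

```python
# Hypothetical helpers (not PV-WAVE routines) illustrating Class_Table.
def classification_table(known, assigned, n_groups):
    # Rows: known group membership. Columns: group assigned by the classifier.
    table = [[0] * n_groups for _ in range(n_groups)]
    for g, c in zip(known, assigned):
        # Rows with an out-of-range group number, or left unclassified (0),
        # are not entered into the table.
        if 1 <= g <= n_groups and 1 <= c <= n_groups:
            table[g - 1][c - 1] += 1
    return table


def misclassification_rate(table):
    total = sum(sum(row) for row in table)
    correct = sum(table[i][i] for i in range(len(table)))
    return (total - correct) / total


# Made-up data: the last row has an invalid group number (7) and is skipped.
print(classification_table([1, 1, 2, 2, 3, 7], [1, 2, 2, 2, 3, 1], 3))

# Table from the Example section: 3 of 150 observations are off-diagonal.
iris_table = [[50, 0, 0], [0, 48, 2], [0, 1, 49]]
print(misclassification_rate(iris_table))
```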
Prob—Named variable into which a two-dimensional array of size n_rows by n_groups containing the posterior probabilities for each observation is stored.
Mahalanobis—Named variable into which a two-dimensional array of size n_groups by n_groups containing the Mahalanobis distances $D_{ij}^2$ between the group means is stored.
For linear discrimination, the Mahalanobis distance is computed using the pooled covariance matrix. Otherwise, the Mahalanobis distance $D_{ij}^2$ between group means i and j is computed using the within covariance matrix for group i in place of the pooled covariance matrix.
Stats—Named variable into which a one-dimensional array of length 4 + 2 * (n_groups + 1) containing various statistics of interest is stored. The first element of Stats is the sum of the degrees of freedom for the within-covariance matrices. The second, third, and fourth elements of Stats correspond to the chi-squared statistic, its degrees of freedom, and the probability of a greater chi-squared, respectively, of a test of the homogeneity of the within-covariance matrices (not computed if Method is equal to 3 or 6). The fifth through 5 + n_groups elements of Stats contain the log of the determinants of each group’s covariance matrix (not computed if Method is equal to 3 or 6) and of the pooled covariance matrix (element 4 + n_groups). Finally, the last n_groups + 1 elements of Stats contain the sum of the weights within each group, and in the last position, the sum of the weights in all groups.
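The layout of Stats is easier to follow with zero-based indices written out. The Python sketch below splits a Stats vector into named pieces, using the values printed in the Example section (Method = 3, so the chi-squared test and group determinants are NaN). The helper name `unpack_stats` and the dictionary keys are hypothetical.

```python
import math


# Hypothetical helper (not a PV-WAVE routine): split Stats into named parts.
def unpack_stats(stats, n_groups):
    expected = 4 + 2 * (n_groups + 1)
    if len(stats) != expected:
        raise ValueError("Stats must have length %d" % expected)
    g = n_groups
    return {
        "df_within": stats[0],                 # sum of within-covariance d.f.
        "chi_squared": stats[1],               # homogeneity test statistic
        "chi_squared_df": stats[2],
        "p_value": stats[3],
        "log_det_groups": stats[4:4 + g],      # one log-determinant per group
        "log_det_pooled": stats[4 + g],        # element 4 + n_groups
        "weight_sums": stats[5 + g:5 + 2 * g],  # sum of weights per group
        "weight_total": stats[5 + 2 * g],      # sum of weights, all groups
    }


nan = float("nan")
# Values copied from the Stats listing in the Example section.
stats = [147.0, nan, nan, nan, nan, nan, nan, -9.95854, 50.0, 50.0, 50.0, 150.0]
parts = unpack_stats(stats, 3)
print(parts["log_det_pooled"], parts["weight_sums"], parts["weight_total"])
print("chi-squared computed:", not math.isnan(parts["chi_squared"]))
```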
Nmissing—Named variable into which the number of rows of data encountered containing missing values (NaN) for the classification, group, weight, and/or frequency variables is stored. If a row of data contains a missing value (NaN) for any of these variables, that row is excluded from the computations.
Comments
1. Common choices for the Bayesian prior probabilities are given by:
Prior_Input(i) = 1.0/n_groups (equal priors)
Prior_Input(i) = Group_Counts(i)/n_rows (proportional priors)
Prior_Input(i) = Past history or subjective judgment.
In all cases, the priors should sum to 1.0.
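The first two choices can be checked with a few lines of arithmetic. This Python sketch is illustrative only; the function names are hypothetical, and the proportional version divides by the sum of the group counts (an assumption that every row has a valid group number, so that this sum equals n_rows).

```python
# Hypothetical helpers (not PV-WAVE routines) for the two standard priors.
def equal_priors(n_groups):
    return [1.0 / n_groups] * n_groups


def proportional_priors(group_counts):
    # Assumes all rows are valid, so sum(group_counts) plays the role of n_rows.
    n_rows = sum(group_counts)
    return [count / n_rows for count in group_counts]


print(equal_priors(3))
print(proportional_priors([10, 30, 60]))
# With balanced groups, as in the iris data (50, 50, 50), both choices agree.
print(proportional_priors([50, 50, 50]) == equal_priors(3))
```

Whichever rule is used, it is worth confirming that the values sum to 1.0 before passing them through Prior_Input.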
Discussion
DISCR_ANALYSIS performs discriminant function analysis using either linear or quadratic discrimination. The output includes a measure of distance between the groups, a table summarizing the classification results, a matrix containing the posterior probabilities of group membership for each observation, and the within-sample means and covariance matrices. Linear discriminant function coefficients are also computed.
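Covariance matrices are defined as follows: Let $N_i$ denote the sum of the frequencies of the observations in group i and $M_i$ denote the number of observations in group i. Then, if $S_i$ denotes the within-group i covariance matrix:
$$S_i = \frac{1}{N_i - 1} \sum_{j=1}^{M_i} w_j f_j \,(x_j - \bar{x}_i)(x_j - \bar{x}_i)^T$$
where $w_j$ is the weight of the jth observation in group i, $f_j$ is the frequency, $x_j$ is the jth observation column vector (in group i), and $\bar{x}_i$ denotes the mean vector of the observations in group i. The mean vectors are computed as:
$$\bar{x}_i = \frac{1}{W_i} \sum_{j=1}^{M_i} w_j f_j \, x_j, \qquad W_i = \sum_{j=1}^{M_i} w_j f_j$$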
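Given the means and the covariance matrices, the linear discriminant function for group i is computed as:
$$z_i = \ln(p_i) - 0.5\, \bar{x}_i^T S_p^{-1} \bar{x}_i + x^T S_p^{-1} \bar{x}_i$$
where $\ln(p_i)$ is the natural log of the prior probability for the ith group, x is the observation to classify, $\bar{x}_i$ is the mean vector of group i, and $S_p$ denotes the pooled covariance matrix.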
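A small numeric check may help connect the linear discriminant function to the Coefficients array (constant term first, then one coefficient per variable). The Python sketch below assumes the usual form: constant ln(p_i) - 0.5 m_i' Sp^-1 m_i and variable coefficients Sp^-1 m_i, where m_i is the group mean. All names and the two-variable data are hypothetical, and the 2 by 2 inverse is written out by hand to keep the sketch self-contained.

```python
import math


def inv2(s):
    # Inverse of a 2 x 2 matrix given as [[a, b], [c, d]].
    (a, b), (c, d) = s
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_coefficients(mean, pooled_cov, prior):
    # Variable coefficients: Sp^-1 * mean. Constant: ln(prior) - 0.5 * mean' Sp^-1 mean.
    w = [dot(row, mean) for row in inv2(pooled_cov)]
    const = math.log(prior) - 0.5 * dot(mean, w)
    return [const] + w


def score(coef, x):
    # Linear discriminant score: constant plus coefficients times x.
    return coef[0] + dot(coef[1:], x)


pooled = [[2.0, 1.0], [1.0, 2.0]]            # made-up pooled covariance
coefs = [linear_coefficients([0.0, 0.0], pooled, 0.5),
         linear_coefficients([3.0, 3.0], pooled, 0.5)]
for x in ([1.0, 1.0], [2.0, 2.0]):
    scores = [score(c, x) for c in coefs]
    print(x, "-> group", scores.index(max(scores)) + 1)
```

With equal priors and a shared covariance matrix, the decision boundary falls halfway between the two means, so the first point goes to group 1 and the second to group 2.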
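Let S denote either the pooled covariance matrix or one of the within-group covariance matrices $S_i$. (S will be the pooled covariance matrix in linear discrimination, and $S_i$ otherwise.) The Mahalanobis distance between group i and group j is computed as:
$$D_{ij}^2 = (\bar{x}_i - \bar{x}_j)^T S^{-1} (\bar{x}_i - \bar{x}_j)$$
where $\bar{x}_i$ and $\bar{x}_j$ are the mean vectors of groups i and j.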
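Finally, the asymptotic chi-squared test for the equality of covariance matrices is computed as follows (Morrison 1976, p. 252):
$$\gamma = C^{-1} \sum_{i=1}^{k} n_i \left\{ \ln(|S_p|) - \ln(|S_i|) \right\}$$
where $n_i$ is the number of degrees of freedom in the ith sample covariance matrix $S_i$, $S_p$ is the pooled covariance matrix, k is the number of groups, and:
$$C^{-1} = 1 - \frac{2p^2 + 3p - 1}{6(p + 1)(k - 1)} \left( \sum_{i=1}^{k} \frac{1}{n_i} - \frac{1}{\sum_{j=1}^{k} n_j} \right)$$
where p is the number of variables.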
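The homogeneity statistic can be reproduced from the log-determinants and degrees of freedom reported in Stats. The Python sketch below assumes the standard Box-type form: a correction factor times the sum of n_i (ln|Sp| - ln|S_i|), with (k - 1) p (p + 1) / 2 degrees of freedom. The function name and the input numbers are hypothetical, not taken from any PV-WAVE run.

```python
# Hypothetical helper (not a PV-WAVE routine): chi-squared test statistic for
# equality of covariance matrices, computed from log-determinants.
def homogeneity_chi_squared(log_det_groups, log_det_pooled, dfs, p):
    k = len(dfs)
    # Small-sample correction factor (the C^-1 term).
    c_inv = 1.0 - (2 * p * p + 3 * p - 1) / (6.0 * (p + 1) * (k - 1)) * (
        sum(1.0 / n for n in dfs) - 1.0 / sum(dfs))
    # Uncorrected statistic: sum of n_i * (ln|Sp| - ln|S_i|).
    m = sum(n * (log_det_pooled - ld) for n, ld in zip(dfs, log_det_groups))
    # Standard degrees of freedom for this test (assumed, see lead-in).
    df = (k - 1) * p * (p + 1) // 2
    return c_inv * m, df


# Made-up inputs: three groups, four variables, 49 degrees of freedom each.
chi2, df = homogeneity_chi_squared([-11.0, -10.5, -10.0], -10.0, [49, 49, 49], 4)
print(round(chi2, 4), df)
```

When every group log-determinant equals the pooled one, the statistic is zero, which is a quick sanity check on the sign convention.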
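The estimated posterior probability of each observation x belonging to a group is computed using the prior probabilities and the sample mean vectors and estimated covariance matrices under a multivariate normal assumption. Under quadratic discrimination, the within-group covariance matrices are used to compute the estimated posterior probabilities. The estimated posterior probability of an observation x belonging to group i is:
$$\hat{q}_i(x) = \frac{e^{-\frac{1}{2} D_i^2(x)}}{\sum_{j=1}^{k} e^{-\frac{1}{2} D_j^2(x)}}$$
where:
$$D_i^2(x) = \begin{cases} (x - \bar{x}_i)^T S_i^{-1} (x - \bar{x}_i) + \ln|S_i| - 2\ln(p_i) & \text{quadratic discrimination} \\ (x - \bar{x}_i)^T S_p^{-1} (x - \bar{x}_i) - 2\ln(p_i) & \text{linear discrimination} \end{cases}$$
Here $\bar{x}_i$ is the mean vector of group i, $S_i$ its within-group covariance matrix, $S_p$ the pooled covariance matrix, $p_i$ the prior probability, and k the number of groups.
For the leaving-out-one method of classification (Method equal to 4, 5, or 6), the sample mean vector and sample covariance matrices in the formula for $D_i^2(x)$ are adjusted so as to remove the observation x from their computation. For linear discrimination (Method equal to 1, 3, 4, or 6), the linear discriminant function coefficients are actually used to compute the same posterior probabilities.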
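Once a generalized squared distance from the observation to each group is available, the posterior step itself is short. The Python sketch below assumes the usual form in which the posterior for group i is proportional to exp(-D_i^2 / 2), where D_i^2 already includes the prior (and, for quadratic discrimination, the log-determinant) term. Function names and distance values are hypothetical.

```python
import math


# Hypothetical helpers (not PV-WAVE routines).
def posterior_probabilities(d2):
    # Subtracting the smallest distance leaves the ratios unchanged but
    # avoids underflow when all distances are large.
    d_min = min(d2)
    weights = [math.exp(-0.5 * (d - d_min)) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]


def classify(d2):
    # Assign the group (numbered from 1) with the largest posterior.
    q = posterior_probabilities(d2)
    return q.index(max(q)) + 1


d2 = [2.0, 4.0, 10.0]          # made-up distances to three groups
print(posterior_probabilities(d2))
print("classified into group", classify(d2))
```

The group with the smallest distance always receives the largest posterior, so classification by maximum posterior and by minimum generalized distance agree.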
Using the posterior probabilities, each observation in x is classified into a group; the result is tabulated in the array Class_Table and saved in the array Class_Member. Array Class_Table is not altered at this stage if x(i, Idx_Vars(0)) contains a group number that is out of range. If the reclassification method is specified, then all observations with no missing values in the n_variables classification variables are classified. When the leaving-out-one method is used, observations with invalid group numbers, weights, frequencies, or classification variables are not classified. Regardless of the frequency, a 1 is added to (or subtracted from) Class_Table for each row of x that is classified and contains a valid group number.
When Method > 3, an adjustment is made to the posterior probabilities to remove the effect of the observation in the classification rule. In this adjustment, each observation is presumed to have a weight of x(i, Idx_Vars(2)) if Idx_Vars(2) > -1 (and a weight of 1.0 if Idx_Vars(2) = -1), and a frequency of 1.0. See Lachenbruch (1975, p. 36) for the required adjustment.
The covariance matrices are computed from their LU factorizations.
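The role of weights and frequencies in the within-group statistics defined earlier in this Discussion can be checked on a tiny data set. The Python sketch below assumes the usual weighted definitions: the mean is the sum of w_j f_j x_j divided by the sum of w_j f_j, and the covariance divides the weighted cross-products by N_i - 1, where N_i is the sum of frequencies. It computes the matrices directly rather than through any factorization, and all names and data are hypothetical.

```python
# Hypothetical helpers (not PV-WAVE routines) for one group's statistics.
def weighted_mean(rows, weights, freqs):
    total = sum(w * f for w, f in zip(weights, freqs))
    p = len(rows[0])
    return [sum(w * f * r[v] for r, w, f in zip(rows, weights, freqs)) / total
            for v in range(p)]


def within_group_covariance(rows, weights, freqs):
    mean = weighted_mean(rows, weights, freqs)
    n_i = sum(freqs)                     # sum of frequencies in the group
    p = len(mean)
    cov = [[0.0] * p for _ in range(p)]
    for r, w, f in zip(rows, weights, freqs):
        d = [r[v] - mean[v] for v in range(p)]
        for a in range(p):
            for b in range(p):
                cov[a][b] += w * f * d[a] * d[b]
    return [[c / (n_i - 1) for c in row] for row in cov]


rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 9.0]]   # made-up observations
ones = [1.0, 1.0, 1.0]
print(weighted_mean(rows, ones, ones))
print(within_group_covariance(rows, ones, ones))
```

Under these definitions a row with frequency 2 contributes exactly as two identical rows with frequency 1, which is the property the tests for this sketch rely on.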
Example
The following example uses linear discrimination with equal prior probabilities on Fisher’s (1936) iris data.
 
NOTE: To run this example, use the .RUN command and then copy and paste this example into PV-WAVE.
PRO print_results, counts, table, d2, prior_out, coef, means, $
   cov, stats, nrmiss
   num  =  INDGEN(3)
   PRINT, '      Counts'
   PRINT, num + 1, Format = '(3I5)'
   PRINT, counts, Format = '(3I5)'
   PRINT
   PRINT, '        Table'
   PRINT, num + 1, Format = '(2X, 3I5)'
   FOR i=0L, 2 DO $
      PRINT, num(i) + 1, table(i, *), Format = '(I2, 3I5)'
   PRINT
   PRINT, '           D2'
   PRINT, num + 1, Format = '(3I7)'
   FOR i=0L, 2 DO $
      PRINT, num(i) + 1, d2(i, *), Format = '(I2, 3F7.1)'
   PRINT
   PRINT, '          Prior OUT'
   PRINT, num + 1, Format = '(3I10)'
   PRINT, prior_out, Format = '(3F10.4)'
   PRINT
   num  =  INDGEN(5)
   PRINT, '                         Coef'
   PRINT, num + 1, Format = '(1X, 5I10)'
   FOR i=0L, 2 DO $
      PRINT, num(i) + 1, coef(i, *), Format = '(I2, 5F10.1)'
   PRINT
   num  =  INDGEN(4)
   PRINT, '                  Means'
   PRINT, num + 1, Format = '(4I10)'
   FOR i=0L, 2 DO $
      PRINT, num(i) + 1, means(i, *), Format = '(I2, 4F10.3)'
   PRINT
   PRINT, '             Covariance'
   PRINT, num + 1, Format = '(4I10)'
   FOR i=0L, 3 DO $
      PRINT, num(i) + 1, cov(0, *, i), Format = '(I2, 4F10.4)'
   PRINT
   num  =  INDGEN(12)
   PRINT, '           Stats'
   FOR i=0L, 11 DO $
      PRINT, num(i) + 1, stats(i)
   PRINT
   PRINT, 'nrmiss = ', nrmiss
END
 
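; Idx_Vars: group numbers in column 0; no frequency or weight columns
idxv  =  [0, -1, -1]
; Idx_Cols: the four classification variables are in columns 1 through 4
idxc  =  [1, 2, 3, 4]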
n_groups  =  3
method  =  3
; Retrieve the Fisher Iris Data Set
x  =  STATDATA(3)
DISCR_ANALYSIS, x, n_groups, Idx_Vars = idxv, $
   Idx_cols = idxc, Method = method, /Prior_Equal, $
   Prior_Output = prior_out, Group_Counts = counts, $
   Means = means, Covariances = cov, $
   Coefficients = coef, Class_Member = cm, $
   Class_Table = table, Prob = prob, $
   Mahalanobis = d2, Stats = stats, Nmissing = nrmiss
print_results, counts, table, d2, prior_out, coef, means, $
   cov, stats, nrmiss
This results in the following output:
       Counts
    1    2    3
   50   50   50
 
        Table
      1    2    3
 1   50    0    0
 2    0   48    2
 3    0    1   49
 
           D2
      1      2      3
 1    0.0   89.9  179.4
 2   89.9    0.0   17.2
 3  179.4   17.2    0.0
 
          Prior OUT
         1         2         3
    0.3333    0.3333    0.3333
 
                         Coef
          1         2         3         4         5
 1     -86.3      23.5      23.6     -16.4     -17.4
 2     -72.9      15.7       7.1       5.2       6.4
 3    -104.4      12.4       3.7      12.8      21.1
 
                  Means
         1         2         3         4
 1     5.006     3.428     1.462     0.246
 2     5.936     2.770     4.260     1.326
 3     6.588     2.974     5.552     2.026
 
             Covariance
         1         2         3         4
 1    0.2650    0.0927    0.1675    0.0384
 2    0.0927    0.1154    0.0552    0.0327
 3    0.1675    0.0552    0.1852    0.0427
 4    0.0384    0.0327    0.0427    0.0419
 
           Stats
       1       147.000
       2       1.#QNAN
       3       1.#QNAN
       4       1.#QNAN
       5       1.#QNAN
       6       1.#QNAN
       7       1.#QNAN
       8      -9.95854
       9       50.0000
      10       50.0000
      11       50.0000
      12       150.000
 
nrmiss =            0
Warning Errors
STAT_BAD_OBS_1: In call #, row # of the data matrix, “x”, has group number = #. The group number must be an integer between 1.0 and “n_groups” = #, inclusively. This observation will be ignored.
STAT_BAD_OBS_2: The leaving-out-one method is specified but this observation does not have a valid group number (its group number is #). This observation (row #) is ignored.
STAT_BAD_OBS_3: The leaving-out-one method is specified but this observation does not have a valid weight or it does not have a valid frequency. This observation (row #) is ignored.
STAT_COV_SINGULAR_3: The group # covariance matrix is singular. “Stats(1)” cannot be computed. “Stats(1)” and “Stats(3)” are set to the missing value code (NaN).
Fatal Errors
STAT_COV_SINGULAR_1: The variance-covariance matrix for population number # is singular. The computations cannot continue.
STAT_COV_SINGULAR_2: The pooled variance-covariance matrix is singular. The computations cannot continue.
STAT_COV_SINGULAR_4: A variance-covariance matrix is singular. The index of the first zero element is equal to #.