NAIVE_BAYES_TRAINER Function
Trains a Naive Bayes classifier.
Usage
result = NAIVE_BAYES_TRAINER (n_classes, classification)
Input Parameters
n_classes—A scalar long indicating the number of target classifications.
classification—Array of size n_patterns, which is the number of training patterns, containing the target classifications for the training patterns. These must be encoded from zero to n_classes – 1. Any value outside this range is considered a missing value. In this case, the data in that pattern are not used to train the Naive Bayes classifier. However, any pattern with missing values is still classified after the classifier is trained.
Returned Value
result—An array of size (n_classes+1) by 2 containing the number of classification errors and the number of non-missing classifications for each target classification plus the overall totals for these errors. For i < n_classes, the ith row contains the number of classification errors for the ith class and the number of patterns with non-missing classifications for that class. The last row contains the number of classification errors totaled over all target classifications and the total number of patterns with non-missing target classifications.
If training is unsuccessful, null is returned.
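The layout of the returned error table can be illustrated with a short sketch (Python rather than PV-WAVE, with made-up classifications and predictions; this mimics the bookkeeping described above, not the IMSL implementation):

```python
# Sketch (not IMSL code): building the (n_classes+1) x 2 error summary
# that NAIVE_BAYES_TRAINER returns, from per-pattern predictions.
def error_summary(n_classes, classification, predicted):
    # rows 0..n_classes-1: [errors, non-missing patterns] per class
    # row n_classes:       [total errors, total non-missing patterns]
    table = [[0, 0] for _ in range(n_classes + 1)]
    for target, pred in zip(classification, predicted):
        if target < 0 or target >= n_classes:
            continue                      # missing target: not counted
        table[target][1] += 1
        table[n_classes][1] += 1
        if pred != target:
            table[target][0] += 1
            table[n_classes][0] += 1
    return table

# Illustrative values only; -1 marks a missing target classification.
summary = error_summary(3, [0, 0, 1, 1, 2, -1], [0, 1, 1, 1, 2, 0])
print(summary)   # last row holds the overall totals
```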
Input Keywords
Double—If present and nonzero, then double precision is used.
Continuous—An array of size n_patterns by n_continuous containing the training values for the continuous attributes, where n_continuous is the number of continuous attributes and n_patterns is the number of training patterns. The ith row contains the input attributes for the ith training pattern. The jth column of Continuous contains the values for the jth continuous attribute. Missing values should be set equal to NaN, i.e., MACHINE(NAN). Patterns with both non-missing and missing values are used to train the classifier unless the Ignoremissing keyword is supplied. If the Continuous keyword is not supplied, n_continuous is assumed equal to zero.
N_categories—An array of length n_nominal containing the number of categories associated with each nominal attribute, where n_nominal is the number of nominal attributes. These must all be greater than zero.
Nominal—An array of size n_patterns by n_nominal containing the training values for the nominal attributes. The ith row contains the nominal input attributes for the ith pattern. The jth column of this matrix contains the classifications for the jth nominal attribute. The values for the jth nominal attribute are expected to be encoded with the integers 0 to N_categories(j) – 1. Any value outside this range is treated as a missing value. Patterns with both non-missing and missing values are used to train the classifier unless the Ignoremissing keyword is supplied. If the Nominal keyword is not supplied, n_nominal is assumed to be zero.
Print_level—A scalar long or enumeration value from Table 14-8: Print_level Values indicating the level of detail printed for data warnings and final results. Print_level accepts the following values:
 
Print_level Values

Level  Enumeration           Description
0      IMSLS_NONE            Printing of data warnings and final results
                             is suppressed.
1      IMSLS_FINAL           Prints the final summary of Naive Bayes
                             classifier training.
2      IMSLS_DATA_WARNINGS   Prints information about missing values and
                             PDF calculations equal to zero.
3      IMSLS_TRACE_ALL       Prints the final summary plus all data
                             warnings associated with missing values and
                             PDF calculations equal to zero.
Default: Print_level = 0.
Ignoremissing—By default, patterns with both missing and non-missing values are used to train the classifier. This option causes the algorithm to ignore patterns with one or more missing input attributes during training. However, classification predictions are still returned for all patterns.
Discrete_smooth—A scalar float parameter for calculating smoothed estimates of conditional probabilities for discrete attributes. This parameter must be non-negative. Default: Laplace smoothing of conditional probabilities, i.e., Discrete_smooth = 1.
Cont_smooth—A scalar float parameter for calculating smoothed estimates of conditional probabilities for continuous attributes. This parameter must be non-negative. Default: No smoothing of conditional probabilities for continuous attributes, i.e., Cont_smooth = 0.
Zero_correction—A scalar float parameter used to replace conditional probabilities equal to zero numerically. This parameter must be non-negative. Default: No correction, i.e., Zero_correction = 0.
Selected_pdf—An array of length n_continuous specifying the distribution for each continuous input attribute. Selected_pdf(i) specifies the probability density function for the ith continuous input attribute. If this keyword is not supplied, conditional probabilities for all continuous attributes are calculated using the Gaussian probability density function with its parameters estimated from the training patterns, i.e., Selected_pdf(i) = IMSLS_GAUSSIAN. This keyword allows users to select other distributions using the following encoding:
 
Selected_pdf Values

Value  Selected_pdf(i)    Probability Density Function
0      IMSLS_GAUSSIAN     Gaussian. See the Gauss_means and Gauss_stdev
                          keywords for an explanation.
1      IMSLS_LOG_NORMAL   Log-normal. See the Log_means and Log_stdev
                          keywords for an explanation.
2      IMSLS_GAMMA        Gamma. See the Gamma_a and Gamma_b keywords
                          for an explanation.
3      IMSLS_POISSON      Poisson. See the Poisson_pdf keyword for an
                          explanation.
5      IMSLS_USER         User defined. See the User_pdf keyword for an
                          explanation.
Gauss_means—An array of size n_gauss by n_classes where n_gauss represents the number of Gaussian attributes as specified by the keyword Selected_pdf (i.e., the number of elements in Selected_pdf equal to IMSLS_GAUSSIAN). The ith row of Gauss_means contains the means for the ith Gaussian attribute in Continuous for each value of the target classification. Gauss_means(i*n_classes+j) is used as the mean for the ith Gaussian attribute when the target classification equals j. This keyword is ignored if n_continuous = 0. Default: The means and standard deviations for all Gaussian attributes are estimated from the means and standard deviations of the training patterns. These estimates are the traditional BLUE (Best Linear Unbiased Estimates) for the parameters of a Gaussian distribution. NOTE: This keyword must be used in conjunction with Gauss_stdev.
Gauss_stdev—An array of size n_gauss by n_classes where n_gauss represents the number of Gaussian attributes as specified by the keyword Selected_pdf (i.e., the number of elements in Selected_pdf equal to IMSLS_GAUSSIAN). The ith row of Gauss_stdev contains the standard deviations for the ith Gaussian attribute in Continuous for each value of the target classification. Gauss_stdev(i*n_classes+j) is used as the standard deviation for the ith Gaussian attribute when the target classification equals j. This keyword is ignored if n_continuous = 0. Default: The means and standard deviations for all Gaussian attributes are estimated from the means and standard deviations of the training patterns. These estimates are the traditional BLUE (Best Linear Unbiased Estimates) for the parameters of a Gaussian distribution. NOTE: This keyword must be used in conjunction with Gauss_means.
Log_means—An array of size n_logNormal by n_classes where n_logNormal represents the number of log-normal attributes as specified by the keyword Selected_pdf (i.e., the number of elements in Selected_pdf equal to IMSLS_LOG_NORMAL). The ith row of Log_means contains the means for the ith log-normal attribute for each value of the target classification. Log_means(i*n_classes+j) is used as the mean for the ith log-normal attribute when the target classification equals j. This keyword is ignored if n_continuous = 0. Default: The means and standard deviations for all log-normal attributes are estimated from the means and standard deviations of the training patterns. These estimates are the traditional MLE (Maximum Likelihood Estimates) for the parameters of a log-normal distribution. NOTE: This keyword must be used in conjunction with Log_stdev.
Log_stdev—An array of size n_logNormal by n_classes where n_logNormal represents the number of log-normal attributes as specified by the keyword Selected_pdf (i.e., the number of elements in Selected_pdf equal to IMSLS_LOG_NORMAL). The ith row of Log_stdev contains the standard deviations for the ith log-normal attribute for each value of the target classification. Log_stdev(i*n_classes+j) is used as the standard deviation for the ith log-normal attribute when the target classification equals j. This keyword is ignored if n_continuous = 0. Default: The means and standard deviations for all log-normal attributes are estimated from the means and standard deviations of the training patterns. These estimates are the traditional MLE (Maximum Likelihood Estimates) for the parameters of a log-normal distribution. NOTE: This keyword must be used in conjunction with Log_means.
Gamma_a—An array of size n_gamma by n_classes containing the shape parameters for the Gamma continuous attributes, where n_gamma represents the number of gamma distributed continuous variables as specified by the keyword Selected_pdf (i.e., the number of elements in Selected_pdf equal to IMSLS_GAMMA). The ith row of Gamma_a contains the shape parameter for the ith Gamma attribute for each value of the target classification. Gamma_a(i*n_classes+j) is used as the shape parameter for the ith Gamma attribute when the target classification equals j. This keyword is ignored if n_continuous = 0. Default: The shape and scale parameters for all Gamma attributes are estimated from the training patterns. These estimates are the traditional MLE (Maximum Likelihood Estimates) for the parameters of a Gamma distribution. NOTE: This keyword must be used in conjunction with Gamma_b.
Gamma_b—An array of size n_gamma by n_classes containing the scale parameters for the Gamma continuous attributes, where n_gamma represents the number of gamma distributed continuous variables as specified by the keyword Selected_pdf (i.e., the number of elements in Selected_pdf equal to IMSLS_GAMMA). The ith row of Gamma_b contains the scale parameter for the ith Gamma attribute for each value of the target classification. Gamma_b(i*n_classes+j) is used as the scale parameter for the ith Gamma attribute when the target classification equals j. This keyword is ignored if n_continuous = 0. Default: The shape and scale parameters for all Gamma attributes are estimated from the training patterns. These estimates are the traditional MLE (Maximum Likelihood Estimates) for the parameters of a Gamma distribution. NOTE: This keyword must be used in conjunction with Gamma_a.
Poisson_pdf—An integer array of size n_poisson by n_classes containing the means for the Poisson attributes, where n_poisson represents the number of Poisson distributed continuous variables as specified by the keyword Selected_pdf (i.e., the number of elements in Selected_pdf equal to IMSLS_POISSON). The ith row of Poisson_pdf contains the means for the ith Poisson attribute for each value of the target classification. Poisson_pdf(i*n_classes+j) is used as the mean for the ith Poisson attribute when the target classification equals j. This argument is ignored if n_continuous = 0. Default: The means (Poisson_pdf) for all Poisson attributes are estimated from the means of the training patterns. These estimates are the traditional MLE (Maximum Likelihood Estimates) for the parameters of a Poisson distribution.
User_pdf—A user-supplied function that calculates the conditional probability density for continuous input attributes; it is required when Selected_pdf(i) = IMSLS_USER. User_pdf is defined as User_pdf(index, x), where x equals Continuous(i*n_continuous+j), and index is an array of length 3 which contains the following values for i, j, and k:
*index(0)—i = pattern index
*index(1)—j = attribute index
*index(2)—k = target classification
 
The pattern index i ranges from 0 to n_patterns – 1 and identifies the pattern for x. The attribute index j ranges from 0 to n_continuous – 1, and k = classification(i). This keyword is ignored if n_continuous = 0. By default, the Gaussian PDF is used for calculating the conditional probability densities, using either the means and variances calculated from the training patterns or those supplied in Gauss_means and Gauss_stdev.
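As a sketch of this calling convention, the following Python analogue evaluates a hypothetical user-supplied density (an exponential distribution with an assumed per-class rate; the real keyword expects a PV-WAVE function, and RATE_BY_CLASS is purely illustrative):

```python
import math

# Sketch in the spirit of the User_pdf keyword (Python analogue, not
# PV-WAVE).  index carries [pattern, attribute, class]; here only the
# class entry is used, to look up a made-up per-class rate parameter.
RATE_BY_CLASS = {0: 1.0, 1: 0.5}      # hypothetical fitted parameters

def user_pdf(index, x):
    pattern, attribute, target_class = index
    rate = RATE_BY_CLASS[target_class]
    # exponential density: rate * exp(-rate * x) for x >= 0
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

print(user_pdf([0, 0, 0], 0.0))   # density of Exp(1) at 0 is 1.0
```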
Output Keywords
Means_out—An array of size n_continuous by n_classes containing the means for the continuous attributes segmented by the target classes. The structure of these matrices is identical to the structure described for the Gauss_means keyword. The ith row of Means_out contains the means of the ith continuous attribute for each value of the target classification. That is, Means_out(i*n_classes+j) is the mean for the ith continuous attribute when the target classification equals j, unless there are no training patterns for this condition. If there are no training patterns in the i, jth cell then the mean for that cell is computed using the mean for the ith continuous attribute calculated using all of its non-missing values.
Stdev_out—An array of size n_continuous by n_classes containing the standard deviations for the continuous attributes segmented by the target classes. The structure of these matrices is identical to the structure described for the Gauss_stdev keyword. The ith row of Stdev_out contains the standard deviations of the ith continuous attribute for each value of the target classification. That is, Stdev_out(i*n_classes+j) is the standard deviation for the ith continuous attribute when the target classification equals j, unless there are no training patterns for this condition. If there are no training patterns in the i, jth cell then the standard deviation for that cell is computed using the standard deviation for the ith continuous attribute calculated using all of its non-missing values. Standard deviations are estimated using the minimum variance unbiased estimator.
Predicted_class—An array of size n_patterns containing the predicted classification for each training pattern.
Pred_class_prob—An array of size n_patterns by n_classes. The values in the ith row are the predicted classification probabilities associated with the target classes. Pred_class_prob(i*n_classes+j) is the estimated probability that the ith pattern belongs to the jth target class.
Class_error—An array of size n_patterns containing the classification probability errors for each pattern in the training data. The classification error for the ith training pattern is equal to 1 – Pred_class_prob(i*n_classes+k) where k = classification(i).
Count_table—An array of size n_classes × (N_categories(0) + N_categories(1) + … + N_categories(m)), where m = n_nominal – 1, containing the training pattern counts segmented by nominal attribute, target class, and category. The entry for the ith nominal attribute, target class j, and category k is the number of training patterns with classification equal to j and the ith nominal attribute equal to k.
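The counting behind Count_table can be sketched as follows (a Python illustration of the described segmentation, not the IMSL code; a dictionary keyed by (attribute, class, category) stands in for the flat output array):

```python
# Sketch (not IMSL code): count training patterns for each combination
# of nominal attribute i, target class j, and category k.
def count_table(n_classes, n_categories, classification, nominal):
    table = {}
    for i, n_cat in enumerate(n_categories):
        for j in range(n_classes):
            for k in range(n_cat):
                table[(i, j, k)] = 0
    for pattern, target in enumerate(classification):
        if not 0 <= target < n_classes:
            continue                      # missing target classification
        for i in range(len(n_categories)):
            k = nominal[pattern][i]
            table[(i, target, k)] += 1
    return table

# One nominal attribute with two categories, two classes, three patterns.
t = count_table(2, [2], [0, 0, 1], [[0], [1], [1]])
print(t[(0, 0, 1)], t[(0, 1, 1)])
```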
Nb_classifier—An nb_classifier structure. Upon return, the structure is populated with the trained Naive Bayes classifier. This is required input to NAIVE_BAYES_CLASSIFICATION Function.
Discussion
NAIVE_BAYES_TRAINER trains a Naive Bayes classifier for classifying data into one of n_classes target classes. Input attributes can be a combination of both nominal and continuous data. Ordinal data can be treated as either nominal attributes or continuous. If the distribution of the ordinal data is known or can be approximated using one of the continuous distributions, then associating them with continuous attributes allows a user to specify that distribution. Missing values are allowed.
Let C be the classification attribute with target categories 0, 1, ..., n_classes – 1, and let X^T = {x1, x2, ..., xk} be a vector-valued array of k = n_nominal + n_continuous input attributes. The classification problem reduces to estimating the conditional probability P(C|X) from a set of training patterns. Bayes' rule states that this probability can be expressed as the ratio:

P(C = c | X = {x1, x2, ..., xk}) = P(C = c) P(X = {x1, x2, ..., xk} | C = c) / P(X = {x1, x2, ..., xk})

where c is equal to one of the target classes 0, 1, ..., n_classes – 1. In practice, the denominator of this expression is constant across all target classes, since it is only a function of the given values of X. As a result, the Naive Bayes algorithm does not expend computational time estimating P(X = {x1, x2, ..., xk}) for every pattern. Instead, a Naive Bayes classifier calculates the numerator P(C = c) P(X = {x1, x2, ..., xk} | C = c) for each target class and then classifies X to the target class with the largest value, i.e., X is classified into the class c that maximizes P(C = c) P(X = {x1, x2, ..., xk} | C = c).
The classifier simplifies this calculation by assuming conditional independence. That is, it assumes that:

P(X = {x1, x2, ..., xk} | C = c) = P(x1 | C = c) P(x2 | C = c) ... P(xk | C = c)

This is equivalent to assuming that the values of the input attributes, given C, are independent of one another, i.e.:

P(xi | xj, C = c) = P(xi | C = c), for all i ≠ j
In real-world data this assumption rarely holds, yet in many cases this approach results in surprisingly low classification error rates. Thus, the estimate of P(C = c | X = {x1, x2, ..., xk}) from a Naive Bayes classifier is generally an approximation, but classifying patterns based upon the Naive Bayes algorithm can still have acceptably low classification error rates.
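The decision rule described above can be sketched in a few lines (Python, with made-up probability tables; this illustrates the algorithm itself, not the IMSL implementation):

```python
# Sketch of the Naive Bayes decision rule under conditional independence:
# score each class by P(C=c) * prod_j P(x_j | C=c) and pick the largest.
# All probabilities below are illustrative, not fitted values.
prior = {0: 0.5, 1: 0.5}
cond = {  # cond[attribute][class][value] = P(x_attr = value | C = class)
    0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
    1: {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}},
}

def classify(x):
    scores = {}
    for c in prior:
        score = prior[c]
        for j, value in enumerate(x):
            score *= cond[j][c][value]
        scores[c] = score
    return max(scores, key=scores.get)

print(classify([0, 0]))   # both attributes favor class 0
```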
For nominal attributes, this implementation of the Naive Bayes classifier estimates conditional probabilities using a smoothed estimate:

P(xj | C = c) = (#N{xj ∩ C = c} + λ) / (#N{C = c} + λ Jj)

where #N{Z} is the number of training patterns with attribute Z, and Jj is equal to the number of categories associated with the jth nominal attribute.
The probability P(C = c) is also estimated using a smoothed estimate:

P(C = c) = (#N{C = c} + λ) / (n_patterns + λ n_classes)
These estimates correspond to the maximum a posteriori (MAP) estimates for a Dirichlet prior assuming equal priors. The smoothing parameter can be any non-negative value. Setting λ = 0 corresponds to no smoothing. The default smoothing used in this algorithm, λ = 1, is commonly referred to as Laplace smoothing. This can be changed using the keyword Discrete_smooth.
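The smoothed estimate can be sketched directly (a Python illustration of the formula above; smoothed_prob is a hypothetical helper, not an IMSL routine):

```python
# Sketch of the smoothed conditional-probability estimate for a nominal
# attribute (the Discrete_smooth keyword controls lambda):
#   P(x_j = a | C = c) = (count(x_j = a and C = c) + lambda)
#                        / (count(C = c) + lambda * n_categories_j)
def smoothed_prob(count_a_and_c, count_c, n_categories_j, lam=1.0):
    return (count_a_and_c + lam) / (count_c + lam * n_categories_j)

# With lambda = 1 (Laplace), an unseen category still gets nonzero mass:
print(smoothed_prob(0, 10, 3))          # (0 + 1) / (10 + 3) = 1/13
print(smoothed_prob(0, 10, 3, lam=0))   # no smoothing: exactly 0
```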
For continuous attributes, the conditional probability P(xj | C = c) in the Naive Bayes formula is replaced with the conditional probability density function f(xj | C = c). By default, the density function for continuous attributes is the Gaussian density function:

f(xj | C = c) = exp(–(xj – μ)² / (2σ²)) / (σ √(2π))
where μ and σ² are the conditional mean and variance, i.e., the mean and variance of xj when C = c. By default, the conditional means and standard deviations are estimated using the sample means and standard deviations of the training patterns. These are returned in the keywords Means_out and Stdev_out.
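A minimal sketch of this default, with per-class parameters estimated from training values (Python illustration; the sample values are made up):

```python
import math

# Sketch of the default Gaussian class-conditional density
#   f(x | C = c) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
# with mu and sigma estimated from the training values for one class.
def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def class_params(values):
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / (n - 1)   # sample variance
    return mu, math.sqrt(var)

# Hypothetical training values for one continuous attribute, one class.
mu, sigma = class_params([4.9, 5.1, 5.0, 5.2, 4.8])
print(round(gaussian_pdf(5.0, mu, sigma), 4))
```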
In addition to the default IMSLS_GAUSSIAN, users can select three other continuous distributions to model the continuous attributes using the argument Selected_pdf. These are the log-normal, Gamma, and Poisson distributions, selected by setting the entries in Selected_pdf to IMSLS_LOG_NORMAL, IMSLS_GAMMA, or IMSLS_POISSON. Their probability density functions are equal to:

f(xj | C = c) = exp(–(ln(xj) – μ)² / (2σ²)) / (xj σ √(2π)), xj > 0 (log-normal)

f(xj | C = c) = xj^(a–1) e^(–xj/b) / (Γ(a) b^a), xj > 0, a > 0, and b > 0 (Gamma)

and:

f(xj | C = c) = θ^xj e^(–θ) / xj!, xj = 0, 1, 2, …, θ > 0 (Poisson).
By default parameters for these distributions are estimated from the training patterns using the maximum likelihood method. However, they can also be supplied using the keywords Gauss_means, Gauss_stdev, Log_means, Log_stdev, Gamma_a, Gamma_b, and Poisson_pdf.
The default Gaussian PDF can be changed, and each continuous attribute can be assigned a different density function using the argument Selected_pdf. If any entry in Selected_pdf is equal to IMSLS_USER, the user must supply their own PDF calculation using the User_pdf keyword. Each continuous attribute can be modeled using a different distribution if appropriate.
Smoothing conditional probability calculations for continuous attributes is controlled by the Cont_smooth and Zero_correction keywords. By default conditional probability calculations for continuous attributes are unadjusted for calculations near zero. If the value of Cont_smooth is set, the algorithm adds Cont_smooth to each continuous probability calculation. This is similar to the effect of Discrete_smooth for the corresponding discrete calculations. By default Cont_smooth = 0.
The value of Zero_correction is used when (f(x|C= c) + Cont_smooth) = 0. If this condition occurs, the conditional probability is replaced with the value of Zero_correction. By default Zero_correction = 0.
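The combined effect of these two keywords can be sketched as follows (a Python illustration of the behavior described above, not the IMSL source):

```python
# Sketch of how Cont_smooth and Zero_correction adjust a continuous
# conditional density evaluation f(x | C = c):
def adjusted_density(f_value, cont_smooth=0.0, zero_correction=0.0):
    p = f_value + cont_smooth              # additive smoothing
    if p == 0.0:
        p = zero_correction                # rescue exact zeros
    return p

print(adjusted_density(0.0))                         # default: stays 0
print(adjusted_density(0.0, zero_correction=1e-9))   # replaced by 1e-9
```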
Example 1
Fisher’s (1936) Iris data is often used for benchmarking classification algorithms. It is one of the IMSL data sets and consists of the following continuous input attributes and classification target:
Continuous Attributes: X0(sepal length), X1(sepal width), X2(petal length), and X3(petal width)
Classification (Iris Type): Setosa, Versicolour, or Virginica.
This example trains a Naive Bayes classifier using 150 training patterns with these data.
PRO naive_bayes_trainer_ex1
 
  @CMAST_COMMON
  n_patterns    =150 ; 150 training patterns            
  n_continuous  =4   ; four continuous input attributes 
  n_classes     =3   ; three classification categories  
  classification = FLTARR(n_patterns) 
  continuous = FLTARR(n_patterns,n_continuous) 
 
       ; irisData[]:  The raw data matrix.  This is a 2-D  
       ;    matrix with 150 rows and 5 columns. The last 4 
       ;    columns are the continuous input attributes and 
       ;    the 1st column is the classification category 
       ;    (1-3).  These data contain no nominal input 
       ;    attributes.                           
  irisData = STATDATA(3)
 
  ; Data corrections described in the KDD data mining archive.
  irisData(34,4) = 0.1
  irisData(37,2) = 3.1
  irisData(37,3) = 1.5
  
  ; set up the required input arrays from the data matrix.
  classification(*) = irisData(*,0)-1 
  continuous(*,0:3) = irisData(*,1:4) 
  classErrors = NAIVE_BAYES_TRAINER(n_classes, classification,$
                             Continuous=continuous,$ 
                             NB_classifier=nb_classifier) 
  
  PRINT,"     Iris Classification Error Rates" 
  PRINT,"----------------------------------------------" 
  PRINT,"   Setosa  Versicolour  Virginica   |   TOTAL" 
  PRINT,STRTRIM(classErrors(0,0),2),"/",$
        STRTRIM(classErrors(0,1),2),"    ",$
        STRTRIM(classErrors(1,0),2),"/",$
        STRTRIM(classErrors(1,1),2),"    ",$
        STRTRIM(classErrors(2,0),2),"/",$
        STRTRIM(classErrors(2,1),2),"    ",$
        STRTRIM(classErrors(3,0),2),"/",$
        STRTRIM(classErrors(3,1),2)   
 
 PRINT,"----------------------------------------------" 
END
Output
For Fisher’s data, the Naive Bayes classifier incorrectly classified 6 of the 150 training patterns.
     Iris Classification Error Rates
----------------------------------------------
   Setosa  Versicolour  Virginica   |   TOTAL
    0/50      3/50         3/50     |   6/150
----------------------------------------------
Example 2
This example trains a Naive Bayes classifier using 24 training patterns with four nominal input attributes. It illustrates the output available from the keyword Print_level.
The first nominal attribute has three categories and the others each have two. The target classifications are contact lens prescriptions: hard, soft, or neither recommended. This is benchmark data from the Knowledge Discovery Databases archive maintained at the University of California, Irvine: http://archive.ics.uci.edu/ml/datasets/Lenses.
PRO naive_bayes_trainer_ex2
 
   @CMAST_COMMON
   inputData = $     ; (5,n_patterns)  DATA MATRIX  
   [[1,1,1,1,3],[1,1,1,2,2],[1,1,2,1,3],[1,1,2,2,1],$
    [1,2,1,1,3],[1,2,1,2,2],[1,2,2,1,3],[1,2,2,2,1],$
    [2,1,1,1,3],[2,1,1,2,2],[2,1,2,1,3],[2,1,2,2,1],$
    [2,2,1,1,3],[2,2,1,2,2],[2,2,2,1,3],[2,2,2,2,3],$
    [3,1,1,1,3],[3,1,1,2,3],[3,1,2,1,3],[3,1,2,2,1],$
    [3,2,1,1,3],[3,2,1,2,2],[3,2,2,1,3],[3,2,2,2,3]] 
   n_patterns    =24 ; 24 training patterns
   n_nominal     =4  ; 4 nominal input attributes
   n_classes     =3  ; three classification categories
   n_categories = [3, 2, 2, 2]
   nominal = LONARR(n_patterns,n_nominal)  
   classification = LONARR(n_patterns) 
   classLabel = ["Hard   ", "Soft   ", "Neither"] 
   inputdata=TRANSPOSE(inputdata) 
 
   ; Set up the required input arrays from the data matrix
   ; subtract 1 from the data to ensure classes start at zero.
   classification(*) = inputData(*,4)-1 
   nominal(*,0:3)= inputdata(*,0:3)-1
   classErrors = NAIVE_BAYES_TRAINER(n_classes,$ 
                               classification,$
                    N_categories=n_categories,$  
                              Nominal=nominal,$ 
                       Print_level=IMSLS_FINAL) 
END
Output
For this data, only one of the 24 training patterns is misclassified, pattern 17. The target classification for that pattern is 2 = “Neither”. However, since P(class = 2) = 0.3491 < P(class = 1) = 0.5085, pattern 17 is classified as class = 1, “Soft Contacts” recommended. The classification error for this probability is calculated as 1.0 – 0.3491 = 0.6509.
--------UNCONDITIONAL TARGET CLASS PROBABILITIES--------- 
P(Class=0) = 0.1852 P(Class=1) = 0.2222 P(Class=2) = 0.5926 
--------------------------------------------------------- 
----------------CONDITIONAL PROBABILITIES---------------- 
----------NOMINAL ATTRIBUTE 0 WITH 3 CATEGORIES---------- 
 
P(X(0)=0|Class=0)=0.4286 P(X(0)=1|Class=0)=0.2857 P(X(0)=2|Class=0)=0.2857 
P(X(0)=0|Class=1)=0.3750 P(X(0)=1|Class=1)=0.3750 P(X(0)=2|Class=1)=0.2500 
P(X(0)=0|Class=2)=0.2778 P(X(0)=1|Class=2)=0.3333 P(X(0)=2|Class=2)=0.3889 
 
--------------------------------------------------------- 
----------NOMINAL ATTRIBUTE 1 WITH 2 CATEGORIES---------- 
P(X(1)=0|Class=0) = 0.6667 P(X(1)=1|Class=0) = 0.3333 
P(X(1)=0|Class=1) = 0.4286 P(X(1)=1|Class=1) = 0.5714 
P(X(1)=0|Class=2) = 0.4706 P(X(1)=1|Class=2) = 0.5294 
--------------------------------------------------------- 
----------NOMINAL ATTRIBUTE 2 WITH 2 CATEGORIES---------- 
P(X(2)=0|Class=0) = 0.1667 P(X(2)=1|Class=0) = 0.8333 
P(X(2)=0|Class=1) = 0.8571 P(X(2)=1|Class=1) = 0.1429 
P(X(2)=0|Class=2) = 0.4706 P(X(2)=1|Class=2) = 0.5294 
--------------------------------------------------------- 
----------NOMINAL ATTRIBUTE 3 WITH 2 CATEGORIES---------- 
P(X(3)=0|Class=0) = 0.1667 P(X(3)=1|Class=0) = 0.8333 
P(X(3)=0|Class=1) = 0.1429 P(X(3)=1|Class=1) = 0.8571 
P(X(3)=0|Class=2) = 0.7647 P(X(3)=1|Class=2) = 0.2353 
--------------------------------------------------------- 
 
                                              TRAINING PREDICTED  CLASS 
PATTERN P(class=0) P(class=1) P(class=2)    CLASS    CLASS    ERROR 
---------------------------------------------------------------
    0     0.0436       0.1297       0.8267        2        2     0.1733 
    1     0.1743       0.6223       0.2034        1        1     0.3777 
    2     0.1863       0.0185       0.7952        2        2     0.2048 
    3     0.7238       0.0861       0.1901        0        0     0.2762 
    4     0.0194       0.1537       0.8269        2        2     0.1731 
    5     0.0761       0.7242       0.1997        1        1     0.2758 
    6     0.0920       0.0243       0.8836        2        2     0.1164 
    7     0.5240       0.1663       0.3096        0        0     0.4760 
    8     0.0253       0.1127       0.8621        2        2     0.1379 
    9     0.1182       0.6333       0.2484        1        1     0.3667 
   10     0.1132       0.0168       0.8699        2        2     0.1301 
   11     0.6056       0.1081       0.2863        0        0     0.3944 
   12     0.0111       0.1327       0.8562        2        2     0.1438 
   13     0.0500       0.7138       0.2362        1        1     0.2862 
   14     0.0535       0.0212       0.9252        2        2     0.0748 
   15     0.3937       0.1875       0.4188        2        2     0.5812 
   16     0.0228       0.0679       0.9092        2        2     0.0908 
   17     0.1424       0.5085       0.3491        2        1     0.6509 
   18     0.0994       0.0099       0.8907        2        2     0.1093 
   19     0.5986       0.0712       0.3301        0        0     0.4014 
   20     0.0101       0.0805       0.9093        2        2     0.0907 
   21     0.0624       0.5937       0.3439        1        1     0.4063 
   22     0.0467       0.0123       0.9410        2        2     0.0590 
   23     0.3909       0.1241       0.4850        2        2     0.5150 
---------------------------------------------------------------
CLASSIFICATION ERRORS 
Classification 0: 0/4 
Classification 1: 0/5 
Classification 2: 1/15 
Total Errors:    1/24 
Example 3
This example illustrates the power of Naive Bayes classification for text mining applications. This example uses the spam benchmark data available from the Knowledge Discovery Databases archive maintained at the University of California, Irvine: http://archive.ics.uci.edu/ml/datasets/Spambase and is one of the IMSL data sets.
These data consist of 4601 patterns, each with 57 continuous attributes and one binary classification attribute. 41% of these patterns are classified as spam and the remainder as non-spam. The first 54 continuous attributes are word or symbol percentages, i.e., percentages scaled from 0 to 100 representing the proportion of words or characters in the email that match a particular word or character. The last three continuous attributes are word lengths. For a detailed description of these data, visit the KDD archive at the above link.
In this example, the program was written to evaluate alternatives for modeling the continuous attributes. Since some are percentages and others are lengths with widely different ranges, the classification error rate can be influenced by scaling. Percentages are transformed using the arcsin/square root transformation arcsin(sqrt(x/100)). This transformation often produces a continuous attribute that is more closely approximated by a Gaussian distribution. There are a variety of possible transformations for the word length attributes. In this example, the square root transformation is compared to a classifier with no transformation.
In addition, since this Naive Bayes algorithm allows users to select individual statistical distributions for modeling continuous attributes, the Gaussian and Log Normal distributions are investigated for modeling the continuous attributes.
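The two transformations used in the program below can be sketched in Python (arcsin(sqrt(x/100)) for the percentage attributes, matching the ASIN(SQRT(...)) call in the example, and a square root for the word lengths):

```python
import math

# Sketch of the attribute transformations from Example 3: the
# arcsin/square-root transform for percentage attributes (columns 0-53)
# and a square-root transform for the word-length attributes.
def arcsin_sqrt(pct):
    # pct is a percentage in [0, 100]
    return math.asin(math.sqrt(pct / 100.0))

def sqrt_transform(length):
    return math.sqrt(length)

print(round(arcsin_sqrt(100.0), 4))   # 100% maps to pi/2 ~ 1.5708
print(round(arcsin_sqrt(0.0), 4))     # 0% maps to 0.0
```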
PRO naive_bayes_trainer_ex3 
 
   ; Define the required constants.
   @CMAST_COMMON
                                
   ; Inputs assuming all attributes, except family history,  
   ; are continuous. 
   n_patterns      = 4601  
   n_variables     = 58      ; 57 + 1 classification  
   n_classes       =  2      ; (spam or no spam)  
   n_continuous    = 57 
   selected_pdf    = LONARR(n_continuous) 
   n_spam = 0
 
   continuous = FLTARR(n_patterns, n_continuous)
   unscaledContinuous = FLTARR(n_patterns, n_continuous)
   classification = LONARR(n_patterns)
 
   ; Retrieve the data set
   spamData = STATDATA(11)
   unscaledContinuous(*,*) = spamData(*,0:n_variables-2) 
   classification(*) = spamData(*,n_variables-1)
   tmp = WHERE(classification EQ 1, n_spam)
   continuous(*,0:53) = ASIN(SQRT(spamData(*,0:53)/100.0))
   continuous(*,54:n_variables-2) = spamdata(*,54:n_variables-2)
  
   PRINT,"Number of Patterns = ", STRTRIM(n_patterns,2) 
   PRINT,"Number Classified as Spam = ", STRTRIM(n_spam,2) 
 
   classErrors = NAIVE_BAYES_TRAINER(n_classes, $
                                 classification,$  
                    Continuous=unscaledContinuous)
 
 
   PRINT,"    Unscaled Gaussian Classification Error Rates " 
   PRINT,"           No Attribute Transformations          " 
   PRINT,"     All Attributes Modeled as Gaussian Variates."
   print_error_rates,classErrors  
 
   classErrors = NAIVE_BAYES_TRAINER(n_classes,$
                                classification,$
                          Continuous=continuous) 
 
   PRINT,"    Scaled Gaussian Classification Error Rates  " 
   PRINT,"   Arsin(sqrt) transformation of first 54 Vars. " 
   PRINT,"   All Attributes Modeled as Gaussian Variates. " 
   print_error_rates,classErrors 
 
   selected_pdf(0:53) = IMSLS_GAUSSIAN  
   selected_pdf(54:n_continuous-1) = IMSLS_LOG_NORMAL 
  
   classErrors = NAIVE_BAYES_TRAINER(n_classes,$  
                                classification,$
                         Continuous=continuous,$
                      Selected_pdf=selected_pdf)
 
   PRINT,"  Gaussian/Log Normal Classification Error Rates   " 
   PRINT,"  Arcsin(sqrt) transformation of 1st 54 Attributes. " 
   PRINT," Gaussian - 1st 54 & Log Normal - last 3 Attributes" 
   print_error_rates,classErrors 
 
   ; Transform the last 3 length attributes using the square root.
   continuous(*,54:n_continuous-1) = $
                  SQRT(unscaledContinuous(*,54:n_continuous-1))
 
   selected_pdf(*) = IMSLS_GAUSSIAN
 
   classErrors = NAIVE_BAYES_TRAINER(n_classes,$  
                                classification,$ 
                         Continuous=continuous,$
                      Selected_pdf=selected_pdf)
 
   PRINT,"       Scaled Classification Error Rates         " 
   PRINT,"  Arcsin(sqrt) transformation of 1st 54 Attributes" 
   PRINT,"    sqrt() transformation for last 3 Attributes  " 
   PRINT,"   All Attributes Modeled as Gaussian Variates.  " 
   print_error_rates,classErrors
 
   selected_pdf(54:n_continuous-1) = IMSLS_LOG_NORMAL 
 
   classErrors = NAIVE_BAYES_TRAINER(n_classes,$ 
                                classification,$
                         Continuous=continuous,$
                      Selected_pdf=selected_pdf) 
 
   PRINT,"      Scaled Classification Error Rates" 
   PRINT,"  Arcsin(sqrt) transformation of 1st 54 Attributes  " 
   PRINT,"  and sqrt() transformation for last 3 Attributes  " 
   PRINT," Gaussian - 1st 54 & Log Normal - last 3 Attributes" 
   print_error_rates,classErrors
 
END
 
PRO print_error_rates, classErrors
 
 IF(SIZE(classErrors,/Ndim) EQ 2) THEN BEGIN 
   p0 = 100.0*classErrors(0,0)/classErrors(0,1)
   p1 = 100.0*classErrors(1,0)/classErrors(1,1)
   p2 = 100.0*classErrors(2,0)/classErrors(2,1)
  
   PRINT,"----------------------------------------------------"
   PRINT,"    Not Spam          Spam        |    TOTAL" 
   PRINT," ",STRTRIM(classErrors(0,0),2),"/",$
         STRTRIM(classErrors(0,1),2),"=",$
         STRING(p0,Format="(f4.1)"),"%   ",$
         STRTRIM(classErrors(1,0),2),"/",$
         STRTRIM(classErrors(1,1),2),"=",$
         STRING(p1,Format="(f4.1)"),"%   | ",$
         STRTRIM(classErrors(2,0),2),"/",$
         STRTRIM(classErrors(2,1),2),"=",$
         STRING(p2,Format="(f4.1)"),"%"
   PRINT,"----------------------------------------------------"
 
 ENDIF ELSE BEGIN
    ; Flattened (n_classes+1) x 2 result: errors in elements 0..2,
    ; non-missing pattern counts in elements 3..5.
    p0 = 100.0*classErrors(0)/classErrors(3)
    p1 = 100.0*classErrors(1)/classErrors(4)
    p2 = 100.0*classErrors(2)/classErrors(5)
 
    PRINT,"----------------------------------------------------"
    PRINT,"    Not Spam          Spam        |    TOTAL" 
    PRINT," ",STRTRIM(classErrors(0),2),"/",$
          STRTRIM(classErrors(3),2),"=",$
          STRING(p0,Format="(f4.1)"),"%   ",$
          STRTRIM(classErrors(1),2),"/",$
          STRTRIM(classErrors(4),2),"=",$
          STRING(p1,Format="(f4.1)"),"%   | ",$
          STRTRIM(classErrors(2),2),"/",$
          STRTRIM(classErrors(5),2),"=",$
          STRING(p2,Format="(f4.1)"),"%"
   PRINT,"----------------------------------------------------" 
 ENDELSE 
END
Output
If the continuous attributes are left untransformed and modeled using the Gaussian distribution, the overall classification error rate is 18.4%, with most of the errors occurring when non-spam email is classified as spam. The error rate for non-spam email is 26.6%, while only 5.6% of spam is misclassified.
The lowest overall classification error rate, 12.9%, occurs when the percentages are transformed using the arcsin/square root transformation, the word-length attributes are left untransformed, and all attributes are modeled as Gaussian variates. However, although the error rate for non-spam email is low in this case (3.0%), the error rate for spam is high, about 28%.
In the end, the best model to identify spam may depend upon which type of error is more important, incorrectly classifying non-spam email or incorrectly classifying spam.
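As a quick arithmetic check (Python used here purely for illustration), the overall error rates quoted in the program output follow directly from the error counts and pattern totals:

```python
# (errors, non-missing patterns, reported percent) for the overall totals
# of each run, taken from the program output.
cases = [
    (845, 4601, 18.4),  # untransformed, all Gaussian
    (592, 4601, 12.9),  # arcsin/sqrt percentages, all Gaussian
    (600, 4601, 13.0),  # arcsin/sqrt percentages, Gaussian + log-normal
    (669, 4601, 14.5),  # plus sqrt lengths, all Gaussian
    (675, 4601, 14.7),  # plus sqrt lengths, Gaussian + log-normal
]
for errors, total, reported in cases:
    # Reproduce the "errors/total=percent" figures, rounded to one decimal.
    assert round(100.0 * errors / total, 1) == reported
print("all quoted rates are consistent")
```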
    Data File Opened Successfully
Number of Patterns = 4601
Number Classified as Spam = 1813
 
    Unscaled Gaussian Classification Error Rates
           No Attribute Transformations
     All Attributes Modeled as Gaussian Variates.
----------------------------------------------------
    Not Spam          Spam        |    TOTAL
 743/2788=26.6%   102/1813= 5.6%  | 845/4601=18.4%
----------------------------------------------------
 
    Scaled Gaussian Classification Error Rates
   Arcsin(sqrt) transformation of first 54 Vars.
   All Attributes Modeled as Gaussian Variates.
----------------------------------------------------
    Not Spam          Spam        |    TOTAL
 84/2788= 3.0%   508/1813=28.0%   | 592/4601=12.9%
----------------------------------------------------
 
  Gaussian/Log Normal Classification Error Rates
  Arcsin(sqrt) transformation of 1st 54 Attributes.
 Gaussian - 1st 54 & Log Normal - last 3 Attributes
----------------------------------------------------
    Not Spam          Spam        |    TOTAL
 81/2788= 2.9%   519/1813=28.6%   | 600/4601=13.0%
----------------------------------------------------
 
       Scaled Classification Error Rates
  Arcsin(sqrt) transformation of 1st 54 Attributes
    sqrt() transformation for last 3 Attributes
   All Attributes Modeled as Gaussian Variates.
----------------------------------------------------
    Not Spam          Spam        |    TOTAL
 74/2788= 2.7%   595/1813=32.8%   | 669/4601=14.5%
----------------------------------------------------
 
      Scaled Classification Error Rates
  Arcsin(sqrt) transformation of 1st 54 Attributes
  and sqrt() transformation for last 3 Attributes
 Gaussian - 1st 54 & Log Normal - last 3 Attributes
----------------------------------------------------
    Not Spam          Spam        |    TOTAL
 73/2788= 2.6%   602/1813=33.2%   | 675/4601=14.7%
----------------------------------------------------