Data Mining Introduction

PV-WAVE Advantage > IMSL Statistics Reference Guide > Data Mining > Data Mining Introduction

Data mining is a collection of statistical and analytical methods for extracting useful information from large databases. The problem of extracting information from large databases is common to government, industry, engineering and sciences.

Data Filtering

The first step in this process is to filter data from its raw form into formats required by sophisticated analytical algorithms.

Data fall into two major categories: continuous and categorical. Many algorithms, such as neural network forecasting, perform better if continuous data are mapped into a common scale. The SCALE_FILTER Function implements several techniques for automatically scaling continuous data, including several variations of z-score scaling. If the continuous data represent a time series, the TIME_SERIES_FILTER Function and the TIME_SERIES_CLASS_FILTER Function can be used to create a matrix of lagged values required as input to forecasting neural networks.

Categorical data must also be mapped into a corresponding numerical representation before they can be used in solving forecasting and classification problems. There are two types of categorical data: ordinal and nominal. Ordinal data have a natural ordering among the categories, such as a school grade. Nominal data are categories without a natural ordering, such as eye color. The UNSUPERVISED_ORDINAL_FILTER Procedure encodes and decodes ordinal data into the range [0, 1] using cumulative percentages. The UNSUPERVISED_NOMINAL_FILTER Function uses binary encoding to map nominal data into a matrix of zeros and ones.

Genetic Algorithms

The original genetic algorithm is generally attributed to John Holland and his students from the University of Michigan. During the 1970s they investigated the use of concepts in genetics and biology in optimizing a function. Since that original work, many variations of the original algorithm have been developed by pioneers working in the interface between genetics, computer science and statistics to solve complex problems. These include traditional optimization and search problems in engineering, decision making, game solutions, engineering and pattern recognition.

The genetic algorithm operates on a population of individuals designed to represent the problem being solved. Each individual is rated according to a fitness function designed to measure how close that individual is to solving the problem. For optimization problems, the fitness function is usually constructed from the function being optimized. For other problems, the fitness function can be more complex defined only by the algorithm being investigated. A chess program, for example, might use a fitness function that scores the quality of a board position represented by each individual.

The solution represented by each individual in a population is encoded into the individual chromosome. The fitness function calculates a fitness value for each individual from the information in the individual chromosome. An investor might search for the best set of trading rules for optimizing the returns from the individual investment.

In this case, chromosomes would contain encoded representations of different variations of candidate trading rules. One binary bit might indicate whether a particular technical indicator was being used. Another part of the chromosome might be encoded to indicate how that indicator would be used to buy and sell investments. The fitness function would calculate a rate of return for each individual based upon actual historical data.

Several functions are available for building chromosomes and individuals:

GA_CHROMOSOME Function—Creates the structure for a chromosome.

GA_INDIVIDUAL Function—Creates an individual using an existing chromosome.

Solving any problem using a genetic algorithm always begins by creating a chromosome used for representing the problem. Four data types can be represented in a chromosome: binary, nominal, integer and real, or continuous attributes. Binary attributes are mapped directly into a chromosome as zeros and ones. A nominal attribute is represented as integers 0, 1, ..., k – 1, where k is the maximum number of classes for that attribute. Integer and real attributes are mapped into a binary representation by dividing the range of the attribute into a finite number of subintervals. The range and number of intervals is supplied by the user when the chromosome is constructed. Either base-2 or Gray encoding can be used to encode integer and real attributes.

By default, encoding and decoding of chromosomes is automatic. That is each individual not only carries the chromosome but it also carries the original phenotype values encoded in the chromosome. Before the fitness of an individual is evaluated by calling the user’s fitness function, the information in its chromosome is decoded into phenotype values. If this is too time consuming, automatic encoding and decoding can be disabled and done within the fitness functions. The GA_ENCODE Function and GA_DECODE Function have been provided to encode and decode the chromosome of individuals, if needed. The GA_MUTATE Function has been provided to allow users to create their own genetic algorithm instead of using the GENETIC_ALGORITHM Function.

The genetic algorithm implemented in the GENETIC_ALGORITHM Function evolves an initial population of individuals through several generations, searching for the optimum individuals. The initial population can be created using one of several functions. The easiest approach is to create a population of randomly selected individuals using the GA_RANDOM_POPULATION Function. However, in some cases it might be better to initialize the population using an array of individuals selected based upon their fitness or diversity. The GA_POPULATION Function can create a population data structure from an array of individuals.

Two populations can be merged using the GA_MERGE_POPULATION Function and individuals can be added to an existing population using the GA_GROW_POPULATION Function.

The actual search or optimization using an initial population is conducted using the GENETIC_ALGORITHM Function. This function returns the fittest individual found during the search. Also available are convergence statistics, including generation statistics, and the final population. This information can be used to evaluate the quality of the solution and if an additional search is warranted, the final population might be used as an initial population for that search.

For a detailed overview, see Genetic Algorithms Overview.

Naive Bayes

Naive Bayes is a classification algorithm. First a classifier is trained using the NAIVE_BAYES_TRAINER Function. Once this is done the NAIVE_BAYES_CLASSIFICATION Function can be used to classify patterns with unknown classifications using the trained classifier represented in an Imsls_f_naive_bayes data structure.

Classification problems can be solved using other algorithms such as discriminant analysis and neural networks. In general these alternatives have smaller classification error rates, but they are too slow for large classification problems. During training the NAIVE_BAYES_TRAINER Function uses the non-missing training data to estimate two-way correlations among the attributes. Higher order correlations are assumed to be zero. This can increase the classification error rate, but it significantly reduces the time needed to train the classifier.

In addition, the Naive Bayes algorithm is the only classification algorithm that can handle data with missing values. Other algorithms such as discriminant analysis do not allow missing values in the data. This is a significant limitation for applying other techniques to a larger database.

For a detailed overview, see Naive Bayes Overview.

Neural Networks

Neural networks can be used for forecasting and classification. A neural network data structure is first created using the MLFF_NETWORK_INIT Function and the MLFF_NETWORK Function. Although forecasting and classification neural networks are initialized and represented using the same data structure, separate functions are provided for forecasting and classification in order to make them easier to use and to reduce network training times.

Once the network architecture is created, the network can be trained using the MLFF_NETWORK_TRAINER Function for forecasting problems and the MLFF_CLASSIFICATION_TRAINER Function for classification problems. By default these algorithms initialize the network weights, but weight initialization can be controlled using the MLFF_INITIALIZE_WEIGHTS Function.

Once a network is trained either the MLFF_NETWORK_FORECAST Function or the MLFF_PATTERN_CLASSIFICATION Function is used to produce forecasts or to classify unknown patterns.

For a detailed overview, see Neural Networks Overview.