CHISQTEST Function
Performs a chi-squared goodness-of-fit test.
Usage
result = CHISQTEST(f, n_categories, x)
Input Parameters
f—Scalar string specifying the name of a user-supplied function. Function f accepts one scalar parameter and returns the value of the hypothesized cumulative distribution function at that point.
n_categories—Number of cells into which the observations are to be tallied.
x—One-dimensional array containing the vector of data elements for this test.
Returned Value
result—The p-value for the goodness-of-fit chi-squared statistic.
Input Keywords
Double—If present and nonzero, double precision is used.
N_Params_Estimated—Number of parameters estimated in computing the cumulative distribution function.
Equal_Cutpoints—If present and nonzero, equal probability cutpoints are used. Keyword Equal_Cutpoints should not be used if Cutpoints is present.
Cutpoints—Specifies the named variable containing user-defined cutpoints to be used by CHISQTEST. Keywords Cutpoints and Equal_Cutpoints cannot be used together.
Frequencies—One-dimensional array containing the frequency of each corresponding observation in x.
Lower_Bound—Lower bound of the range of the distribution. If Lower_Bound = Upper_Bound, a range on the whole real line is used (the default). If the lower and upper endpoints are different, points outside of the range of these bounds are ignored. Distributions conditional on a range can be specified when Lower_Bound and Upper_Bound are used. If Lower_Bound is specified, then Upper_Bound also must be specified. By convention, Lower_Bound is excluded from the first interval, but Upper_Bound is included in the last interval.
Upper_Bound—Upper bound of the range of the distribution. If Lower_Bound = Upper_Bound, a range on the whole real line is used (the default). If the lower and upper endpoints are different, points outside of the range of these bounds are ignored. Distributions conditional on a range can be specified when Lower_Bound and Upper_Bound are used. If Upper_Bound is specified, then Lower_Bound also must be specified. By convention, Lower_Bound is excluded from the first interval, but Upper_Bound is included in the last interval.
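As a hedged sketch of how N_Params_Estimated might be used, the following tests a normal distribution whose mean and standard deviation are estimated from the sample itself, so two parameters are declared as estimated. The function name fitted_cdf, the common block fit_params, and the variables mu and sigma are illustrative assumptions. The sketch also assumes that a common block may be declared at the main level to share values with the user-defined function.

```
.RUN
; Hypothesized CDF: normal with mean mu and standard deviation sigma,
; shared through a common block (hypothetical names).
- FUNCTION fitted_cdf, k
-    COMMON fit_params, mu, sigma
-    RETURN, NORMALCDF((k - mu) / sigma)
- END
COMMON fit_params, mu, sigma
RANDOMOPT, Set = 123457
x = RANDOM(1000, /Normal)
; Estimate the two parameters from the sample.
n = N_ELEMENTS(x)
mu = TOTAL(x) / n
sigma = SQRT(TOTAL((x - mu)^2) / (n - 1))
; Declare both estimated parameters; this is expected to lower the
; degrees of freedom returned through Df.
p_value = CHISQTEST('fitted_cdf', 10, x, N_Params_Estimated = 2, Df = df)
PM, p_value
PM, df
```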
Output Keywords
Used_Cutpoints—Named variable into which the cutpoints used by CHISQTEST are stored.
Chi_Squared—Named variable into which the chi-squared test statistic is stored.
Df—Named variable into which the degrees of freedom for the chi-squared goodness-of-fit test are stored.
Cell_Counts—Named variable into which the cell counts are stored. The cell counts are the observed frequencies in each of the n_categories cells.
Cell_Expected—Named variable into which the cell expected values are stored. The expected value of a cell is the expected count in the cell given that the hypothesized distribution is correct.
Cell_Chisq—Named variable into which an array of length n_categories containing the cell contributions to chi-squared is stored.
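Several output keywords can be requested in a single call. The following is a minimal sketch that reuses the setup from the Example section; the variable names chi2, df, counts, and expected are arbitrary choices, not names required by CHISQTEST.

```
.RUN
- FUNCTION user_cdf, k
-    RETURN, NORMALCDF(k)
- END
RANDOMOPT, Set = 123457
x = RANDOM(1000, /Normal)
; Request the test statistic, degrees of freedom, and per-cell tallies.
p_value = CHISQTEST('user_cdf', 10, x, Chi_Squared = chi2, Df = df, $
   Cell_Counts = counts, Cell_Expected = expected)
PM, p_value
PM, chi2
PM, df
; With 10 equiprobable cells and 1000 observations, each expected
; count should be about 100.
PM, expected
```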
Discussion
Function CHISQTEST performs a chi-squared goodness-of-fit test that a random sample of observations is distributed according to a specified theoretical cumulative distribution. The theoretical distribution, which may be continuous, discrete, or a mixture of discrete and continuous distributions, is specified by the user-defined function f. If the user specifies a range for the observations (keywords Lower_Bound and Upper_Bound), the test performed is conditional on that range.
Parameter n_categories gives the number of intervals into which the observations are to be divided. By default, equiprobable intervals are computed by CHISQTEST, but intervals that are not equiprobable can be specified (through the use of keyword Cutpoints).
Regardless of the method used to obtain the cutpoints, the intervals are such that the lower endpoint is not included in the interval, while the upper endpoint is always included. If the cumulative distribution function has discrete elements, then user-provided cutpoints should always be used since CHISQTEST cannot determine the discrete elements in discrete distributions.
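A hedged sketch of supplying user-defined cutpoints through keyword Cutpoints follows. It assumes that n_categories cells require n_categories - 1 interior cutpoints, which is not stated in this section, and the cutpoint values themselves are arbitrary illustrative choices.

```
.RUN
- FUNCTION user_cdf, k
-    RETURN, NORMALCDF(k)
- END
RANDOMOPT, Set = 123457
x = RANDOM(1000, /Normal)
; Four interior cutpoints define five cells (assumed relationship):
; (-inf, -1.5], (-1.5, -0.5], (-0.5, 0.5], (0.5, 1.5], (1.5, +inf)
cuts = [-1.5, -0.5, 0.5, 1.5]
p_value = CHISQTEST('user_cdf', 5, x, Cutpoints = cuts, $
   Cell_Counts = counts)
PM, p_value
PM, counts
```

Because these cells are not equiprobable, the expected counts differ from cell to cell; they can be inspected with keyword Cell_Expected.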
By default, the lower and upper endpoints of the first and last intervals are –infinity and +infinity. The endpoints can be specified by using the keywords Lower_Bound and Upper_Bound.
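The bounds might be used as in the following sketch, which restricts the test to a finite range. The values -2.0 and 2.0 are arbitrary illustrative choices, and user_cdf still returns the unconditional cumulative distribution function.

```
.RUN
- FUNCTION user_cdf, k
-    RETURN, NORMALCDF(k)
- END
RANDOMOPT, Set = 123457
x = RANDOM(1000, /Normal)
; Test only the observations in (-2, 2]; points outside the range
; are ignored.
p_value = CHISQTEST('user_cdf', 10, x, Lower_Bound = -2.0, $
   Upper_Bound = 2.0, Cell_Counts = counts)
PM, p_value
; The counts should sum to the number of observations inside the range.
PM, TOTAL(counts)
```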
A tally of counts is maintained for the observations in x as follows:
* If the cutpoints are specified by the user, the tally is made in the interval to which $x_i$ belongs using the endpoints specified by the user.
* If the cutpoints are determined by CHISQTEST, then the cumulative probability at $x_i$, $F(x_i)$, is computed by the function f.
The tally for $x_i$ is made in interval number $\lfloor m F(x_i) + 1 \rfloor$, where m = n_categories and $\lfloor \cdot \rfloor$ is the function that takes the greatest integer that is no larger than its argument. Because this requires one evaluation of f for every observation, user-specified cutpoints may be preferred when the cumulative distribution function is expensive to compute, in order to reduce the total computing time.
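As a worked illustration, assume that the interval number in the equiprobable case is $\lfloor m F(x_i) + 1 \rfloor$, and take m = 10 with an observation for which $F(x_i) = 0.37$ (an arbitrary value chosen for this example):

```latex
\lfloor m F(x_i) + 1 \rfloor
  = \lfloor 10 \times 0.37 + 1 \rfloor
  = \lfloor 4.7 \rfloor
  = 4
```

The observation is therefore tallied in the fourth of the ten intervals, the one covering cumulative probabilities between 0.3 and 0.4.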
If the expected count in any cell is less than 1, the chi-squared approximation may be suspect, and a warning message to this effect is issued. A separate warning is issued when an expected count is less than 5.
Programming Notes
The user must supply a function f with calling sequence F(y) that returns the value of the cumulative distribution function at any point y in the (optionally) specified range.
Many of the cumulative distribution functions in the PV‑WAVE IMSL Statistics Reference can be used for f. It is, however, necessary to write a user-defined PV‑WAVE Advantage function that calls the CDF, and then pass the name of this user-defined function for f.
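For instance, a wrapper can fix the parameters of a distribution so that the resulting function takes the single scalar argument that CHISQTEST expects. In the sketch below, the name shifted_cdf and the values 2.0 and 0.5 are illustrative assumptions. The wrapper standardizes its argument and calls NORMALCDF, so that it represents a normal distribution with mean 2.0 and standard deviation 0.5.

```
.RUN
- FUNCTION shifted_cdf, k
-    ; Standardize k, then evaluate the standard normal CDF.
-    RETURN, NORMALCDF((k - 2.0) / 0.5)
- END
RANDOMOPT, Set = 123457
; Rescale standard normal deviates to mean 2.0 and standard deviation 0.5.
x = 2.0 + 0.5 * RANDOM(1000, /Normal)
p_value = CHISQTEST('shifted_cdf', 10, x)
PM, p_value
```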
Example
This example illustrates the use of CHISQTEST on a randomly generated sample from the normal distribution. One thousand randomly generated observations are tallied into 10 equiprobable intervals. With a p-value of about 0.15, the null hypothesis is not rejected at the 5% significance level.
.RUN
; Define the hypothesized, cumulative distribution function.
- FUNCTION user_cdf, k
-    RETURN, NORMALCDF(k)
- END
RANDOMOPT, Set = 123457
; Generate normal deviates.
x = RANDOM(1000, /Normal)
; Perform chi-squared test.
p_value = CHISQTEST('user_cdf', 10, x)
; Output the results.
PM, p_value
; PV-WAVE prints: 0.154603
Warning Errors
STAT_EXPECTED_VAL_LESS_THAN_1—An expected value is less than 1.
STAT_EXPECTED_VAL_LESS_THAN_5—An expected value is less than 5.
Fatal Errors
STAT_ALL_OBSERVATIONS_MISSING—All observations contain missing values.
STAT_INCORRECT_CDF_1—Function f is not a cumulative distribution function. The value at the lower bound must be nonnegative, and the value at the upper bound must not be greater than 1.
STAT_INCORRECT_CDF_2—Function f is not a cumulative distribution function. The probability of the range of the distribution is not positive.
STAT_INCORRECT_CDF_3—Function f is not a cumulative distribution function. Its evaluation at an element in x is inconsistent with either the evaluation at the lower or upper bound.
STAT_INCORRECT_CDF_4—Function f is not a cumulative distribution function. Its evaluation at a cutpoint is inconsistent with either the evaluation at the lower or upper bound.
STAT_INCORRECT_CDF_5—An error has occurred when inverting the cumulative distribution function. This function must be continuous and defined over the whole real line.