STEPWISE Procedure
Builds multiple linear regression models using forward, backward, or stepwise selection.
Usage
STEPWISE, x, y
Input Parameters
x—Two-dimensional array containing the data for the candidate variables.
y—Array of length N_ELEMENTS(x(*, 0)) containing the responses for the dependent variable.
Input Keywords
Double—If present and nonzero, double precision is used.
Weights—One-dimensional array containing the weight for each row of x. Default: Weights (*) = 1
Frequencies—One-dimensional array containing the frequency for each row of x. Default: Frequencies (*) = 1
First_Step, Inter_Step, Last_Step, and All_Steps—One or none of these options can be specified. If none of these is specified, the action defaults to All_Steps.
First_Step—This is the first invocation; additional calls will be made. Initialization and stepping are performed.
Inter_Step—This is an intermediate invocation. Stepping is performed.
Last_Step—This is the final invocation. Stepping and wrap-up computations are performed.
All_Steps—This is the only invocation. Initialization, stepping, and wrap-up computations are performed.
N_Steps—For nonnegative N_Steps, N_Steps steps are taken. If N_Steps = –1, stepping continues until completion. Default: N_Steps = 1
Keyword N_Steps is not referenced if All_Steps is used.
Forward, Backward, Stepwise—One or none of these options can be specified. If none is specified, the action defaults to Backward.
Forward—An attempt is made to add a variable to the model. A variable is added if its p-value is less than P_In. During initialization, only the forced variables enter the model.
Backward—An attempt is made to remove a variable from the model. A variable is removed if its p-value exceeds P_Out. During initialization, all candidate independent variables enter the model.
Stepwise—A backward step is attempted. If a variable is not removed, a forward step is attempted. This is a stepwise step. Only the forced variables enter the model during initialization.
P_In—Largest p-value for a variable entering the model. Variables with p-values less than P_In may enter the model. Default: P_In = 0.05
P_Out—Smallest p-value for removing variables. Variables with p-values greater than P_Out may leave the model. Keyword P_Out must be greater than or equal to P_In. A common choice for P_Out is 2*P_In. Default: P_Out = 0.10
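The entry and removal rules described for Forward, Backward, Stepwise, P_In, and P_Out can be sketched in Python as follows. This is an illustrative sketch only, not the STEPWISE implementation: the function names are hypothetical, priority levels and forced variables are ignored, and choosing the smallest (largest) p-value among competing variables is an assumption. The p-values in the usage lines are taken from the Example section of this document.

```python
def forward_candidate(p_values, p_in=0.05):
    """Return the variable to add, or None.

    p_values maps each variable outside the model to its p-value.
    Assumption: among competing variables, the smallest p-value wins.
    """
    if not p_values:
        return None
    best = min(p_values, key=p_values.get)
    return best if p_values[best] < p_in else None


def backward_candidate(p_values, p_out=0.10):
    """Return the variable to remove, or None.

    p_values maps each variable in the model to its p-value.
    Assumption: among competing variables, the largest p-value loses.
    """
    if not p_values:
        return None
    worst = max(p_values, key=p_values.get)
    return worst if p_values[worst] > p_out else None


def stepwise_step(in_model, out_of_model, p_in=0.05, p_out=0.10):
    """One stepwise step: try a backward step first, then a forward step."""
    removed = backward_candidate(in_model, p_out)
    if removed is not None:
        return ("remove", removed)
    added = forward_candidate(out_of_model, p_in)
    if added is not None:
        return ("add", added)
    return ("stop", None)


# p-values from the Example section: variables 1 and 2 are in the model,
# variables 3 and 4 are not.
print(stepwise_step({1: 0.0000, 2: 0.0000}, {3: 0.2089, 4: 0.2054}))
```

With the default thresholds, neither variable in the model exceeds P_Out and neither outside variable falls below P_In, so the sketch reports that no further step is possible.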
Tolerance—Tolerance used in determining linear dependence. Default: Tolerance = 100*ε, where ε is machine precision.
Level—Array of length N_ELEMENTS(x(0, *)) + 1 containing levels of priority for variables entering and leaving the regression. Each variable is assigned a positive value that indicates its level of entry into the model. A variable can enter the model only after all variables with smaller nonzero levels of entry have entered. Similarly, a variable can only leave the model after all variables with higher levels of entry have left. Variables with the same level of entry compete for entry (deletion) at each step. Level(i) = 0 means the ith variable is never to enter the model. Level(i) = –1 means the ith variable is the dependent variable. Level (N_ELEMENTS(x(0, *))) must correspond to the dependent variable, except when Cov_Input is specified. Default: 1, 1, ..., 1, –1, where –1 corresponds to Level (N_ELEMENTS(x(0, *)))
Force—Scalar integer specifying how variables are forced into the model as independent variables. Variables with levels 1, 2, ..., Force are forced into the model as independent variables. See Level.
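The priority rules for Level and Force can be illustrated with a short Python sketch. The function names and the sample Level array are hypothetical and serve only to restate the rules above; this is not how STEPWISE is implemented.

```python
def forced_variables(level, force):
    """Indices of variables whose level lies in 1, 2, ..., force."""
    return [i for i, lv in enumerate(level) if 1 <= lv <= force]


def may_enter(level, in_model):
    """Indices of variables currently allowed to compete for entry.

    A variable with a positive level may enter only after every variable
    with a smaller positive level has entered, so the eligible variables
    are those sharing the lowest positive level not yet fully in the model.
    Levels 0 (never enters) and -1 (dependent variable) are excluded.
    """
    waiting = [i for i, lv in enumerate(level) if lv > 0 and i not in in_model]
    if not waiting:
        return []
    lowest = min(level[i] for i in waiting)
    return [i for i in waiting if level[i] == lowest]


# Hypothetical setup: four candidate variables plus the dependent variable.
level = [1, 1, 2, 0, -1]
print(forced_variables(level, 1))   # variables forced when Force = 1
print(may_enter(level, {0}))        # variable 1 must enter before variable 2
print(may_enter(level, {0, 1}))     # now variable 2 may compete
```

In this setup variable 3 has level 0 and is never offered for entry, and the last element marks the dependent variable.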
Cov_Nobs—The number of observations associated with array Cov_Input. Keywords Cov_Input and Cov_Nobs must be used together.
Cov_Input—Two-dimensional square array of size (N_ELEMENTS(x(0,*)) + 1) × (N_ELEMENTS(x(0,*)) + 1) containing a variance-covariance or sum-of-squares and crossproducts matrix, in which the last column must correspond to the dependent variable.
Array Cov_Input can be computed using function COVARIANCES. Parameters x and y, and keywords Frequencies and Weights, are not accessed when this option is specified. Normally, STEPWISE computes Cov_Input from the input data matrices x and y. However, there may be cases when the user wants to calculate the covariance matrix and manipulate it before calling STEPWISE. See the Discussion section for such cases.
NOTE: Keywords Cov_Input and Cov_Nobs must be used together.
Output Keywords
Anova_Table—Named variable into which the one-dimensional array containing the analysis of variance table is stored. The analysis of variance statistics are as follows:
0—degrees of freedom for regression
1—degrees of freedom for error
2—total degrees of freedom
3—sum of squares for regression
4—sum of squares for error
5—total sum of squares
6—regression mean square
7—error mean square
8—F-statistic
9—p-value
10—R² (in percent)
11—adjusted R² (in percent)
12—estimate of the standard deviation
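The derived entries of the table are related to the degrees of freedom and sums of squares by the usual textbook definitions. The Python sketch below assumes those standard definitions (with the intercept in the model); it is not taken from the STEPWISE source, and the function name is hypothetical. The input values come from the output shown in the Example section. The p-value is omitted because it requires the F distribution.

```python
import math


def anova_from_sums(df_reg, df_err, ss_reg, ss_err):
    """Derive mean squares, F, R-squared, and the standard deviation estimate.

    Assumes the standard definitions; R-squared values are in percent.
    """
    df_tot = df_reg + df_err
    ss_tot = ss_reg + ss_err
    ms_reg = ss_reg / df_reg
    ms_err = ss_err / df_err
    return {
        "ms_reg": ms_reg,
        "ms_err": ms_err,
        "F": ms_reg / ms_err,
        "R2_percent": 100.0 * ss_reg / ss_tot,
        "adj_R2_percent": 100.0 * (1.0 - ms_err / (ss_tot / df_tot)),
        "std_dev": math.sqrt(ms_err),
    }


# Degrees of freedom and sums of squares from the Example section.
stats = anova_from_sums(2, 10, 2657.86, 57.90)
for name, value in stats.items():
    print(f"{name:16s}{value:10.2f}")
```

Because the inputs are the rounded values printed in the example, the computed F-statistic differs slightly in the last digit from the printed 229.50.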
Coef_T_Tests—Named variable into which the two-dimensional array containing statistics relating to the regression coefficients for the final model in this invocation is stored. The rows correspond to the N_ELEMENTS(x(0, *)) independent variables. The rows are in the same order as the variables in x (or, if Cov_Input is specified, the rows are in the same order as the variables in Cov_Input). Each row corresponding to a variable not in the model contains statistics for a model which includes the variables of the final model and the variable corresponding to the row in question. The columns are as follows:
0—coefficient estimate
1—estimated standard error of the coefficient estimate
2—t-statistic for the test that the coefficient is zero
3—p-value for the two-sided t test
Coef_Vif—Named variable into which the one-dimensional array containing variance inflation factors for the final model in this invocation is stored. The elements correspond to the N_ELEMENTS(x(0, *)) independent variables. The elements are in the same order as the variables in x (or, if Cov_Input is specified, the elements are in the same order as the variables in Cov_Input). Each element corresponding to a variable not in the model contains statistics for a model which includes the variables of the final model and the variable corresponding to the element in question.
The square of the multiple correlation coefficient for the ith regressor after all others can be obtained from VIF = Coef_Vif(i) as follows:
1.0 – (1.0/VIF)
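As a small numerical illustration of this relation, the following Python sketch converts a variance inflation factor into the squared multiple correlation. The function name and the sample values are hypothetical.

```python
def r_squared_from_vif(vif):
    """Squared multiple correlation of a regressor on all the others."""
    if vif < 1.0:
        raise ValueError("a variance inflation factor cannot be less than 1")
    return 1.0 - 1.0 / vif


print(r_squared_from_vif(1.0))  # 0.0: the regressor is uncorrelated with the others
print(r_squared_from_vif(4.0))  # 0.75
```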
Iend—Named variable into which an integer indicating whether additional steps are possible is stored.
0—Additional steps may be possible.
1—No additional steps are possible.
Swept—Named variable into which the one-dimensional array of length (N_ELEMENTS(x(0, *)) + 1) with information to indicate the independent variables in the model is stored. Element Swept(N_ELEMENTS(x(0, *))) usually corresponds to the dependent variable (see Level).
–1—Variable i is not in model.
1—Variable i is in model.
History—Named variable into which the one-dimensional array of length N_ELEMENTS (x(0, *)) + 1 containing the recent history of the independent variables is stored.
Element History(N_ELEMENTS(x(0, *))) usually corresponds to the dependent variable (see Level), as shown in History Variable.
Cov_Swept—Named variable into which the two-dimensional array of size (N_ELEMENTS(x(0, *)) + 1) × (N_ELEMENTS(x(0, *)) + 1) is stored. It contains the result of sweeping the covariance matrix (Cov_Input, if specified, or the matrix generated internally) on the columns corresponding to the variables in the model. The estimated variance-covariance matrix of the estimated regression coefficients in the final model can be obtained by extracting the rows and columns of Cov_Swept corresponding to the independent variables in the final model and multiplying the elements of this matrix by Anova_Table(7).
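The extraction described for Cov_Swept can be sketched in Python as follows. The function name and all numbers are hypothetical, and the sketch assumes that the last element of Swept corresponds to the dependent variable, which is the usual case described under Swept.

```python
def coefficient_covariance(cov_swept, swept, error_mean_square):
    """Return (indices, matrix) for the coefficients of variables in the model.

    cov_swept: square matrix as nested lists (stand-in for Cov_Swept).
    swept: 1 for variables in the model, -1 otherwise (stand-in for Swept);
           the last element is assumed to be the dependent variable.
    error_mean_square: stand-in for Anova_Table(7).
    """
    in_model = [i for i, flag in enumerate(swept[:-1]) if flag == 1]
    matrix = [[cov_swept[r][c] * error_mean_square for c in in_model]
              for r in in_model]
    return in_model, matrix


# Hypothetical 3 x 3 swept matrix: two independent variables, one dependent.
cov_swept = [[0.5, 0.0, 2.0],
             [0.0, 1.0, 3.0],
             [-2.0, -3.0, 13.0]]
indices, cov = coefficient_covariance(cov_swept, [1, 1, -1], 6.5)
print(indices)
print(cov)
```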
Discussion
Procedure STEPWISE builds a multiple linear regression model using forward, backward, or forward stepwise (with a backward glance) selection. Procedure STEPWISE is designed so the user can monitor, and perhaps change, the variables added to or deleted from the model after each step. In this case, multiple calls to STEPWISE (using keywords First_Step, Inter_Step, or Last_Step) are made. Alternatively, STEPWISE can be invoked once (by default, or by specifying keyword All_Steps) to perform the stepping until a final model is selected.
Levels of priority can be assigned to the candidate independent variables (use keyword Level). All variables with a priority level of 1 must enter the model before variables with a priority level of 2. Similarly, variables with a level of 2 must enter before variables with a level of 3, etc. Variables also can be forced into the model (see keyword Force). Note that specifying keyword Force without also specifying keyword Level results in all variables being forced into the model.
Typically, the intercept is forced into all models and is not a candidate variable. In this case, a sum-of-squares and crossproducts matrix for the independent and dependent variables corrected for the mean is used. Other possibilities are as follows:
The intercept is not in the model. A raw (uncorrected) sum-of-squares and crossproducts matrix for the independent and dependent variables is required as input in Cov_Input. Keyword Cov_Nobs must be set to 1 greater than the number of observations.
An intercept is to be a candidate variable. A raw (uncorrected) sum-of-squares and crossproducts matrix for the constant regressor (=1), independent variables, and dependent variable is required for Cov_Input. In this case, Cov_Input contains one additional row and column corresponding to the constant regressor. This row/column contains the sum-of-squares and crossproducts of the constant regressor with the independent and dependent variables. The remaining elements in Cov_Input are the same as in the previous case. Keyword Cov_Nobs must be set to 1 greater than the number of observations.
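A raw sum-of-squares and crossproducts matrix of the kind described in these two cases can be formed as in the Python sketch below. The function name and data are hypothetical. Placing the constant regressor in the first row and column is an assumption made for illustration; the text above requires only that the dependent variable come last.

```python
def raw_sscp(columns):
    """Raw (uncorrected) sum-of-squares and crossproducts matrix.

    columns: list of equal-length data columns; element (i, j) of the
    result is the sum over observations of columns[i] * columns[j].
    """
    n = len(columns[0])
    return [[sum(a[r] * b[r] for r in range(n)) for b in columns]
            for a in columns]


# Hypothetical data: one independent variable and the dependent variable.
x1 = [1, 2, 3]
y = [2, 4, 7]
const = [1] * len(y)  # constant regressor (= 1)

# Intercept not in the model: independent variable(s), then dependent.
print(raw_sscp([x1, y]))

# Intercept as a candidate variable: one extra row and column.
print(raw_sscp([const, x1, y]))
```

The lower-right block of the second matrix equals the first matrix, matching the statement that the remaining elements are the same as in the previous case.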
The stepwise regression algorithm is due to Efroymson (1960). Procedure STEPWISE uses sweeps of the covariance matrix (input using keyword Cov_Input, if specified, or generated internally by default) to move variables in and out of the model (Hemmerle 1967, Chapter 3). The SWEEP operator discussed in Goodnight (1979) is used. A description of the stepwise algorithm also is given by Kennedy and Gentle (1980, pp. 335–340). The advantage of stepwise model building over all possible regressions (see ALLBEST Procedure) is that it is less demanding computationally when the number of candidate independent variables is very large. However, there is no guarantee that the model selected will be the best model (highest R²) for any subset size of independent variables.
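To show how sweeping moves a variable into a model, the following Python sketch implements one common textbook form of the sweep operator. It is a minimal sketch under stated assumptions: the sign and scaling conventions used inside STEPWISE may differ, no tolerance check for linear dependence is made, and the function name and matrix are hypothetical.

```python
def sweep(a, k):
    """Sweep the square matrix a (nested lists) on pivot k; return a new matrix."""
    n = len(a)
    d = a[k][k]
    if d == 0:
        raise ZeroDivisionError("cannot sweep on a zero pivot")
    out = [row[:] for row in a]
    # Scale the pivot row.
    out[k] = [v / d for v in a[k]]
    # Eliminate the pivot column from every other row.
    for i in range(n):
        if i == k:
            continue
        b = a[i][k]
        out[i] = [a[i][j] - b * out[k][j] for j in range(n)]
        out[i][k] = -b / d
    out[k][k] = 1.0 / d
    return out


# Hypothetical crossproducts matrix: two independent variables and,
# in the last row and column, the dependent variable.
sscp = [[2, 0, 4],
        [0, 1, 3],
        [4, 3, 30]]

# Sweeping on columns 0 and 1 brings both variables into the model.
m = sweep(sweep(sscp, 0), 1)
print("coefficients:", m[0][2], m[1][2])  # 2.0 and 3.0
print("error sum of squares:", m[2][2])   # 13.0
```

After the two sweeps, the last column holds the coefficient estimates and the lower-right element holds the error sum of squares. In this form of the operator, the upper-left block holds the inverse of the crossproducts of the swept variables.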
Example
This example uses a data set from Draper and Smith (1981, pp. 629-630). Backward stepping is performed by default. First, a procedure to output the results is defined.
PRO print_results, anova_table, t, s
   ; Labels for the first 12 elements of Anova_Table.
   labels = ['df for regression ', $
      'df for error ', $
      'total df ', $
      'ss for regression ', $
      'ss for error ', $
      'total ss ', $
      'mean square for regression ', $
      'mean square error ', $
      'F-statistic ', $
      'p-value ', $
      'R-squared (in percent) ', $
      'adjusted R-squared (in percent)']
   PRINT
   PRINT, ' * * Analysis of Variance * *'
   ; Print the table.
   FOR i=0L, 11 DO PRINT, labels(i), $
      anova_table(i), Format = '(a32,f8.2)'
   PRINT
   PRINT, '* * Inference on Coefficients * *'
   PRINT, ' Estimate s.e. t' + $
      ' prob>t swept'
   ; One row per candidate variable: four statistics, then the swept flag.
   PRINT,'$(a, 4f10.4)','variable 1',t(0,*),s(0)
   PRINT,'$(a, 4f10.4)','variable 2',t(1,*),s(1)
   PRINT,'$(a, 4f10.4)','variable 3',t(2,*),s(2)
   PRINT,'$(a, 4f10.4)','variable 4',t(3,*),s(3)
END
; Define the data.
x = MAKE_ARRAY(13, 4)
x(0, *) = [7., 26., 6., 60.]
x(1, *) = [1., 29., 15., 52.]
x(2, *) = [11., 56., 8., 20.]
x(3, *) = [11., 31., 8., 47.]
x(4, *) = [7., 52., 6., 33.]
x(5, *) = [11., 55., 9., 22.]
x(6, *) = [3., 71., 17., 6.]
x(7, *) = [1., 31., 22., 44.]
x(8, *) = [2., 54., 18., 22.]
x(9, *) = [21., 47., 4., 26.]
x(10, *) = [1., 40., 23., 34.]
x(11, *) = [11., 66., 9., 12.]
x(12, *) = [10., 68., 8., 12.]
y = [78.5, 74.3, 104.3, 87.6, 95.9, $
   109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4]
; Backward selection is used because no method keyword is given.
STEPWISE, x, y, Anova_Table = anova_table, $
   Coef_T_Tests = t, Swept = s
print_results, anova_table, t, s
This results in the following output:
        * * Analysis of Variance * *
df for regression                   2.00
df for error                       10.00
total df                           12.00
ss for regression                2657.86
ss for error                       57.90
total ss                         2715.76
mean square for regression       1328.93
mean square error                   5.79
F-statistic                       229.50
p-value                             0.00
R-squared (in percent)             97.87
adjusted R-squared (in percent)    97.44

        * * Inference on Coefficients * *
            Estimate      s.e.         t    prob>t     swept
variable 1    1.4683    0.1213   12.1046    0.0000        1.
variable 2    0.6623    0.0459   14.4423    0.0000        1.
variable 3    0.2500    0.1847    1.3536    0.2089       -1.
variable 4   -0.2365    0.1733   -1.3650    0.2054       -1.
Warning Errors
STAT_LINEAR_DEPENDENCE_1—Based on Tolerance = #, there are linear dependencies among the variables to be forced.
Fatal Errors
STAT_NO_VARIABLES_ENTERED—No variables entered the model. All elements of Anova_Table are set to NaN.