Multiple Linear Regression
In the late 1880s, Francis Galton was studying the inheritance of physical characteristics. In particular, he wondered whether he could predict a boy's adult height from the height of his father. Galton hypothesized that the taller the father, the taller the son would be. He plotted the heights of fathers against the heights of their sons for a number of father-son pairs, then tried to fit a straight line through the data. If we denote the son's height by $H_S$ and the father's height by $H_F$, then in mathematical terms Galton wanted to determine constants $\beta_0$ and $\beta_1$ such that:

$$H_S = \beta_0 + \beta_1 H_F$$
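As a concrete illustration, here is a minimal Python sketch of such a fit, using a small made-up set of father-son heights (the numbers are assumptions for the example, not Galton's data); `np.polyfit` computes ordinary least-squares estimates of $\beta_1$ and $\beta_0$:

```python
import numpy as np

# Hypothetical father/son heights in inches (illustrative, not Galton's data).
father = np.array([63.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 74.0])
son    = np.array([65.0, 66.5, 66.0, 67.5, 68.0, 68.5, 69.5, 70.0, 71.0, 72.5])

# Fit H_S = beta0 + beta1 * H_F by ordinary least squares.
# np.polyfit returns coefficients from highest degree down: [beta1, beta0].
beta1, beta0 = np.polyfit(father, son, deg=1)
print(f"H_S = {beta0:.2f} + {beta1:.2f} * H_F")
```

In Galton's data the fitted slope turned out to be less than 1: the sons of very tall fathers tended to be shorter than their fathers, the "regression toward the mean" from which the technique takes its name.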
Galton's problem is an example of a simple linear regression problem with a single predictor variable, $H_F$. The parameter $\beta_0$ is called the intercept parameter. In general, a regression problem may involve several predictor variables. Thus the multiple linear regression problem may be stated as follows:
Let $Y$ be a random variable that can be expressed in the form:

$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1} + \varepsilon$$

where $x_1, x_2, \ldots, x_{p-1}$ are known constants and $\varepsilon$ is a random fluctuation (error) term. The problem is to estimate the parameters $\beta_j$. If the $x_j$ are varied and the $n$ values $Y_1, Y_2, \ldots, Y_n$ of $Y$ are observed, then we write:

$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i \qquad (i = 1, 2, \ldots, n)$$
where $x_{ij}$ is the $i$th value of $x_j$. Writing these $n$ equations in matrix form, we have:

$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} x_{10} & x_{11} & \cdots & x_{1,p-1} \\ x_{20} & x_{21} & \cdots & x_{2,p-1} \\ \vdots & \vdots & & \vdots \\ x_{n0} & x_{n1} & \cdots & x_{n,p-1} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

or:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where $x_{10} = x_{20} = \cdots = x_{n0} = 1$.
We call the matrix $\mathbf{X}$ the regression matrix, each $Y_i$ a response variable, $\mathbf{Y}$ the response vector, and each $x_j$ a predictor variable.
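To make the matrix formulation concrete, here is a short Python sketch (an assumed illustration, not taken from the text): it simulates data from a model with two predictors, builds the regression matrix $\mathbf{X}$ with a leading column of ones ($x_{i0} = 1$), and estimates $\boldsymbol{\beta}$ by ordinary least squares, the standard estimator for this model, via `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative setup: n = 50 observations, p - 1 = 2 predictors,
# and true parameters beta = (1.0, 2.0, -0.5), chosen only for this sketch.
n = 50
x1 = rng.uniform(0.0, 10.0, size=n)
x2 = rng.uniform(0.0, 10.0, size=n)
eps = rng.normal(0.0, 1.0, size=n)            # fluctuation errors epsilon_i
Y = 1.0 + 2.0 * x1 - 0.5 * x2 + eps           # response vector Y

# Regression matrix X: leading column of ones (x_i0 = 1), then the predictors.
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimate of beta in Y = X beta + eps.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)                               # close to [1.0, 2.0, -0.5]
```

With $n$ much larger than $p$ and independent errors, the estimate of $\boldsymbol{\beta}$ should land close to the true parameters used in the simulation.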