8 Regression using Matrices
Everything we’ve done so far can be written in matrix form. Though it might seem no more efficient to use matrices with simple linear regression, it will become clear that with multiple linear regression, matrices are very powerful. ALSM chapter 5 contains a lot of matrix theory; the main takeaway points from the chapter concern how that matrix theory applies in the regression setting. Please make sure that you read the sections and examples dealing with the regression applications.
8.1 Special Matrices
1 = \(\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}\)
I = \(\begin{pmatrix} 1 & 0 & \cdots & 0\\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}\)
J = \(\begin{pmatrix} 1 & 1 & \cdots & 1\\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix}\)
0 = \(\begin{pmatrix} 0 & 0 & \cdots & 0\\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{pmatrix}\)
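These special matrices are easy to construct in R. A minimal sketch, with the dimension n set to 4 just for illustration (the object names ones, Imat, Jmat, and Zero are arbitrary):
n <- 4                     # dimension used for illustration
ones <- rep(1, n)          # the column vector of 1s
Imat <- diag(n)            # the n x n identity matrix I
Jmat <- matrix(1, n, n)    # the n x n matrix of all 1s, J
Zero <- matrix(0, n, n)    # the n x n matrix of all 0s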
8.2 Regression Model
8.2.1 Matrix Addition
\[\begin{eqnarray*} Y_i &=& E[Y_i] + \epsilon_i\\ &&\\ \begin{pmatrix} Y_1\\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} &=& \begin{pmatrix} E[Y_1]\\ E[Y_2] \\ \vdots \\ E[Y_n] \end{pmatrix} + \begin{pmatrix} \epsilon_1\\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}\\ &&\\ \underline{Y} &=& E[\underline{Y}] + \underline{\epsilon}\\ \end{eqnarray*}\] Similarly, \[\begin{eqnarray*} Y_i &=& \hat{Y}_i + e_i\\ &&\\ \begin{pmatrix} Y_1\\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} &=& \begin{pmatrix} \hat{Y}_1\\ \hat{Y}_2 \\ \vdots \\ \hat{Y}_n \end{pmatrix} + \begin{pmatrix} e_1\\ e_2 \\ \vdots \\ e_n \end{pmatrix}\\ &&\\ \underline{Y} &=& \underline{\hat{Y}} + \underline{e}\\ \end{eqnarray*}\]
8.3 Matrix Multiplication
8.3.1 Example
Consider multiplying an \(r \times c\) matrix with a \(c \times s\) matrix. The interior dimensions must always be the same. The resulting matrix will always be \(r \times s\). The element in the \(i^{th}\) row and \(j^{th}\) column is given by: \[\begin{eqnarray*} \sum_{k=1}^c a_{ik} b_{kj}\\ AB &=& \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix} \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{pmatrix}\\ &=& \begin{pmatrix} a_{11} b_{11} + a_{12} b_{21} + a_{13}b_{31} & a_{11} b_{12} + a_{12} b_{22} + a_{13}b_{32} \\ a_{21} b_{11} + a_{22} b_{21} + a_{23}b_{31} & a_{21} b_{12} + a_{22} b_{22} + a_{23}b_{32} \end{pmatrix}\\ &&\\ AB &=& \begin{pmatrix} 3 & -1 & 0 \\ 0 & 1 & 1 \\ -2 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 1\\ -1 & 2 \\ 0 & -1 \end{pmatrix}\\ &=& \begin{pmatrix} 4 & 1\\ -1 & 1 \\ -2 & -3 \end{pmatrix} \end{eqnarray*}\]
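As a quick check of the worked example, the same product can be computed in R with the %*% operator (matrix multiplication in R is covered again in the R section at the end of the chapter):
A <- matrix(c(3, -1, 0,
              0,  1, 1,
             -2,  0, 1), nrow = 3, byrow = TRUE)
B <- matrix(c( 1,  1,
              -1,  2,
               0, -1), nrow = 3, byrow = TRUE)
A %*% B   # the 3 x 2 product computed by hand above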
and in the linear regression context…
\[\begin{eqnarray*} E[Y_i] &=& \beta_0 + \beta_1 X_i\\ &&\\ \begin{pmatrix} E[Y_1]\\ E[Y_2] \\ \vdots \\ E[Y_n] \end{pmatrix} &=& \begin{pmatrix} 1 & X_1\\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}\\ &&\\ E[\underline{Y}] &=& X \underline{\beta}\\ \end{eqnarray*}\]
\[\begin{eqnarray*} \underline{Y}^t \underline{Y} &=& \begin{pmatrix} Y_1 & Y_2 & \cdots & Y_n \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \sum_{i=1}^n Y_i^2\\ &&\\ X^t X &=& \begin{pmatrix} 1 & 1 & \cdots & 1\\ X_1 & X_2 & \cdots & X_n \end{pmatrix} \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}= \begin{pmatrix} n & \sum_{i=1}^n X_i \\ \sum_{i=1}^n X_i & \sum_{i=1}^n X_i^2 \end{pmatrix}\\ &&\\ X^t \underline{Y} &=& \begin{pmatrix} 1 & 1 & \cdots & 1\\ X_1 & X_2 & \cdots & X_n \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^n Y_i \\ \sum_{i=1}^n X_i Y_i \end{pmatrix} \end{eqnarray*}\]
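A small numerical sketch of these identities; the vectors x and y below are made up, and any two vectors of the same length would work:
x <- c(1, 2, 4, 5)                # made-up explanatory values
y <- c(2.1, 3.9, 8.2, 9.8)        # made-up responses
X <- cbind(1, x)                  # design matrix with a column of 1s
t(X) %*% X    # matches the matrix with entries n, sum(x), sum(x), sum(x^2)
t(X) %*% y    # matches the vector with entries sum(y), sum(x * y)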
8.4 Matrix Inverses in the Regression Context
8.4.1 Matrix Inverses
An \(n \times n\) matrix \(A\) is called invertible if there exists an \(n \times n\) matrix \(B\) such that \[\begin{eqnarray*} A B = B A = I_n \end{eqnarray*}\] \(B\) is called the inverse of \(A\) and is typically denoted by \(B = A^{-1}\). (Note, inverses only exist for square matrices with non-zero determinants.)
8.4.2 Example
\[A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}\]
\[ A^{-1} = \begin{pmatrix} d / (ad - bc) & -b / (ad - bc) \\ -c / (ad - bc) & a / (ad - bc) \end{pmatrix}\]
where the determinant is given by \(D = ad - bc\). \[\begin{eqnarray*} A^{-1} A &=& \begin{pmatrix} d / (ad - bc) & -b / (ad - bc) \\ -c / (ad - bc) & a / (ad - bc) \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix}\\ &=& \begin{pmatrix} (ad - bc) / (ad - bc) & (bd - bd) / (ad - bc) \\ (-ac + ac) / (ad - bc) & (ad - bc) / (ad - bc) \end{pmatrix}\\ &=& I_2 \end{eqnarray*}\]
\[A = \begin{pmatrix} 3 & 2 \\ 1 & 6 \end{pmatrix}\]
\[ A^{-1} = \begin{pmatrix} 6 / 16 & -2 / 16 \\ -1 / 16 & 3 / 16 \end{pmatrix} \]
\[\begin{eqnarray*} A^{-1} A &=& \begin{pmatrix} 6 / 16 & -2 / 16 \\ -1 / 16 & 3 / 16 \end{pmatrix} \begin{pmatrix} 3 & 2 \\ 1 & 6 \end{pmatrix}\\ &=& \begin{pmatrix} (18-2)/16 & (12-12)/16 \\ (-3+3)/16 & (-2+18)/16 \end{pmatrix}\\ &=& I_2 \end{eqnarray*}\]
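The same inverse can be verified numerically in R with solve() (discussed further in the R section below):
A <- matrix(c(3, 2,
              1, 6), nrow = 2, byrow = TRUE)
solve(A)         # entries 6/16, -2/16, -1/16, 3/16
solve(A) %*% A   # recovers the 2 x 2 identity matrix (up to rounding)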
8.4.3 Variance of Coefficients
What is a Variance Matrix?
Consider two random variables, \(U\) and \(W\). We’d typically have \(n\) observations of each variable, resulting in data vectors \(\underline{U}\) and \(\underline{W}\). The variance-covariance matrix describing the variability relationship between \(U\) and \(W\) is given by: \[\begin{eqnarray*} \mbox{var}\{\underline{U},\underline{W} \} &=& \begin{pmatrix} \sigma_U^2 & \sigma_{U,W} \\ \sigma_{W,U} & \sigma_W^2 \end{pmatrix}\\ \mbox{where } \sigma_U^2 &=& \mbox{ variance of } U \mbox{ (a number)}\\ \sigma_W^2 &=& \mbox{ variance of } W \mbox{ (a number)}\\ \sigma_{U,W} &=& \sigma_{W,U} = \mbox{ covariance of } U \mbox{ and } W \mbox{ (a number)} \\ \end{eqnarray*}\] In order to find the variance matrix for the regression coefficients, we use matrix notation. \[\begin{eqnarray*} \mbox{Recall: } X^t X &=& \begin{pmatrix} n & \sum_{i=1}^n X_i \\ \sum_{i=1}^n X_i & \sum_{i=1}^n X_i^2 \end{pmatrix}\\ &&\\ \mbox{So, } D &=& n \sum X_i^2 - (\sum X_i)^2 = n \sum(X_i - \overline{X})^2\\ &&\\ (X^t X)^{-1} &=& \begin{pmatrix} \frac{\sum X_i^2}{n \sum(X_i - \overline{X})^2} & \frac{-\sum_{i=1}^n X_i}{n \sum(X_i - \overline{X})^2} \\ \frac{-\sum_{i=1}^n X_i}{n \sum(X_i - \overline{X})^2} & \frac{n}{n \sum(X_i - \overline{X})^2} \end{pmatrix}\\ &&\\ &=& \begin{pmatrix} \frac{1}{n} + \frac{\overline{X}^2}{\sum(X_i - \overline{X})^2} & \frac{-\overline{X}}{\sum(X_i - \overline{X})^2} \\ \frac{-\overline{X}}{ \sum(X_i - \overline{X})^2} & \frac{1}{ \sum(X_i - \overline{X})^2} \end{pmatrix}\\ &&\\ \mbox{var}\{\underline{b}\} &=& \sigma^2 \cdot (X^t X)^{-1}\\ &=& \begin{pmatrix} \sigma^2_{b_0} & \sigma_{b_0, b_1} \\ \sigma_{b_1, b_0} & \sigma^2_{b_1} \end{pmatrix}\\ SE^2\{\underline{b}\} &=& MSE \cdot (X^t X)^{-1}\\ \end{eqnarray*}\]
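A sketch of this variance formula in R, using made-up vectors x and y; lm() is used only to get the residuals for the MSE, and vcov() returns its estimated variance matrix of the coefficients for comparison:
x <- c(1, 2, 4, 5, 7)                # made-up explanatory values
y <- c(2.3, 4.1, 7.8, 10.2, 13.9)    # made-up responses
X <- cbind(1, x)                     # design matrix
fit <- lm(y ~ x)
MSE <- sum(resid(fit)^2) / (length(y) - 2)
MSE * solve(t(X) %*% X)              # SE^2{b} from the matrix formula
vcov(fit)                            # agrees with the line above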
Estimating Coefficients
Recall the equations that come from differentiating the sum of squared residuals with respect to both \(\beta_0\) and \(\beta_1\):
\[\begin{eqnarray*} n b_0 + b_1 \sum X_i &=& \sum Y_i\\ b_0 \sum X_i + b_1 \sum X_i^2 &=& \sum X_i Y_i\\ &&\\ \begin{pmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} &=& \begin{pmatrix} \sum Y_i \\ \sum X_i Y_i \end{pmatrix}\\ &&\\ (X^t X) \underline{b} &=& X^t \underline{Y}\\ \underline{b} &=& (X^t X)^{-1} (X^t \underline{Y})\\ &&\\ \mbox{var}\{\underline{b} \} &=& (X^t X)^{-1} X^t \sigma^2 I X (X^t X)^{-1}\\ &=& \sigma^2 \cdot (X^t X)^{-1}\\ &&\\ \mbox{checking:}&&\\ (X^t X)^{-1} (X^t \underline{Y}) &=& \begin{pmatrix} \frac{1}{n} + \frac{\overline{X}^2}{\sum(X_i - \overline{X})^2} & \frac{-\overline{X}}{\sum(X_i - \overline{X})^2} \\ \frac{-\overline{X}}{ \sum(X_i - \overline{X})^2} & \frac{1}{ \sum(X_i - \overline{X})^2} \end{pmatrix} \begin{pmatrix} \sum_{i=1}^n Y_i \\ \sum_{i=1}^n X_i Y_i \end{pmatrix}\\ &&\\ &=& \begin{pmatrix} \frac{\sum Y_i}{n} + \frac{\sum Y_i \overline{X}^2}{\sum(X_i - \overline{X})^2} + \frac{-\sum X_i Y_i (\overline{X})}{\sum(X_i - \overline{X})^2} \\ \frac{-\sum Y_i (\overline{X})}{ \sum(X_i - \overline{X})^2} + \frac{\sum X_i Y_i}{ \sum(X_i - \overline{X})^2} \end{pmatrix}\\ &&\\ &=& \begin{pmatrix} \overline{Y} - b_1 \overline{X} \\ \frac{\sum Y_i (X_i - \overline{X})}{\sum(X_i - \overline{X})^2} \end{pmatrix} = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} \end{eqnarray*}\]
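A minimal sketch of solving the normal equations directly, with the same kind of made-up x and y; coef(lm(y ~ x)) returns the identical estimates:
x <- c(1, 2, 4, 5, 7)                    # made-up explanatory values
y <- c(2.3, 4.1, 7.8, 10.2, 13.9)        # made-up responses
X <- cbind(1, x)                         # design matrix
b <- solve(t(X) %*% X) %*% t(X) %*% y    # b = (X^t X)^{-1} X^t Y
b
coef(lm(y ~ x))                          # same two numbers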
8.5 Fitted Values
\[\begin{eqnarray*} \hat{Y}_i &=& b_0 + b_1 X_i\\ &&\\ \begin{pmatrix} \hat{Y}_1 \\ \hat{Y}_2 \\ \vdots \\ \hat{Y}_n \end{pmatrix} &=& \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}\\ &&\\ \underline{\hat{Y}} &=& X \underline{b}\\ &=& X (X^t X)^{-1} (X^t \underline{Y})\\ &=& H \underline{Y}\\ \mbox{"hat" matrix: } H &=& X (X^t X)^{-1} X^t \end{eqnarray*}\] We call \(H\) the hat matrix because it takes \(\underline{Y}\) and puts a hat on it. Note that the fitted values are simply linear combinations of the response variable (\(Y\)), with coefficients that depend only on the explanatory variables (\(X\)).
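A sketch of the hat matrix with made-up data; H %*% y reproduces the fitted values from lm():
x <- c(1, 2, 4, 5, 7)
y <- c(2.3, 4.1, 7.8, 10.2, 13.9)
X <- cbind(1, x)
H <- X %*% solve(t(X) %*% X) %*% t(X)    # the hat matrix
H %*% y                                  # fitted values, Y-hat = H Y
fitted(lm(y ~ x))                        # same values from lm()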
8.6 Residuals
\[\begin{eqnarray*} e_i &=& Y_i - \hat{Y}_i\\ \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix} &=& \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\Y_n \end{pmatrix} - \begin{pmatrix} \hat{Y}_1 \\ \hat{Y}_2 \\ \vdots \\ \hat{Y}_n \end{pmatrix}\\ \underline{e} &=& \underline{Y} - \hat{\underline{Y}}\\ &=& \underline{Y} - X \underline{b}\\ &=& \underline{Y} - H \underline{Y}\\ &=& (I - H) \underline{Y}\\ &&\\ \mbox{var}\{ \underline{e} \} &=& \mbox{var}\{ (I-H) \underline{Y} \}\\ &=& (I - H) \mbox{var}\{ \underline{Y} \} (I - H)^t\\ &=& (I - H) \cdot \sigma^2 \cdot (I - H)^t\\ &=& \sigma^2 \cdot (I - H) (I-H^t)\\ &=& \sigma^2 \cdot (I - H - H^t + HH^t)\\ &=& \sigma^2 \cdot (I-H)\\ SE^2(\underline{e}) &=& MSE \cdot (I-H) \end{eqnarray*}\] The final simplification uses the fact that \(H\) is symmetric (\(H^t = H\)) and idempotent (\(HH = H\)), so \(H H^t = H\).
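Continuing the same made-up example, the residuals come from (I - H), and the symmetry and idempotency of H can be checked numerically:
x <- c(1, 2, 4, 5, 7)
y <- c(2.3, 4.1, 7.8, 10.2, 13.9)
X <- cbind(1, x)
H <- X %*% solve(t(X) %*% X) %*% t(X)
(diag(length(y)) - H) %*% y              # residuals, e = (I - H) Y
resid(lm(y ~ x))                         # same values from lm()
all.equal(H %*% H, H)                    # H is idempotent
all.equal(t(H), H)                       # H is symmetric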
8.7 ANalysis Of VAriance
\[\begin{eqnarray*} SSTO &=& \underline{Y}^t \underline{Y} - \bigg(\frac{1}{n} \bigg) \underline{Y}^t J \underline{Y}\\ &=& \sum Y_i ^2 - \frac{1}{n} \begin{pmatrix} Y_1 & Y_2 & \cdots & Y_n \end{pmatrix} \begin{pmatrix} 1 & 1 & \cdots & 1\\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} \\ &=& \sum Y_i ^2 - \frac{1}{n} \begin{pmatrix} Y_1 & Y_2 & \cdots & Y_n \end{pmatrix} \begin{pmatrix} \sum Y_i \\ \sum Y_i \\ \vdots \\ \sum Y_i \end{pmatrix} \\ &=& \sum Y_i ^2 - \frac{1}{n} \sum Y_i \sum Y_i\\ &=& \sum Y_i ^2 - n \overline{Y}^2\\ \mbox{note: } \sum(Y_i - \overline{Y})^2 &=& \sum (Y_i^2 - 2Y_i \overline{Y} + \overline{Y}^2)\\ &=& \sum Y_i^2 -2\overline{Y}\sum Y_i + n \overline{Y}^2\\ &=& \sum Y_i^2 - n \overline{Y}^2\\ &&\\ SSE &=& \underline{Y}^t \underline{Y} - \underline{b}^t X^t \underline{Y}\\ &&\\ SSR &=& \underline{b}^t X^t \underline{Y} - \bigg(\frac{1}{n} \bigg) \underline{Y}^t J \underline{Y} \end{eqnarray*}\]
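The sums of squares can be computed from the matrix formulas with the same made-up data; the Sum Sq column of anova(lm(y ~ x)) reports SSR and SSE for comparison:
x <- c(1, 2, 4, 5, 7)
y <- c(2.3, 4.1, 7.8, 10.2, 13.9)
n <- length(y)
X <- cbind(1, x)
J <- matrix(1, n, n)
b <- solve(t(X) %*% X) %*% t(X) %*% y
SSTO <- t(y) %*% y - (1/n) * t(y) %*% J %*% y
SSE  <- t(y) %*% y - t(b) %*% t(X) %*% y
SSR  <- t(b) %*% t(X) %*% y - (1/n) * t(y) %*% J %*% y
c(SSTO, SSE, SSR)                        # note SSTO = SSE + SSR
anova(lm(y ~ x))                         # Sum Sq column gives SSR and SSE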
8.8 Prediction of New Observations
\(\underline{X}_h = \begin{pmatrix} 1\\ X_h \end{pmatrix}\) and \(\underline{X}_h^t = \begin{pmatrix} 1 & X_h \end{pmatrix}\)
\[\begin{eqnarray*} \hat{Y}_h &=& \underline{X}^t_h \underline{b}\\ \mbox{var}\{ \hat{Y}_h \} &=& \underline{X}_h^t \mbox{var}\{\underline{b} \} \underline{X}_h\\ &=& \sigma^2 \cdot \underline{X}_h^t (X^t X)^{-1} \underline{X}_h\\ SE^2\{\hat{Y}_h \} &=& MSE \cdot \underline{X}_h^t (X^t X)^{-1} \underline{X}_h\\ &&\\ &&\\ SE^2 \{\mbox{pred} \} = SE^2\{\hat{Y}_{h(new)} \} &=& MSE \cdot (1 + \underline{X}_h^t (X^t X)^{-1} \underline{X}_h)\\ \end{eqnarray*}\]
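A sketch of the prediction formulas with made-up data and a new value \(X_h = 3\); predict() with se.fit = TRUE returns the standard error of the estimated mean response, which matches the matrix formula:
x <- c(1, 2, 4, 5, 7)
y <- c(2.3, 4.1, 7.8, 10.2, 13.9)
X <- cbind(1, x)
fit <- lm(y ~ x)
MSE <- sum(resid(fit)^2) / (length(y) - 2)
Xh <- c(1, 3)                                               # new point, X_h = 3
t(Xh) %*% coef(fit)                                         # Y-hat_h
sqrt(MSE * t(Xh) %*% solve(t(X) %*% X) %*% Xh)              # SE{Y-hat_h}
sqrt(MSE * (1 + t(Xh) %*% solve(t(X) %*% X) %*% Xh))        # SE{pred}
predict(fit, newdata = data.frame(x = 3), se.fit = TRUE)$se.fit  # matches SE{Y-hat_h}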
8.9 Reflection Questions
- How is the normal errors linear model written in matrix form?
- Why does the X matrix have an added column of ones?
- Using matrix notation, how are the normal equations used to solve for the least squares estimates of \(\beta_0\) and \(\beta_1\)?
- What is a covariance term? What does it mean for variables to be correlated (e.g., \(b_0\) and \(b_1\))?
- Why is the “hat” matrix (\(H\)) so named?
8.10 Ethics Considerations
- Why are the sampling distributions of \(b_0\) and \(b_1\) correlated?
- In what ways does writing the linear model in matrix notation make the extension to more explanatory variables easier?
- Are there any fundamental differences between the SLR model written in matrix notation and the model written by indexing the elements in each equation?
8.11 R: Matrices
8.11.1 Addition
Adding matrices gives you what you’d likely expect.
matrix1 <- matrix(1:12, ncol = 4, byrow = TRUE)               # 3 x 4, filled row by row
matrix2 <- matrix(seq(2, 24, by = 2), ncol = 4, byrow = TRUE) # 3 x 4, the even numbers 2 to 24
matrix1
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
matrix2
## [,1] [,2] [,3] [,4]
## [1,] 2 4 6 8
## [2,] 10 12 14 16
## [3,] 18 20 22 24
matrix1 + matrix2
## [,1] [,2] [,3] [,4]
## [1,] 3 6 9 12
## [2,] 15 18 21 24
## [3,] 27 30 33 36
8.11.2 Multiplication
The * operator produces element-by-element multiplication.
matrix1 * matrix2
## [,1] [,2] [,3] [,4]
## [1,] 2 8 18 32
## [2,] 50 72 98 128
## [3,] 162 200 242 288
The %*% operator does matrix multiplication. Note the error here: you can’t multiply a \(3 \times 4\) matrix by a \(3 \times 4\) matrix because the interior dimensions don’t match.
matrix1 %*% matrix2
## Error in matrix1 %*% matrix2: non-conformable arguments
If either of the matrices is transposed, they will be conformable, and we will be able to multiply them.
matrix1 %*% t(matrix2)
## [,1] [,2] [,3]
## [1,] 60 140 220
## [2,] 140 348 556
## [3,] 220 556 892
t(matrix1) %*% matrix2
## [,1] [,2] [,3] [,4]
## [1,] 214 244 274 304
## [2,] 244 280 316 352
## [3,] 274 316 358 400
## [4,] 304 352 400 448
Note that the products are only symmetric here because matrix2 \(= 2 \cdot\) matrix1.
8.11.3 Taking Inverses
The function for inverting matrices in R is solve(). Remember that solve() only works on square matrices with non-zero determinants.
matrix3 <- matrix(c(5, 7, 1, 4, 3, 6, 2, 0, 8), ncol = 3, byrow = TRUE)
matrix3
## [,1] [,2] [,3]
## [1,] 5 7 1
## [2,] 4 3 6
## [3,] 2 0 8
solve(matrix3)
## [,1] [,2] [,3]
## [1,] -0.923 2.154 -1.5
## [2,] 0.769 -1.462 1.0
## [3,] 0.231 -0.538 0.5
Multiplying a matrix by its inverse results in the identity matrix. The off-diagonal entries below are on the order of \(10^{-16}\) rather than exactly zero because of floating-point rounding.
solve(matrix3) %*% matrix3
## [,1] [,2] [,3]
## [1,] 1.00e+00 -2.22e-16 1.11e-16
## [2,] 4.44e-16 1.00e+00 0.00e+00
## [3,] 4.44e-16 0.00e+00 1.00e+00
8.11.4 Concatenating 1s for the intercept
Let’s say we have an explanatory variable called xvar, entered here as a numeric vector.
xvar <- c(47.3, 43.1, 47.1, 48.0, 49.7, 45.2, 47.7, 47.1, 47.9, 48.9, 43.4, 44.1)
xvar
## [1] 47.3 43.1 47.1 48.0 49.7 45.2 47.7 47.1 47.9 48.9 43.4 44.1
rep(1,12) # repeats the number 1, 12 times
## [1] 1 1 1 1 1 1 1 1 1 1 1 1
cbind(int = rep(1, 12), xvar) # bind the column of 1s to xvar
## int xvar
## [1,] 1 47.3
## [2,] 1 43.1
## [3,] 1 47.1
## [4,] 1 48.0
## [5,] 1 49.7
## [6,] 1 45.2
## [7,] 1 47.7
## [8,] 1 47.1
## [9,] 1 47.9
## [10,] 1 48.9
## [11,] 1 43.4
## [12,] 1 44.1
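With the design matrix built, the coefficient estimates come straight from the normal equations. The response vector yvar below is hypothetical (simulated only to show the pattern); in practice it would be the observed responses paired with xvar:
X <- cbind(int = rep(1, 12), xvar)       # the design matrix from above
yvar <- rnorm(12, mean = 100, sd = 5)    # hypothetical response values
solve(t(X) %*% X) %*% t(X) %*% yvar      # b = (X^t X)^{-1} X^t Y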