Linear Regression in Matrix Notation
Notation

Let $y$ be the $n \times 1$ vector of observed responses, $X$ the $n \times (p+1)$ design matrix whose first column is all ones, $\beta$ the $(p+1) \times 1$ vector of coefficients, and $\varepsilon$ the $n \times 1$ vector of errors. The model is

$$y = X\beta + \varepsilon.$$

The least squares estimators are

$$\hat{\beta} = (X^T X)^{-1} X^T y.$$

Define the vectors of fitted values and residuals as

$$\hat{y} = X\hat{\beta}, \qquad e = y - \hat{y},$$

and it follows that

$$X^T e = 0.$$
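As a concrete sketch, these formulas can be computed directly; the following minimal numpy example uses simulated data, with the design, true coefficients, and noise scale assumed purely for illustration.

```python
import numpy as np

# Minimal sketch of the least squares formulas on simulated data; the
# design, true coefficients, and noise scale are illustrative assumptions.
rng = np.random.default_rng(0)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.5, size=n)

# beta_hat = (X^T X)^{-1} X^T y, via the normal equations (solving is
# numerically preferable to forming the inverse explicitly).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat            # fitted values
e = y - y_hat                   # residuals
print(np.allclose(X.T @ e, 0))  # True: residuals are orthogonal to the columns of X
```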
The Hat Matrix
Note that

$$\hat{y} = X\hat{\beta} = X(X^T X)^{-1} X^T y = Hy,$$

where

$$H = X(X^T X)^{-1} X^T$$

is called the hat matrix, because it puts a "hat" on $y$.
The diagonal entries $h_{ii}$ of $H$ are referred to as the leverages.
Mathematically, $H$ is a projection matrix: it "projects" the vector of observed responses $y$ onto the space of all vectors that are linear combinations of the columns of $X$ (the column space of $X$). Note also that $H$ is symmetric and idempotent, meaning that $H^T = H$ and $H^2 = H$.
$I - H$ projects the vector $y$ onto the null space of $X^T$ (the orthogonal complement of the column space of $X$), because

$$(I - H)y = y - \hat{y} = e \qquad \text{and} \qquad X^T e = 0.$$

Note that $I - H$ is also symmetric and idempotent.
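These projection properties are easy to verify numerically; here is a short sketch under the same kind of simulated setup (all data-generating values are illustrative assumptions).

```python
import numpy as np

# Sketch: numerical checks of the hat matrix properties (the simulated
# data-generating values are illustrative assumptions).
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # H = X (X^T X)^{-1} X^T
y_hat = H @ y                           # H puts a "hat" on y
e = y - y_hat

print(np.allclose(H, H.T))              # H is symmetric
print(np.allclose(H @ H, H))            # H is idempotent
leverages = np.diag(H)                  # the leverages h_ii
print(np.allclose((np.eye(n) - H) @ y, e))  # (I - H) y gives the residuals
```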
Additional Results
The variance of $\hat{\beta}$ is

$$\operatorname{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}.$$
If $\varepsilon$ is assumed normal with mean zero and variance $\sigma^2 I$, the following hold:
- $\hat{\beta}$ is the maximum likelihood estimator of $\beta$
- $y$ is multivariate normal with mean $X\beta$ and covariance $\sigma^2 I$
- $\hat{\beta}$ is multivariate normal with mean $\beta$ and covariance $\sigma^2 (X^T X)^{-1}$
- $\hat{y}$ is multivariate normal with mean $X\beta$ and covariance $\sigma^2 H$
- $e$ is multivariate normal with mean zero and covariance $\sigma^2 (I - H)$
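As a sanity check on the distribution of $\hat{\beta}$, here is a small simulation sketch comparing its empirical covariance with $\sigma^2 (X^T X)^{-1}$; the design, $\sigma$, and replication count are illustrative assumptions.

```python
import numpy as np

# Simulation sketch: the empirical covariance of beta_hat over repeated
# normal errors should match sigma^2 (X^T X)^{-1}. The design, sigma, and
# replication count are illustrative assumptions.
rng = np.random.default_rng(1)
n, sigma = 100, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -0.5])
XtX_inv = np.linalg.inv(X.T @ X)

draws = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)  # fresh normal errors
    draws.append(XtX_inv @ X.T @ y)                 # beta_hat for this sample

print(np.round(np.cov(np.array(draws).T), 4))       # empirical covariance
print(np.round(sigma**2 * XtX_inv, 4))              # theoretical covariance
```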
Degrees of Freedom
Suppose the design matrix $X$ has $p + 1$ columns ($p$ independent variables and one intercept); then there are $n - p - 1$ degrees of freedom in the residuals. Therefore:
- The unbiased estimator of $\sigma^2$ is

  $$\hat{\sigma}^2 = \frac{e^T e}{n - p - 1}.$$

- When the errors are i.i.d. normal, the statistic

  $$\frac{\hat{\beta}_j - \beta_j}{\widehat{\operatorname{se}}(\hat{\beta}_j)}, \qquad \widehat{\operatorname{se}}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[(X^T X)^{-1}\right]_{jj}},$$

  has the t-distribution with $n - p - 1$ degrees of freedom. Hence, a $(1 - \alpha)$ confidence interval for $\beta_j$ is formed as

  $$\hat{\beta}_j \pm t_{\alpha/2,\, n-p-1}\, \widehat{\operatorname{se}}(\hat{\beta}_j),$$

  and hypothesis tests concerning $\beta_j$ should compare the test statistic $\hat{\beta}_j / \widehat{\operatorname{se}}(\hat{\beta}_j)$ (the "t value") with the t-distribution with $n - p - 1$ degrees of freedom.

- For large $n$, we can appeal to the central limit theorem and construct an approximate confidence interval for $\beta_j$ using

  $$\hat{\beta}_j \pm z_{\alpha/2}\, \widehat{\operatorname{se}}(\hat{\beta}_j).$$
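Putting these formulas together, here is a minimal sketch; the simulated data are again assumed for illustration, and the t quantile comes from scipy.

```python
import numpy as np
from scipy import stats

# Minimal sketch: residual df, sigma^2 estimate, standard errors, and
# t confidence intervals. All data-generating values are illustrative.
rng = np.random.default_rng(0)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat

df = n - p - 1                               # residual degrees of freedom
sigma2_hat = e @ e / df                      # unbiased estimator of sigma^2
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))  # se(beta_hat_j)

t_crit = stats.t.ppf(0.975, df)              # for a 95% interval
for j in range(p + 1):
    lo, hi = beta_hat[j] - t_crit * se[j], beta_hat[j] + t_crit * se[j]
    print(f"beta_{j}: {beta_hat[j]:.3f}, 95% CI ({lo:.3f}, {hi:.3f}), "
          f"t value = {beta_hat[j] / se[j]:.2f}")
```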
Robust Regression
In the least squares approach to regression, we seek $\beta$ that minimizes the residual sum of squares

$$\sum_{i=1}^{n} (y_i - x_i^T \beta)^2,$$

which is equivalent to minimizing

$$\sum_{i=1}^{n} \rho(y_i - x_i^T \beta),$$

where $\rho(x) = x^2$.

The general concern with $\rho(x) = x^2$ is that it may place too much weight on extreme observations, because $x^2$ increases much more quickly than $|x|$. This choice may be optimal when the errors are normal, but it is not very robust to deviations from this assumption.
There are many possible choices for $\rho$; for example, the Huber loss function,

$$\rho_c(x) = \begin{cases} x^2 & \text{if } |x| \le c, \\ 2c|x| - c^2 & \text{if } |x| > c, \end{cases}$$

where $c$ is set by the user. For technical reasons (it gives roughly 95% efficiency when the errors really are normal), the default choice is $c = 1.345$.
- If $c$ is chosen large, the Huber loss function gets closer to the least squares result, since then $\rho_c(x) = x^2$ for most residuals.
- If $c$ is chosen closer to 0, the result is similar to minimizing $\rho(x) = |x|$ (i.e. L1 regression).
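A standard way to minimize the Huber criterion is iteratively reweighted least squares (IRLS): observations with $|e_i| \le c$ get weight 1 and the rest get weight $c/|e_i|$. The sketch below is a minimal version with illustrative data, a fixed iteration count, and no rescaling of residuals by a robust estimate of $\sigma$ (which a full implementation would include).

```python
import numpy as np

# Minimal sketch of Huber regression via iteratively reweighted least
# squares (IRLS). The data, fixed iteration count, and use of unscaled
# residuals in the weights are illustrative simplifications.
def huber_irls(X, y, c=1.345, n_iter=50):
    beta = np.linalg.solve(X.T @ X, X.T @ y)   # start from least squares
    for _ in range(n_iter):
        r = y - X @ beta
        # Huber weights: 1 inside [-c, c], c/|r| outside, so extreme
        # observations are downweighted rather than squared.
        w = np.minimum(1.0, c / np.maximum(np.abs(r), 1e-12))
        XtW = X.T * w                          # reweight each observation
        beta = np.linalg.solve(XtW @ X, XtW @ y)
    return beta

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)
y[:5] += 10                                    # contaminate with extreme observations

print(np.linalg.solve(X.T @ X, X.T @ y))       # least squares: pulled toward outliers
print(huber_irls(X, y))                        # Huber: closer to the truth (1, 2)
```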
Example
It can be observed that in both instances above, the Huber loss function effectively downweights the extreme observations and, as a result, yields a closer approximation to the truth.

Robust regression (using a loss function other than $\rho(x) = x^2$) is useful when it is suspected that the error distribution is not normal and there is concern over extreme observations.