Categorical Variable
Consider a variable that indicates the highest degree obtained:
Code | Value |
0 | Did not finish high school |
1 | Finished high school |
2 | Some college |
3 | Graduated from college |
This variable is on the ordinal scale, meaning that there is a natural ordering to the levels, but ratios of values do not have meaning. If we mistakenly treat this as a variable on the ratio scale when including it in the regression model, we would be assuming that the change in expected response when going from “did not finish high school” to “some college” is twice the shift going from “did not finish high school” to “finished high school”.
Instead, for a categorical variable with $k$ levels (the example above has $k = 4$), we should add $k - 1$ parameters to the regression model. For example, consider a model including a predictor giving the number of credit cards, along with this education level predictor. The new terms correspond to “shifts” in the linear model relative to the “baseline” category (here, code 0). Let $x$ equal the number of credit cards and let $e$ denote the education code; then we can write this linear model as
$$E(Y) = \beta_0 + \beta_1 x + \beta_2 \mathbf{1}\{e = 1\} + \beta_3 \mathbf{1}\{e = 2\} + \beta_4 \mathbf{1}\{e = 3\},$$
where $\mathbf{1}\{e = j\}$ is the indicator variable for the event $e = j$.
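The indicator coding described here can be sketched in a few lines of NumPy. This is a minimal illustration, not a definitive implementation: the helper `dummy_columns`, the simulated data, and the coefficient values in `true_beta` are all assumptions made for the example.

```python
import numpy as np

def dummy_columns(codes, n_levels):
    """Return an (n, n_levels - 1) matrix of 0/1 indicators; level 0 is the baseline."""
    codes = np.asarray(codes)
    return (codes[:, None] == np.arange(1, n_levels)).astype(float)

# Hypothetical data: number of credit cards and education code (0..3).
rng = np.random.default_rng(0)
n = 200
cards = rng.integers(0, 6, size=n).astype(float)
edu = rng.integers(0, 4, size=n)

# Assumed coefficients: intercept, cards, then shifts for edu = 1, 2, 3.
# The shifts are deliberately not evenly spaced.
true_beta = np.array([10.0, 1.5, 2.0, 2.5, 7.0])

# Design matrix: intercept, numeric predictor, and k - 1 = 3 indicator columns.
X = np.column_stack([np.ones(n), cards, dummy_columns(edu, 4)])
y = X @ true_beta + rng.normal(scale=0.5, size=n)

# Ordinary least squares fit.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))
```

Because each non-baseline level gets its own coefficient, the fit can recover unevenly spaced shifts, which a single numeric coding of the education variable could not represent.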
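When there is more than one categorical variable, simply add terms for each variable. For example, suppose a model contains only two predictors: $x_1$ has 3 levels, $\{0, 1, 2\}$, and $x_2$ has 4 levels, $\{0, 1, 2, 3\}$. Then just add 5 (= 2 + 3) terms:
Linear Model Term | Corresponds to |
$\beta_1$ | $x_1 = 1$ |
$\beta_2$ | $x_1 = 2$ |
$\beta_3$ | $x_2 = 1$ |
$\beta_4$ | $x_2 = 2$ |
$\beta_5$ | $x_2 = 3$ |
The model can be written as
$$E(Y) = \beta_0 + \beta_1 \mathbf{1}\{x_1 = 1\} + \beta_2 \mathbf{1}\{x_1 = 2\} + \beta_3 \mathbf{1}\{x_2 = 1\} + \beta_4 \mathbf{1}\{x_2 = 2\} + \beta_5 \mathbf{1}\{x_2 = 3\}.$$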
Interactions
Interaction terms are added to the model by taking the product of other predictors. In the absence of any interactions, we say that the model is additive in the predictors.
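For example, consider $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$. “The model is additive in $x_1$ and $x_2$” means that the effect on the expected response when $x_1$ is increased by $\Delta$ is to increase it by $\beta_1 \Delta$, regardless of the value of $x_2$.
Consider a model with two predictor variables $x_1$ and $x_2$, with the interaction term $x_1 x_2$:
$$E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2.$$
If $x_2$ is held fixed, and $x_1$ is increased by $\Delta$, then the expected response increases by $(\beta_1 + \beta_3 x_2)\Delta$.
Why does it make sense to let the slope of $x_1$ depend linearly on $x_2$? The interaction term can be viewed as a first-order approximation to however that slope truly depends on $x_2$.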
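As a rough numerical check of the interaction idea, the sketch below fits a model with a product term and evaluates the slope in the first predictor at different fixed values of the second. The simulated data, the coefficient values in `beta`, and the helper `slope_in_x1` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Assumed coefficients: intercept, x1, x2, and the x1 * x2 interaction.
beta = np.array([1.0, 2.0, -1.0, 0.5])

# The interaction enters the design matrix as the product of the two predictors.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
y = X @ beta + rng.normal(scale=0.1, size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)

def slope_in_x1(x2_value):
    """Estimated change in expected response per unit increase in x1, at fixed x2."""
    return b[1] + b[3] * x2_value

print(slope_in_x1(0.0), slope_in_x1(2.0))
```

With these assumed coefficients the slope in the first predictor is about 2 when the second predictor is 0 and about 3 when it is 2, so the effect of one predictor is no longer the same at every value of the other.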
Heteroskedasticity
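Previously, we assumed that all of the model errors have the same variance, i.e. homoskedasticity. Heteroskedasticity means that this assumption fails: the error variance differs across observations.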
Consequences
Heteroskedasticity violates one of the assumptions of the Gauss-Markov theorem, meaning that the OLS estimators are no longer the Best Linear Unbiased Estimators (BLUE): their variance is no longer the smallest among all linear unbiased estimators.
Heteroskedasticity does not cause OLS coefficient estimates to be biased. However, it can cause the OLS estimates of the variance (and thus the standard errors) of the coefficients to be biased, possibly above or below the true population variance. Biased standard errors lead to biased inference, so the results of hypothesis tests may be wrong. For example, if OLS is performed on a heteroskedastic data set, yielding biased standard error estimates, a researcher might fail to reject a null hypothesis at a given significance level when that null hypothesis is in fact false for the population.
Handling Heteroskedasticity
There are two major strategies for dealing with heteroskedasticity.
First, the response variable can be transformed. Use $g(y)$ instead of $y$ as the response variable. The two most popular choices for $g$ are the logarithm and the square root.
The other major strategy for dealing with heteroskedasticity is to use weighted least squares. That is, instead of minimizing the residual sum of squares, one minimizes
$$\sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2,$$
where the weights $w_i$ are chosen to deemphasize those observations for which the variance is larger (for example, $w_i \propto 1/\sigma_i^2$ when the error variances $\sigma_i^2$ are known up to a constant).
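A weighted least squares fit can be sketched with plain NumPy by rescaling each row by the square root of its weight and then running ordinary least squares. The function name `weighted_least_squares`, the simulated data, and the assumption that the error standard deviation grows proportionally with the predictor are all illustrative choices.

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Minimize sum_i w_i * (y_i - x_i^T beta)^2 by rescaling rows by sqrt(w_i)."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

rng = np.random.default_rng(2)
n = 400
x = rng.uniform(1.0, 10.0, size=n)
sd = 0.3 * x                       # assumed: error sd grows with x
y = 2.0 + 3.0 * x + rng.normal(scale=sd)
X = np.column_stack([np.ones(n), x])

w = 1.0 / sd**2                    # larger variance -> smaller weight
beta_ols = weighted_least_squares(X, y, np.ones(n))  # equal weights = OLS
beta_wls = weighted_least_squares(X, y, w)
print(beta_ols, beta_wls)
```

In practice the variances are rarely known, so the weights usually come from a model of how the variance depends on the predictors; here they are taken as known purely to keep the sketch short.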
Influential Observations
We expect that each data point should have some effect on the resulting fit, but no one observation should be overly influential; we’d prefer that each contribute a roughly equal amount to determining the final estimates.
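A natural way of quantifying the influence of observation $i$ is to consider how much the fitted values change if observation $i$ is excluded from the training set, i.e., it is not used in fitting the model. This can be measured by Cook’s distance.
Let $\hat{y}_{j(i)}$ denote the fitted value for observation $j$ when using the model that excludes observation $i$ from the fitting. Cook’s distance for observation $i$ is
$$D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{(p + 1)\,\hat{\sigma}^2},$$
where $p$ is the number of predictors (so the model has $p + 1$ coefficients, including the intercept) and $\hat{\sigma}^2 = \mathrm{RSS}/(n - p - 1)$ is the estimate of the error variance from the full fit.
In the case of linear regression, we don’t have to refit models to calculate $D_i$; instead it can be calculated via the leverage term $h_{ii}$, the $i$-th diagonal element of the hat matrix $H = X(X^T X)^{-1} X^T$:
$$D_i = \frac{e_i^2}{(p + 1)\,\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2},$$
where $e_i = y_i - \hat{y}_i$ is the $i$-th residual.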
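The sketch below computes Cook's distance both ways: from the leverages of a single fit, and from the leave-one-out definition by refitting once per observation. The function names and the simulated data (including the deliberately planted outlier) are assumptions for illustration; the two results should agree up to floating-point error.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance from leverages; X must include the intercept column."""
    n, k = X.shape                          # k = p + 1 coefficients
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
    h = np.diag(H)                          # leverages h_ii
    e = y - H @ y                           # residuals
    s2 = e @ e / (n - k)                    # estimate of the error variance
    return e**2 / (k * s2) * h / (1.0 - h) ** 2

def cooks_distance_refit(X, y):
    """Same quantity from the definition: refit n times, leaving one observation out."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    s2 = np.sum((y - fitted) ** 2) / (n - k)
    d = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        d[i] = np.sum((fitted - X @ beta_i) ** 2) / (k * s2)
    return d

rng = np.random.default_rng(3)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 4.0, -3.0   # plant one high-leverage, badly fitting point
X = np.column_stack([np.ones(n), x])

d = cooks_distance(X, y)
print(np.argmax(d), d.max())
```

The leverage-based version needs only one fit, which matters when the number of observations is large.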
Handling Influential Observations
An influential observation is almost always an outlier in some sense. But it is a mistake to arbitrarily remove an observation from a training set. Instead, consider the following:
- verify that there truly is a linear relationship between the response and predictor. A nonlinear model may be able to fit more flexibly to the features that led to this outlier.
- it may make more sense to redefine the population of interest in such a way that these outlying cases are handled by a separate model.
- a transformation of the predictor and/or response can “pull in” an outlier so that it’s not so extreme.
- choose an alternative approach to estimate the $\beta$s; for example, weighted least squares or robust regression.
Making Predictions
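In the context of the simple linear regression model, if our value for the predictor is $x^*$, then we would start by finding
$$\hat{y}^* = \hat{\beta}_0 + \hat{\beta}_1 x^*.$$
In the case of simple linear regression, the variance of the error in the prediction is
$$\sigma^2 + \sigma^2\left(\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right),$$
where the two terms correspond to two sources of error:
(1) The irreducible error: scatter around the regression line.
(2) Error in the $\hat{\beta}$s: the fitted model is not exactly the same as the “true” model that generated the data.
Note that we’re assuming the underlying model is correct. In practice, $n$ is often large enough that (1) dominates.
Hence, the standard error in the prediction is
$$\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \approx \hat{\sigma},$$
where $\hat{\sigma}$ is the estimate of the error standard deviation $\sigma$.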
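The prediction and its standard error for simple linear regression can be sketched as below. The function `predict_with_se` and the simulated data are assumptions for illustration; the standard error combines the irreducible scatter with the uncertainty in the fitted line, and the residual variance is estimated with $n - 2$ degrees of freedom.

```python
import numpy as np

def predict_with_se(x, y, x_star):
    """Point prediction and standard error of the prediction error at x_star."""
    n = len(x)
    x_bar = x.mean()
    sxx = np.sum((x - x_bar) ** 2)
    # Least squares slope and intercept for simple linear regression.
    b1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x_bar
    resid = y - (b0 + b1 * x)
    sigma_hat = np.sqrt(np.sum(resid**2) / (n - 2))
    y_star = b0 + b1 * x_star
    # 1 -> irreducible error; 1/n + (x* - x_bar)^2 / sxx -> error in the fitted line.
    se = sigma_hat * np.sqrt(1.0 + 1.0 / n + (x_star - x_bar) ** 2 / sxx)
    return y_star, se, sigma_hat

rng = np.random.default_rng(4)
n = 1000
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

y_star, se, sigma_hat = predict_with_se(x, y, 5.0)
print(y_star, se, sigma_hat)
```

With a sample this large the standard error is only marginally bigger than the estimated error standard deviation, which is the sense in which the irreducible error dominates.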
Multicollinearity
It is useful to consider two different “levels” of multicollinearity.
In the extreme version of multicollinearity, there is such a strong correlation among the predictors that there are numerical problems in deriving the least squares estimators. That is, the inverse of the matrix $X^T X$ will be undefined or numerically unstable.
In the less extreme version, the correlation between the predictors makes it difficult to distinguish the individual contributions of predictors to the response. This is the more typical situation that one would encounter in practice. In fact, predictors are almost always somewhat correlated.
Testing for multicollinearity
Suppose there are $p$ predictors in the model. In order to quantify the relationship between $x_k$ and the other predictors, we fit a linear regression model with $x_k$ as the response and the other variables as the predictors:
$$x_k = \alpha_0 + \sum_{j \neq k} \alpha_j x_j + \epsilon.$$
If $x_k$ can be well-predicted by the other variables, then we have multicollinearity.
Note that multicollinearity does not require $x_k$ to be related to any one of the other predictors, but merely related to a linear combination of the other predictors.
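Suppose this regression is fit, and the coefficient of determination is obtained; call it $R_k^2$. An $R_k^2$ close to one is therefore a sign of multicollinearity. We also define the variance inflation factor (VIF) for the $k$-th predictor to be
$$\mathrm{VIF}_k = \frac{1}{1 - R_k^2}.$$
A large VIF is a sign of multicollinearity. Note that
$$\mathrm{Var}(\hat{\beta}_k) = \mathrm{VIF}_k \cdot \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2},$$
where the second factor is the variance of $\hat{\beta}_k$ if $x_k$ were the only predictor in the model (for the same error variance $\sigma^2$).
Therefore, $\mathrm{VIF}_k$ is the factor by which the variance of the estimator for $\beta_k$ increases when all of the other predictors are included in the model. For example, if $\mathrm{VIF}_k = 4$, then the variance of $\hat{\beta}_k$ is four times larger because of the other predictors.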
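The VIF calculation can be sketched directly from its definition: regress each predictor on the others, take the coefficient of determination, and invert one minus that value. The function `vif` and the simulated predictors (two nearly identical, one unrelated) are assumptions made for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (predictors only, no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for k in range(p):
        target = X[:, k]
        # Regress predictor k on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, k, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[k] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)   # assumed: x2 is nearly a copy of x1
x3 = rng.normal(size=n)                   # unrelated to the others

v = vif(np.column_stack([x1, x2, x3]))
print(np.round(v, 2))
```

The two nearly identical predictors both receive a large VIF while the unrelated one stays close to 1, which illustrates why a large VIF alone does not say which member of a correlated pair to drop.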
Impacts of multicollinearity
Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, the precision of the predictions, or the goodness-of-fit statistics. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce severe multicollinearity.
Handling multicollinearity
(1) Remove some of the highly correlated independent variables
- note that if there are predictors with large VIF, then there is multicollinearity, but it is not clear which predictor should be removed. A pair of strongly correlated predictors will both have large VIFs, but this does not mean that both should be removed from the model.
- VIF does not take into account the strength or form (e.g. linear or non-linear) of the relationship of either predictor with the response. If two predictors are strongly correlated, then their respective (linear) correlations with the response will be roughly equal, but one may still have a stronger relationship with the response than the other.
(2) Linearly combine the independent variables, such as adding them together
(3) Partial least squares regression (or principal components regression) constructs a set of uncorrelated components from the predictors and uses those components in the model.
(4) LASSO and Ridge regression