HomeTechniques and TipsStatToolsMulticollinearity

# 17.7. Multicollinearity

Applies to:  StatTools; @RISK for Excel; @RISK For Project; @RISK Developer Kit

What is multicollinearity, and how can I use StatTools to test for it?

Short version: In StatTools 7.0 and newer, on the Options tab of the Regression dialog, tick the box for "Check Multicollinearity" and the box to show the correlation matrix. StatTools will calculate a Variance Inflation Factor (VIF) for each independent variable. Large VIF indicate multicollinearity. Look in the correlation matrix to see which pairs of candidate variables are highly correlated. When a pair of variables have large VIF and are highly correlated, you may want to exclude one of the pair from the regression.

What is VIF, and what do we mean by "large VIF"? Wikipedia says, "The square root of the VIF indicates how much larger the standard error is, compared with what it would be if that variable were uncorrelated with the other predictor variables in the model.  Example: If the VIF of a predictor variable were 5.27 (√5.27 = 2.3) this means that the standard error for the coefficient of that predictor variable is 2.3 times as large as it would be if that predictor variable were uncorrelated with the other predictor variables."  In Detecting Multicollinearity Using Variance Inflation Factors, Penn State says, "The general rule of thumb is that VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction."

Long version:

Each coefficient in a regression equation indicates the effect of one independent variable (explanatory variable) on the dependent variable (response variable), provided that the other independent variables in the equation remain constant. You could say that the coefficient represents the effect of this independent variable on the dependent variable in addition to the effects of the other variables in the equation. Therefore, the relationship between an independent variable Xj and the dependent variable Y depends on which other X's are included or not included in the equation.

This is especially true when there is a linear relationship between two or more independent variables, in which case we have multicollinearity. Multicollinearity is defined as "the presence of a fairly strong linear relationship between two or more explanatory variables", and it can make estimation difficult.

Example: This example and text have been adapted for this article from Managerial Statistics by Albright, Winston, Zappe, published by Duxbury Thomson Learning. Contact Palisade Corporation for ordering information, if you like this explanation of multicollinearity.

Consider the attached file. It is a very simple example, but it is serves the purpose of demonstrating the warnings of and how to deal with and recognize multicollinearity. (You need to open the file in StatTools to see all features.)

We want to explain a person's height by means of foot length. The response variable is Height, and the explanatory variables are Right and Left, the lengths of the right foot and the left foot. The question is, "What can occur when we regress Height on both Right and Left?"

To show what can happen numerically, we generated a hypothetical data set of heights and left and right foot lengths. We will use StatTools for the regression analysis, though @RISK can also do regression on input distributions (independent) and outputs (dependent).

On first inspection of this problem, common sense dictates that there is no need to include both Right and Left in an equation for Height. One or the other would be sufficient. In this example, however, we include them to make a point about the dangers of multicollinearity.

After creating a correlation matrix in StatTools with Summary Statistics » Correlation and Covariance, we notice a large correlation between height and foot size. Therefore, we would expect this regression equation to do a good job.  And our intuition is correct; the R² value is 0.817. This R² value is relatively large and would probably cause us to believe the relationship is very strong.

But what about the coefficients of Right and Left? Here is where the problem begins. The coefficient of Right indicates the right foot's effect on Height in addition to the effect of the left foot. That is, after the effect of Left on Height has already been taken into account, the extra information provided by Right is probably minimal. This can go both ways regarding Left and Right.

We created the data set so that except for random error, height is approximately 32 plus 3.2 times foot length (all expressed in inches). As shown in our correlation matrix using StatTools in Height.xls, the correlation between Height and either Right or Left in our data set is quite large, and the correlation between Right and Left is very close to 1.

The regression output when both Right and Left are entered in the equation for Height appears in Heights.xls. This tells a somewhat confusing story. The multiple R and the corresponding R² are about what we would expect, given the correlations between Height and either Right or Left in Height.xls. In particular, the multiple R is close to the correlation between Height and either Right or Left. Also, the Standard Error value is quite good. It implies that predictions of height from this regression equation will typically be off by only about 2 inches.

However, the coefficients of Right and Left are not at all what we might expect, given that we generated heights as approximately 32 plus 3.2 times foot length. In fact, the coefficient of Left is the wrong sign—it is negative! Besides this "wrong" sign, the tip-off that there is a problem is that the t-value of Left is quite small and the corresponding p-value is quite large. We might conclude that Height and Left are either not related or are related negatively. But we know from Height.xls that both of these conclusions are false.

In contrast, the coefficient of Right has the "correct" sign, and its t-value and associated p-value do imply statistical significance, at least at the 5% level. However, this happened mostly by chance. Slight changes in the data could change the results completely—the coefficient of Right could become negative and insignificant, or both coefficients could become insignificant. The problem is that although both Right and Left are clearly related to Height, it is impossible for the least squares method to distinguish their separate effects. Note that the regression equation does estimate the combined effect fairly well—the sum of the coefficients of Right and Left is 6.823 + (-3.645) = 3.178. This is close to the coefficient 3.2 that we used to generate the data. Also, the estimated intercept 31.760 is close to the intercept 32 we used to generate the data. Therefore, the estimated equation will work well for predicting heights. It just does not have reliable estimates of the individual coefficients of Right and Left.

When Right is the only variable in the equation as seen in Heights.xls, it becomes

Predicted Height = 31.546 + 3.195*Right

R² is 81.6%, Standard Error is 2.005, and the t-value and p-value for the coefficient of Right are now 21.34 and 0.0000—very significant. Similarly, when Left is the only variable in the equation, it becomes

Predicted Height = 31.526 + 3.197*Left

R² is 81.1% and Standard Error is 2.033; the t-value and p-value for the coefficient of Left are 20.99 and 0.0000—again, very significant. Clearly, these two equations tell almost identical stories, and they are much easier to interpret than the equation with both Right and Left included.

This example illustrates an extreme form of multicollinearity, where two explanatory variables are very highly correlated. In general, there are various degrees of multicollinearity. In each of them, there is a linear relationship between two or more explanatory variables, and this relationship makes it difficult to estimate the individual effect of the X's on the response variable.

Some common symptoms of multicollinearity can be:

• wrong signs of the coefficients
• smaller-than-expected t-values
• larger-than-expected (insignificant) p-values.

In other words, variables that are really related to the response variable can look like they aren't related, based on their p-values. The reason is that their effects on Y are already explained by other X's in the equation.

Sometimes multicollinearity is easy to spot and treat. For example, it would be silly to include both Right and Left foot length in the equation for Height as seen in our example. They are obviously very highly correlated and only one is needed in the equation for Height. The solution then is to exclude one of them and re-estimate the equation.

However, multicollinearity is not usually this easy to treat or even diagnose. Suppose, for example, that we want to use regression to explain variations in salary. Three potentially useful explanatory variables are age, years of experience in the company, and years of experience in the industry. It is very likely that each of these is positively related to salary, and it is also very likely that they are very closely related to each other. However, it isn't clear which, if any, we should exclude from the regression equation. If we include all three, we are likely to find that at least one of them is insignificant (high p-value), in which case we might consider excluding it from the equation. If we do so, the R-squared and Standard Error values will probably not change very much—the equation will provide equally good predicted values—but the coefficients of the variables that remain in the equation could change considerably.

Last edited: 2017-05-12