t h< How to deal with Multicollinearity using StatTools.
If you recall, a regression equation indicates the effect of each explanatory variable on the response variable, provided that the other variables in the equation remain constant. Another way of stating this is that a coefficient represents the effect of its explanatory variable on the response variable in addition to the effects of the other variables in the equation. Therefore, the relationship between an explanatory variable X and the response variable Y depends on which other X's are or are not included in the equation.
This is especially true when there is a linear relationship between two or more explanatory variables, in which case we have multicollinearity. Multicollinearity is defined as the presence of a fairly strong linear relationship between two or more explanatory variables, and it can make estimation difficult.
Consider the following example. It is a very simple one, but it definitely serves the purpose of demonstrating how to recognize, and deal with, the dangers of multicollinearity.
The Problem
We want to explain a person's height by means of foot length. The response is Height, and the explanatory variables are Right and Left, the length of the right foot and the left foot, respectively.
The question we can ask ourselves is: what can occur when we regress Height on both Right and Left?
To show what can happen numerically, we generated a hypothetical data set of heights and left and right foot lengths. To perform the regression analysis, we will use StatTools, Palisade's new Microsoft Excel add-in.
On first inspection of this problem, common sense dictates that there is no need to include both Right and Left in an equation for Height. We could choose either one, right or left, and it would be sufficient. In this example, however, we include both to make a point about the dangers of multicollinearity.
After creating a correlation matrix in StatTools, we notice there is a large correlation between height and foot size. Therefore, we would expect this regression equation to do a good job.
Our intuition is correct; the R-squared value is .817. This value is relatively large and would probably lead us to believe the relationship is very strong.
But what about the coefficients of Right and Left?
Here is where the problem begins. The coefficient of Right indicates the right foot's effect on Height in addition to the effect of the left foot. That is, after the effect of Left on Height has already been taken into account, the extra information provided by Right is probably minimal. The same argument applies with Left and Right reversed.
We created the data set so that, except for random error, height is approximately 32 plus 3.2 times foot length (all expressed in inches). As shown in the correlation matrix produced by StatTools in Heights.xls, the correlation between Height and either Right or Left in our data set is quite large, and the correlation between Right and Left is very close to 1.
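A data set with these properties can be sketched in Python (a hypothetical reconstruction, not the original Heights.xls data; the seed, sample size, and noise levels are assumptions):

```python
import numpy as np

# Hypothetical reconstruction of the data-generating process described
# in the article; the seed, sample size, and noise levels are assumptions.
rng = np.random.default_rng(0)
n = 100                                   # assumed sample size
right = rng.normal(11, 1.5, n)            # right foot length, inches
left = right + rng.normal(0, 0.1, n)      # left foot differs only slightly
height = 32 + 3.2 * right + rng.normal(0, 2, n)  # plus random error

# Correlation matrix: Height vs. either foot is large; Right vs. Left
# is very close to 1, the hallmark of multicollinearity.
print(np.corrcoef([height, right, left]).round(3))
```

The near-unit correlation between Right and Left is exactly the condition that will destabilize the two-variable regression below.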
The regression output when both Right and Left are entered in the equation for Height appears in Heights.xls. It tells a somewhat confusing story. The multiple R and the corresponding R-squared are about what we would expect, given the correlations between Height and either Right or Left. In particular, the multiple R is close to the correlation between Height and either Right or Left. Also, the Standard Error value is quite good: it implies that predictions of height from this regression equation will typically be off by only about 2 inches.
However, the coefficients of Right and Left are not at all what we might expect, given that we generated heights as approximately 32 plus 3.2 times foot length. In fact, the coefficient of Left has the wrong sign: it is negative! Besides this wrong sign, the tip-off that there is a problem is that the t-value of Left is quite small and the corresponding p-value is quite large. We might conclude that Height and Left are either not related or are related negatively. But we know from Heights.xls that both of these conclusions are false. In contrast, the coefficient of Right has the correct sign, and its t-value and associated p-value do imply statistical significance, at least at the 5% level. However, this happened mostly by chance. Slight changes in the data could change the results completely: the coefficient of Right could become negative and insignificant, or both coefficients could become insignificant.
The problem is that although both Right and Left are clearly related to Height, it is impossible for the least squares method to distinguish their separate effects. Note that the regression equation does estimate the combined effect fairly well: the sum of the coefficients of Right and Left is 6.823 + (-3.645) = 3.178, which is close to the coefficient of 3.2 we used to generate the data. Also, the estimated intercept of 31.760 is close to the intercept of 32 we used to generate the data. Therefore, the estimated equation will work well for predicting heights; it just does not provide reliable estimates of the individual coefficients of Right and Left.
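This instability can be reproduced with ordinary least squares on data generated the same way (a sketch under the same hypothetical assumptions as before; the individual coefficients will differ from the article's 6.823 and -3.645 because the data are regenerated, but their sum again lands near 3.2):

```python
import numpy as np

# Hypothetical data as described: height ~ 32 + 3.2 * foot length,
# with Right and Left almost perfectly correlated.
rng = np.random.default_rng(0)
n = 100                                   # assumed sample size
right = rng.normal(11, 1.5, n)
left = right + rng.normal(0, 0.1, n)
height = 32 + 3.2 * right + rng.normal(0, 2, n)

# Regress Height on both Right and Left via ordinary least squares.
# The individual coefficients are unstable, but their sum recovers
# the combined effect of foot length (about 3.2).
X = np.column_stack([np.ones(n), right, left])
beta, *_ = np.linalg.lstsq(X, height, rcond=None)
intercept, b_right, b_left = beta
print(f"intercept={intercept:.3f}, Right={b_right:.3f}, "
      f"Left={b_left:.3f}, sum={b_right + b_left:.3f}")
```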
When Right is the only variable in the equation as seen in Heights.xls, it becomes
Predicted Height = 31.546 + 3.195*Right
The R-squared and Standard Error values are 81.6% and 2.005, and the t-value and p-value for the coefficient of Right are now 21.34 and 0.0000, very significant. Similarly, when Left is the only variable in the equation, it becomes
Predicted Height = 31.526 + 3.197*Left
The R-squared and Standard Error values are 81.1% and 2.033, and the t-value and p-value for the coefficient of Left are 20.99 and 0.0000, again very significant. Clearly, both of these equations tell almost identical stories, and they are much easier to interpret than the equation with both Right and Left included.
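Fitting each foot length on its own (same hypothetical data-generating process as above) gives stable, nearly identical equations:

```python
import numpy as np

# Same hypothetical data-generating process as described in the article.
rng = np.random.default_rng(0)
n = 100                                   # assumed sample size
right = rng.normal(11, 1.5, n)            # right foot length, inches
left = right + rng.normal(0, 0.1, n)      # left foot is nearly identical
height = 32 + 3.2 * right + rng.normal(0, 2, n)

# One predictor at a time: each slope lands close to the true 3.2,
# and the two fitted equations tell nearly identical stories.
fits = {}
for name, x in [("Right", right), ("Left", left)]:
    X = np.column_stack([np.ones(n), x])
    (b0, b1), *_ = np.linalg.lstsq(X, height, rcond=None)
    fits[name] = (b0, b1)
    print(f"Predicted Height = {b0:.3f} + {b1:.3f}*{name}")
```

Because Right and Left carry essentially the same information, either single-variable equation predicts about as well as the two-variable one, without the unstable coefficients.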
This example illustrates an extreme form of multicollinearity, where two explanatory variables are very highly correlated. In general, there are various degrees of multicollinearity. In each of them, there is a linear relationship between two or more explanatory variables, and this relationship makes it difficult to estimate the individual effects of the X's on the response variable.
Some common symptoms of multicollinearity are:
1. wrong signs of the coefficients,
2. smaller-than-expected t-values, and
3. larger-than-expected (insignificant) p-values.
In other words, variables that are really related to the response variable can look like they aren't related, based on their p-values. The reason is that their effects on Y are already explained by other X's in the equation. Sometimes multicollinearity is easy to spot and treat. For example, it would be silly to include both Right and Left foot length in the equation for Height, as seen in our example. They are obviously very highly correlated, and only one is needed in the equation for Height.
The solution, then, is to exclude one of them and re-estimate the equation. However, multicollinearity is not usually this easy to treat or even diagnose. Suppose, for example, that we want to use regression to explain variations in salary. Three potentially useful explanatory variables are age, years of experience in the company, and years of experience in the industry. It is very likely that each of these is positively related to salary, and it is also very likely that they are closely related to each other. However, it isn't clear which, if any, we should exclude from the regression equation. If we include all three, we are likely to find that at least one of them is insignificant (high p-value), in which case we might consider excluding it from the equation. If we do so, the R-squared and Standard Error values will probably not change very much (the equation will provide equally good predicted values), but the coefficients of the variables that remain in the equation could change considerably.
**This example and text have been adapted from Managerial Statistics by Albright, Winston, and Zappe, published by Duxbury Thomson Learning, for the purpose of this newsletter. Contact Palisade Corporation for ordering details if you like this explanation of multicollinearity.
[The Heights.xls correlation matrix and regression output tables appear here in the workbook; only the embedded StatTools cell notes are recoverable. Their column labels are inferred from the output headers.]

StatTools notes on the regression output:
Multiple R: This is the correlation between the actual Y values and the fitted Y values.
R-Square: This is the percentage of variation in the dependent variable explained by the regression.
Adjusted R-Square: This value is useful as a monitor of new variables as they are added to the equation. If the Adjusted R-Square decreases, new variables should probably be omitted.
StErr of Estimate: The approximate standard deviation of the residuals. This is an estimate of prediction errors made by predicting new Y values from this regression equation.
F-Ratio: Ratio of the explained variation to the unexplained variation. It is large if the regression explains any significant amount. It has an F distribution under the null hypothesis of no explanatory power.
p-Value (ANOVA): Reject the null hypothesis of no explanatory power at all if this p-value is small.
Coefficient: This column spells out the regression equation.
Standard Error: This is an indication of how much the regression coefficients would vary from sample to sample.
t-Value: Ratio of the coefficient to its standard error. It is used to test the null hypothesis that the coefficient is 0, so that the variable has no effect on Y. It has a t distribution under the null hypothesis that the coefficient is 0.
p-Value (coefficient): Reject the null hypothesis that the coefficient is 0 if the corresponding p-value is small. Small p-values indicate the corresponding variables "belong" in the equation.
Z
XP|?~ix]4@(
4KBX1hk
\<]StatTools Note:
This is the correlation between the actual Y values and the fitted Y values.<
0\
eg
ZD
XP|?ix]4@D)
J2ь
v<wStatTools Educational Note:
This is the percentage of variation in the dependent variable explained by the regression.<
v
Z
XP|?i x]4
@4+
,6-KFFt
<StatTools Educational Note:
This value is useful as a monitor of new variables as they are added to the equation. If the Adjusted R-Square decreases, new variables should probably be omitted.<
ah a
Z$
XP|?i
x]4@$t,
}îCܱ
<StatTools Educational Note:
The approximate standard deviation of the residuals. This is an estimate of prediction errors made by predicting new Y values from this regression equation.<
om T
om
Z
XP|?i
"x]4@-
9@EuŇ,
<StatTools Educational Note:
Ratio of the explained variation to the unexplained variation. It is large if the regression explains any significant amount. It has an F distribution under the null hypothesis of no explanatory power.<
0eg
Z
XP|?i"x]4@.
QQIUK
k<lStatTools Educational Note:
Reject null hypothesis of no explanatory power at all if this p-value is small.<
eekףp=
ZP
XP|?~ i'x]4@P40
?xCk
K<LStatTools Educational Note:
This column spells out the regression equation.<
Keg
Z
XP|?!i(x]4@t1
jh
ДE{MB
{<|StatTools Educational Note:
This is an indication of how much the regression coefficients would vary from sample to sample.<
{
Z
XP|? i 'x]4@2
zeJ_ysP
<StatTools Educational Note:
Ratio of the coefficient to its standard error. It is used to test the null hypothesis that the coefficient is 0, so that the variable has no effect on Y. It has a t distribution under the null hypothesis that the coefficient is 0.<
Z|
XP|? i
'x]4@|3
Yg
6;WBBK
<StatTools Educational Note:
Reject the null hypothesis that the coefficient is 0 if the corresponding p-value is small. Small p-values indicate the corresponding variables "belong" in the equation.<
i(R@
Z
XP|?~(i/x]4@45
c{´LO`
\<]StatTools Note:
This is the correlation between the actual Y values and the fitted Y values.<
0 \T
ZD
XP|?'i.x]4 @Dt6
3܉FQC
v<wStatTools Educational Note:
This is the percentage of variation in the dependent variable explained by the regression.<
0 vT
!
Z
XP|?(i /x]4!@7
dW+CG]
<StatTools Educational Note:
This value is useful as a monitor of new variables as they are added to the equation. If the Adjusted R-Square decreases, new variables should probably be omitted.<
0 0T
"
Z
XP|?(i
/x]4"@8
Z:YJEu
<StatTools Educational Note:
The approximate standard deviation of the residuals. This is an estimate of prediction errors made by predicting new Y values from this regression equation.<
0 T
#
Zp
XP|?+i
2x]4#@p4:
e]RM+
<StatTools Educational Note:
Ratio of the explained variation to the unexplained variation. It is large if the regression explains any significant amount. It has an F distribution under the null hypothesis of no explanatory power.<
0 .T
$
Z
XP|?+i2x]4$@t;
+C=
k<lStatTools Educational Note:
Reject null hypothesis of no explanatory power at all if this p-value is small.<
0 kT
%
Z8
XP|?~0i7x]4%@8<
)(!F3!'
K<LStatTools Educational Note:
This column spells out the regression equation.<
0 KT
&
Z
XP|?1i8x]4&@=
ʋIIjh[%
{<|StatTools Educational Note:
This is an indication of how much the regression coefficients would vary from sample to sample.<
0 {T
'
Z
XP|?0i 7x]4'@4?
qY{EFXfw
<StatTools Educational Note:
Ratio of the coefficient to its standard error. It is used to test the null hypothesis that the coefficient is 0, so that the variable has no effect on Y. It has a t distribution under the null hypothesis that the coefficient is 0.<
0 T
(
Zd
XP|?0i
7x]4(@dt@
2W.K|z
<StatTools Educational Note:
Reject the null hypothesis that the coefficient is 0 if the corresponding p-value is small. Small p-values indicate the corresponding variables "belong" in the equation.<
0 TPalisadePalisadePalisadePalisadePalisade PalisadePalisade Palisade
PalisadePalisadePalisadePalisade
PalisadePalisadePalisade Palisade!Palisade!Palisade!Palisade"Palisade( Palisade)Palisade)!Palisade)"Palisade,#Palisade, $Palisade1%Palisade1'Palisade1(Palisade2&Palisade>@(((
!"!"!"!!
(),-,- 12121211
7
W
dMbP?_*+%M
Phaser 6250DP4 S
odXXLetterPRIV0''''p\KhCIpXORXHWaterMarkHelvetica"dXX??U}}
!{Gz'@ <g[k
~
?
?
~
@
@
!Q@ [k
@
!(\B+@ [k
@
!!p=
ף)@ [k
"*h3+Z3Z3Z3
">@7
\pBill [BbData, ST_Height;k*ST_Left;k+ST_Right;k@,STWBD_StatToolsCorrAndCovar_CorrelationTableTRUE@+STWBD_StatToolsCorrAndCovar_CovarianceTableFALSE>*STWBD_StatToolsCorrAndCovar_HasDefaultInfoTRUE<*STWBD_StatToolsCorrAndCovar_TableStructure 29(STWBD_StatToolsCorrAndCovar_VariableListG*STWBD_StatToolsCorrAndCovar_VariableList_1
UVG387AE4D9G*STWBD_StatToolsCorrAndCovar_VariableList_2
UVG2B59F32FF*STWBD_StatToolsCorrAndCovar_VariableList_3UVG74D9354N5STWBD_StatToolsCorrAndCovar_VarSelectorDefaultDataSet DG1D071105#STWBD_StatToolsRegression_blockList-1=)STWBD_StatToolsRegression_ConfidenceLevel .95;'STWBD_StatToolsRegression_FValueToEnter 2.2;'STWBD_StatToolsRegression_FValueToLeave 1.1M8STWBD_StatToolsRegression_GraphFittedValueVsActualYValueFALSEG2STWBD_StatToolsRegression_GraphFittedValueVsXValueFALSEI4STWBD_StatToolsRegression_GraphResidualVsFittedValueFALSED/STWBD_StatToolsRegression_GraphResidualVsXValueFALSE<(STWBD_StatToolsRegression_HasDefaultInfoTRUE@+STWBD_StatToolsRegression_IncludePredictionFALSE;&STWBD_StatToolsRegression_IncludeStepsFALSE:(STWBD_StatToolsRegression_NumberOfBlocks 0;'STWBD_StatToolsRegression_pValueToEnter .05:'STWBD_StatToolsRegression_pValueToLeave .1:(STWBD_StatToolsRegression_RegressionType 0<'STWBD_StatToolsRegression_throughOriginFALSE8#STWBD_StatToolsRegression_useFValueFALSE7#STWBD_StatToolsRegression_usePValueTRUEH+STWBD_StatToolsRegression_VariableDependent
UVG387AE4D9B1STWBD_StatToolsRegression_VariableListIndependentO3STWBD_StatToolsRegression_VariableListIndependent_1UVG74D9354L3STWBD_StatToolsRegression_VarSelectorDefaultDataSet DG1D07110=h;#
8X@"1Arial1Arial1Arial1Arial1Arial1Arial1Arial1Arial1Tahoma1Tahoma1Arial1Arial1Arial1Arial"$"#,##0_);\("$"#,##0\)"$"#,##0_);[Red]\("$"#,##0\) "$"#,##0.00_);\("$"#,##0.00\)%""$"#,##0.00_);[Red]\("$"#,##0.00\)5*2_("$"* #,##0_);_("$"* \(#,##0\);_("$"* "-"_);_(@_),))_(* #,##0_);_(* \(#,##0\);_(* "-"_);_(@_)=,:_("$"* #,##0.00_);_("$"* \(#,##0.00\);_("$"* "-"??_);_(@_)4+1_(* #,##0.00_);_(* \(#,##0.00\);_(* "-"??_);_(@_) 0.00000.000[<0.0001]"< 0.0001";0.0000 + ) , * ! ! #p( 6@#p( #p( ` @ d @ d @ ` @ d @ d @ ` @ d @ d @ " 1" 1"< 1! 1! 1!< " " " " 1 0 1& Hyperlink83ffff̙3f3fff3f3f33333f33333Instructions?Data|_STDS_DG1D07110
T
dMbP?_*+%M CanonPCL6
0<h
odXLetter`
0{0|6
0CanonY Canon iR1600-2000 PCL6ddd
d d
dd
dd@@d d
d dd d"edd
ddd
d!!d
!"#$ddA
A d
C8 o
**E
o
**E
AXXd2\SRGBCO~1.ICM\SRGBCO~1.ICM\SRGBCO~1.ICM CONFIDENTIALCONFIDENTIALHArialDefault Settings<"dX??U
] k A@t hpHow to deal with Multicollinearity using StatTools.
If you recall a regression equation indicates the effect of explanatory variables on the response variables, provided that the other variables in the equation remain constant. Another way of stating this is that the coefficient represents the effect of this explanatory variable on the response variable in addition to the effects of the other variables in the equation. Therefore, the relationship between an explanatory variable X and the response variable Y depends on which other X ' s are included or not included in the equation.
This is especially true when there is a linear relationship between two or more explanatory variables, in which case we have multicollinearity. Multicollinearity, is defined as the presence of a fairly strong linear relationship between two or more explanatory variables, and it can make estimation difficult.
Consider the following example. It is a very simple example, but it definitely serves the purpose of demonstrating the warnings of and how to deal with and recognize multicollinearity.
The Problem
We want to explain a persons height by means of foot length. The response is Height, and the explanatory variables are Right and Left, the length of the right foot and the left foot, respectively.
The question we can ask ourselves is What can occur when we regress Height on both Right and Left?
To show what can happen numerically, we generated a hypothetical data set of heights and left and right foot lengths. To perform the analysis we will use StatTools, Palisades new Microsoft Excel Add-in, for the regression analysis.
On first inspection of this problem, common sense dictates that there is no need to include both Right and Left in an equation for Height. We could choose either one right or left and either one would be sufficient. In this example, however, we include t>3gHRS4
V
$4+Q"t < hem to make a point about the dangers of multicollinearity.
After creating a correlation matrix in StatTools we notice there is a large correlation between height and foot size. Therefore we would expect this regression equation to do a good job.
Our intuition is correct; the R-squared value is .817. This R-squared value is relatively large and would then probably cause us to believe the relationship is very strong.
But what about the coefficients of Right and Left?
Here is where the problem begins. The coefficient of Right indicates the right foots effect on Height in addition to the effect of the left foot. That is, after the effect of Left on Height has already been taken into account, the extra information provided by Right is probably minimal. This can go both ways regarding left and right.
We created the dataset so that except for random error, height is approximately 32 plus 3.2 times foot length (all expressed in inches). As shown in our correlation matrix using StatTools in Heights.xls, the correlation between Height and either Right or Left in our data set is quite large, and the correlation between Right and Left is very close to 1.
The regression output when both Right and Left entered in the equation for Height appears in Heights.xls. This tells a somewhat confusing story. The multiple R and the corresponding R-squared are about what we would expect, given the correlations between Height and either Right or Left in Heights.xls. In particular, the multiple R is close to the correlation between Height and either Right or Left. Also, the Standard Error value is quite good. It implies that predictions of height from this regression equation will typically be off by only about 2 inches.
However, the coefficients of Right and Left are not at all what we might expect, given that we generated heights as approximately 32 plus 3.2 times foot length. In fact, the coefficient of Left is the wrong sign-it is negative! Besides this wrong sign, the tip-off that there is a problem is that the t-value of Left is quite small and the c< orresponding p-value is quite large. We might conclude that Height and Left are either not related or are related negatively. But we know from Heights.xls that both of these conclusions are false. In contrast, the coefficient of Right has the correct sign, and its f -value and associated p-value do imply statistical significance, at least at the 5% level. However, this happened mostly by chance. Slight changes in the data could change the results completely-the coefficient of Right could become negative and insignificant, or both coefficients could become insignificant. The problem is that although both Right and Left are clearly related to Height, it is impossible for the least squares method to distinguish their separate effects. Note that the regression equation does estimate the combined effect fairly well-the sum of the coefficients of Right and Left is 6.823 + (-3.645) = 3.178. This is close to the coefficient 3.2, what we used to generate the data. Also, the estimated intercept 31.760 is close to the intercept 32 we used to generate the data. Therefore, the estimated equation will work well for predicting heights. It just does not have reliable estimates of the individual coefficients of Right and Left.
When Right is the only variable in the equation as seen in Heights.xls, it becomes
Predicted Height = 31.546 + 3.195*Right
The R-squared and Standard Error values are 81.6% and 2.005, and the t-value and p-value for the coefficient of Right are now 21 .34 and 0.0000-very significant. Similarly, when Left is the only variable in the equation, it becomes
Predicted Height = 31.526 + 3.197*Left
The R-squared and Standard Error values are 81.1% and 2.033, and the f-value and y-value for the coefficient of Left are 20.99 and 0.0000-again very significant. Clearly, both of these equations tell almost identical stories, and they are much easier to interpret than the equation with both Right and Left included.
This example illustrates an extreme form of multicollinearity, where two explanatory variables are very highly correlated. In g< eneral, there are various degrees of multicollinearity. In each of them, there is a linear relationship between two or more explanatory variables, and this relationship makes it difficult to estimate the individual effect of the X s on the response variable.
Some common symptoms of multicollinearity can be:
1. Wrong signs of the coefficients,
2. smaller-than-expected t-values,
3. and larger-than-expected (insignificant) p-values.
In other words, variables that are really related to the response variable can look like they arent related, based on their p-values. The reason is that their effects on Y are already explained by other X s in the equation. Sometimes multicollinearity is easy to spot and treat. For example, it would be silly to include both Right and Left foot length in the equation for Height as seen in our example. They are obviously very highly correlated and only one is needed in the equation for Height.
The solution then is to exclude one of them and re-estimate the equation. However, multicollinearity is not usually this easy to treat or even diagnose. Suppose, for example, that we want to use regression to explain variations in salary. Three potentially useful explanatory variables are age, years of experience in the company, and years of experience in the industry. It is very likely that each of these is positively related to salary, and it is also very likely that they are very closely related to each other. However: it isnt clear which, if any, we should exclude from the regression equation. If we include all three, we are likely to find that at least one of them is insignificant (high p-value), in which case we might consider excluding it from the equation. If we do so, the R-squared and Standard Error values will probably not change very much-the equation will provide equally good predicted values-but the coefficients of the variables that remain in the equation could change considerably.
**This example and text has been adapted from Managerial Statistics by Albright, Winston, Zappe published by Duxbu<ry Thomson Learning for the purpose of this newsletter. Contact Palisade Corporation for details in ordering if you like this explanation of multicollinearity.
=h;#
8X>
nA6R^'eg
dMbP?_*+%MHP LaserJet IIPg for the purpos@g,, Corporation for details in ord@MSUDHP LaserJet IIPd
"/,,??:#StatCorrelationCoeffU}tC}tC
n @@ @@@@ @ @
@@@
@@@
@
@
@
@
@
@
@
@
@
@
@
@
@
Relationship between height and left and right foot lengths
EFFECT OF MULTICOLLINEARITY

[Three StatTools regression outputs for the Height Data set follow; the numeric values are not recoverable from this extraction. Each output contains a Summary (Multiple R, R-Square, Adjusted R-Square, StErr of Estimate), an ANOVA Table (Degrees of Freedom, Sum of Squares, Mean of Squares, F-Ratio, p-Value), and a Regression Table (Coefficient, Standard Error, t-Value, p-Value, 95% Confidence Interval Lower/Upper): Height regressed on Right and Left together, on Right alone, and on Left alone. A Correlation Table for Height, Right, and Left accompanies the first regression.]
StatTools Educational Note: This is the percentage of variation in the dependent variable explained by the regression.

StatTools Note: This is the correlation between the actual Y values and the fitted Y values.

StatTools Educational Note: This value is useful as a monitor of new variables as they are added to the equation. If the Adjusted R-Square decreases, new variables should probably be omitted.

StatTools Educational Note: The approximate standard deviation of the residuals. This is an estimate of prediction errors made by predicting new Y values from this regression equation.

StatTools Educational Note: Ratio of the explained variation to the unexplained variation. It is large if the regression explains any significant amount. It has an F distribution under the null hypothesis of no explanatory power.

StatTools Educational Note: Reject the null hypothesis of no explanatory power at all if this p-value is small.

StatTools Educational Note: This column spells out the regression equation.

StatTools Educational Note: Ratio of the coefficient to its standard error. It is used to test the null hypothesis that the coefficient is 0, so that the variable has no effect on Y. It has a t distribution under the null hypothesis that the coefficient is 0.

StatTools Educational Note: Reject the null hypothesis that the coefficient is 0 if the corresponding p-value is small. Small p-values indicate the corresponding variables "belong" in the equation.

StatTools Educational Note: This is an indication of how much the regression coefficients would vary from sample to sample.
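The quantities the notes above describe can be computed directly from a fitted regression. This is a hedged sketch in plain NumPy on simulated data (not StatTools output; all names and numbers are ours): it decomposes the total variation into explained and unexplained sums of squares, then forms the F-ratio and t-values as exactly the ratios the notes define.

```python
# Sketch of the ANOVA and Regression Table quantities on hypothetical data.
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 2
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)

# Fit OLS with an intercept.
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
fitted = X1 @ beta
resid = y - fitted

# Sum-of-squares decomposition: total = explained + unexplained.
ss_total = np.sum((y - y.mean()) ** 2)
ss_explained = np.sum((fitted - y.mean()) ** 2)
ss_unexplained = np.sum(resid ** 2)

# F-Ratio: explained variation per degree of freedom over unexplained
# variation per degree of freedom (large under real explanatory power).
df_reg, df_err = k, n - k - 1
F = (ss_explained / df_reg) / (ss_unexplained / df_err)

# t-Values: each coefficient divided by its standard error.
sigma2 = ss_unexplained / df_err
cov = sigma2 * np.linalg.inv(X1.T @ X1)
t_values = beta / np.sqrt(np.diag(cov))

print(f"F-Ratio = {F:.1f}")
print("t-Values:", np.round(t_values, 2))
```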