Multiple Regression
Analysis and Interpretation
Pair Plot to see the correlations among variables
pairs(attitude)
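The pair plot gives a visual impression only. As a small complement that is not part of the original analysis, the same pairwise relationships can be summarised numerically with cor(). The name cor_mat and the rounding to two decimals are choices made here for display.

```r
# attitude ships with base R (datasets package), so no library() call is needed
cor_mat <- cor(attitude)

# Pairwise Pearson correlations between all seven variables
print(round(cor_mat, 2))

# Correlation of each predictor with the response (rating is column 1),
# sorted from strongest to weakest
print(sort(cor_mat["rating", -1], decreasing = TRUE))
```

Predictors that correlate strongly with rating are natural candidates to keep when the models are compared.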
Find the best model
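# Full model: rating regressed on all six predictors
out=lm(rating~., data=attitude)
# Reduced models, nested inside the full model
out2=lm(rating~complaints+learning+advance, data=attitude)
out3=lm(rating~complaints+learning, data=attitude)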
anova(out3, out2, out)
## Analysis of Variance Table
##
## Model 1: rating ~ complaints + learning
## Model 2: rating ~ complaints + learning + advance
## Model 3: rating ~ complaints + privileges + learning + raises + critical +
## advance
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     27 1254.7
## 2     26 1179.1  1    75.540 1.5121 0.2312
## 3     23 1149.0  3    30.109 0.2009 0.8947
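attitude is a data set that ships with R (in the datasets package). We use it as an example of how to find the best model step by step. Calling anova(small model, large model) on nested models compares them with an F-test: a large p-value means the extra terms in the larger model do not significantly reduce the residual sum of squares.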
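To make the F-test less of a black box, the statistic for the first comparison can be rebuilt by hand. This is only a sketch: the names small, mid, full, f_stat and p_val are introduced here for illustration, and it relies on anova() using the residual mean square of the largest model as the denominator when several models are passed.

```r
# Refit the three nested models under illustrative names
small <- lm(rating ~ complaints + learning, data = attitude)
mid   <- lm(rating ~ complaints + learning + advance, data = attitude)
full  <- lm(rating ~ ., data = attitude)

# Denominator: residual mean square of the largest model
ms_full <- deviance(full) / df.residual(full)

# Numerator: drop in RSS per added parameter when going from small to mid
df_diff <- df.residual(small) - df.residual(mid)
f_stat  <- ((deviance(small) - deviance(mid)) / df_diff) / ms_full

# Upper-tail probability of the F distribution
p_val <- pf(f_stat, df_diff, df.residual(full), lower.tail = FALSE)

print(c(F = f_stat, p = p_val))
```

The two numbers should agree with the second row of the anova(out3, out2, out) table.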
summary(out3)
##
## Call:
## lm(formula = rating ~ complaints + learning, data = attitude)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11.5568  -5.7331   0.6701   6.5341  10.3610
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.8709     7.0612   1.398    0.174
## complaints    0.6435     0.1185   5.432 9.57e-06 ***
## learning      0.2112     0.1344   1.571    0.128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.817 on 27 degrees of freedom
## Multiple R-squared: 0.708, Adjusted R-squared: 0.6864
## F-statistic: 32.74 on 2 and 27 DF, p-value: 6.058e-08
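In this example, neither adding advance (p = 0.2312) nor adding the remaining three predictors (p = 0.8947) significantly improves the fit. The smallest model, out3 (rating ~ complaints + learning, listed as Model 1 in the ANOVA table), is therefore the best of the three.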
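Written out with the estimates from the summary(out3) output above, the fitted model is approximately:

```latex
\[
\widehat{\mathrm{rating}} = 9.8709 + 0.6435 \cdot \mathrm{complaints} + 0.2112 \cdot \mathrm{learning}
\]
```

Read this way, with learning held fixed, a one-point increase in complaints is associated with an estimated increase of about 0.64 points in rating.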
Another example
The birth weight data (birthwt) come from the MASS package. To check which model is better, we run anova() on two nested models as before.
library(MASS)
# Full model
out=lm(bwt~age+lwt+factor(race)+smoke+ptl+ht+ui, data=birthwt)
# Reduced model without age and ptl
out2=lm(bwt~lwt+factor(race)+smoke+ht+ui, data=birthwt)
anova(out2, out)
## Analysis of Variance Table
##
## Model 1: bwt ~ lwt + factor(race) + smoke + ht + ui
## Model 2: bwt ~ age + lwt + factor(race) + smoke + ptl + ht + ui
##   Res.Df      RSS Df Sum of Sq      F Pr(>F)
## 1    182 75937505
## 2    180 75741025  2    196480 0.2335  0.792
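Dropping the two variables ptl and age does not significantly worsen the fit (F = 0.2335, p = 0.792), so we prefer the smaller model out2. Its ANOVA table and summary show that every remaining term is significant at the 5% level.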
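The F statistic in the anova(out2, out) table can be checked by hand from the two residual sums of squares. As a sketch, using the values printed in that output:

```latex
\[
F = \frac{(\mathrm{RSS}_{\mathrm{small}} - \mathrm{RSS}_{\mathrm{large}}) / (\mathrm{df}_{\mathrm{small}} - \mathrm{df}_{\mathrm{large}})}
         {\mathrm{RSS}_{\mathrm{large}} / \mathrm{df}_{\mathrm{large}}}
  = \frac{(75937505 - 75741025) / (182 - 180)}{75741025 / 180}
  = \frac{98240}{420783.5} \approx 0.2335
\]
```

An F value this far below 1 on 2 and 180 degrees of freedom corresponds to the reported p-value of 0.792.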
anova(out2)
summary(out2)
## Analysis of Variance Table
##
## Response: bwt
##               Df   Sum Sq Mean Sq F value    Pr(>F)
## lwt            1  3448639 3448639  8.2654 0.0045226 **
## factor(race)   2  5076610 2538305  6.0836 0.0027701 **
## smoke          1  6281818 6281818 15.0557 0.0001458 ***
## ht             1  2871867 2871867  6.8830 0.0094402 **
## ui             1  6353218 6353218 15.2268 0.0001341 ***
## Residuals    182 75937505  417239
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## Call:
## lm(formula = bwt ~ lwt + factor(race) + smoke + ht + ui, data = birthwt)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1842.14  -433.19    67.09   459.21  1631.03
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    2837.264    243.676  11.644  < 2e-16 ***
## lwt               4.242      1.675   2.532 0.012198 *
## factor(race)2  -475.058    145.603  -3.263 0.001318 **
## factor(race)3  -348.150    112.361  -3.099 0.002254 **
## smoke          -356.321    103.444  -3.445 0.000710 ***
## ht             -585.193    199.644  -2.931 0.003810 **
## ui             -525.524    134.675  -3.902 0.000134 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 645.9 on 182 degrees of freedom
## Multiple R-squared: 0.2404, Adjusted R-squared: 0.2154
## F-statistic: 9.6 on 6 and 182 DF, p-value: 3.601e-09
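The rows factor(race)2 and factor(race)3 in the coefficient table come from dummy coding of race, which is why one variable takes 2 degrees of freedom in the ANOVA table. As a hedged illustration (the name X is introduced here and is not part of the original analysis), model.matrix() shows the indicator columns that lm() builds:

```r
library(MASS)  # provides the birthwt data

# race is stored as an integer code with three values
print(table(birthwt$race))

# factor() expands it into indicator columns; the first level is the baseline
# and is absorbed into the intercept
X <- model.matrix(~ factor(race), data = birthwt)
print(colnames(X))
print(head(X))
```

Each of the two race coefficients is then the estimated difference in mean birth weight relative to the baseline level, with the other predictors held fixed.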
Written on February 17, 2020