Material of Tue 19 March 2019, Week 3
First import the data and investigate it:
adv <- read.csv(file = "Datasets/Advertising.csv", header = TRUE, sep = ",")
head(adv)
##      TV Radio Newspaper Sales
## 1 230.1  37.8      69.2  22.1
## 2  44.5  39.3      45.1  10.4
## 3  17.2  45.9      69.3   9.3
## 4 151.5  41.3      58.5  18.5
## 5 180.8  10.8      58.4  12.9
## 6   8.7  48.9      75.0   7.2

dim(adv)
## [1] 200   4

summary(adv)
##        TV             Radio          Newspaper          Sales      
##  Min.   :  0.70   Min.   : 0.000   Min.   :  0.30   Min.   : 1.60  
##  1st Qu.: 74.38   1st Qu.: 9.975   1st Qu.: 12.75   1st Qu.:10.38  
##  Median :149.75   Median :22.900   Median : 25.75   Median :12.90  
##  Mean   :147.04   Mean   :23.264   Mean   : 30.55   Mean   :14.02  
##  3rd Qu.:218.82   3rd Qu.:36.525   3rd Qu.: 45.10   3rd Qu.:17.40  
##  Max.   :296.40   Max.   :49.600   Max.   :114.00   Max.   :27.00

str(adv)
## 'data.frame':    200 obs. of  4 variables:
##  $ TV       : num  230.1 44.5 17.2 151.5 180.8 ...
##  $ Radio    : num  37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
##  $ Newspaper: num  69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
##  $ Sales    : num  22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...

From this we can tell that there is one output variable, three input variables and 200 observations.
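Before plotting, it is also worth confirming that the data are complete. A minimal sketch of a missing-value check (using a small synthetic frame in place of `adv`, so it runs standalone):

```r
# Quick data-quality check: confirm there are no missing values
# (synthetic stand-in frame, not the real Advertising data)
df <- data.frame(TV        = c(230.1, 44.5, 17.2),
                 Radio     = c(37.8, 39.3, 45.9),
                 Newspaper = c(69.2, 45.1, 69.3),
                 Sales     = c(22.1, 10.4, 9.3))
anyNA(df)            # FALSE when the data are complete
colSums(is.na(df))   # per-column count of missing values
```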
pairs(x = adv)

In order to use corrplot, first create a correlation matrix with cor(adv), then feed that matrix to corrplot via corrplot(cor(adv)):
library(corrplot)   # the corrplot package must be loaded first
corMat <- cor(x = adv)
corrplot(corr = corMat, method = "ellipse", type = "lower")
From this we can see that there is a significant amount of correlation between Radio and Newspaper (more so even than between Newspaper and Sales); we should consider this variable interaction when deciding upon our model.
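The correlation behind any one ellipse can also be read off numerically by calling cor() on two columns. A minimal sketch on synthetic data (the columns here are stand-ins, not the real Advertising figures):

```r
# Two synthetic columns constructed to be positively correlated,
# mimicking the Radio/Newspaper relationship seen in the ellipse plot
set.seed(1)
radio     <- runif(200, min = 0, max = 50)
newspaper <- 0.6 * radio + rnorm(200, sd = 10)
cor(radio, newspaper)   # a single correlation coefficient in (-1, 1)
```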
A multiple linear regression would give the model:
advModMult <- lm(formula = Sales ~ TV + Radio + Newspaper, data = adv)
advModMult
## 
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = adv)
## 
## Coefficients:
## (Intercept)           TV        Radio    Newspaper  
##    2.938889     0.045765     0.188530    -0.001037

This gives that the appropriate linear model is:
\[ Y_{\text{Sales}} = 2.9389 + 0.0458 \times \text{TV} + 0.1885 \times \text{Radio} - 0.001 \times \text{Newspaper} \]

The fact that Newspaper has a negative coefficient despite being positively correlated with Sales is indicative of the weak effect newspaper advertising has on sales, as well as of the correlation between Newspaper and Radio.
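Rather than reading the coefficients off the print-out, they can be extracted programmatically with coef(). A sketch on simulated data (the variable names and coefficient values here are illustrative):

```r
# Simulate a response from known coefficients, then recover them from the fit
set.seed(2)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 3 + 0.05 * x1 + 0.19 * x2 + rnorm(100, sd = 0.1)
fit <- lm(y ~ x1 + x2)
coef(fit)          # named vector: (Intercept), x1, x2
coef(fit)["x2"]    # a single coefficient, selected by name
```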
### (d) Test the significance of the parameters

Find the resulting model for Sales in terms of the advertising modes TV, Radio and Newspaper. First summarise the model:
advModMult %>% summary()
## 
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = adv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## Radio        0.188530   0.008611  21.893   <2e-16 ***
## Newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

A summary of the model provides, for each coefficient, the hypothesis test: \[ H_0: \beta_i = 0 \quad \text{vs} \quad H_a: \beta_i \neq 0, \qquad \forall i \]
For TV and Radio there would be an extremely low probability of incorrectly rejecting the null hypothesis that the coefficient is zero, hence those coefficients are accepted as non-zero.
For Newspaper, however, there is a high probability (p = 0.86) of incorrectly rejecting the null hypothesis; it should therefore not be rejected, and newspaper advertising should not be seen as a significant predictor of Sales in this linear model.
It is hence appropriate to remove Newspaper from the model via backwards selection, which gives:
## Calculate Power??
# n <-  length(adv)
# sigma <-  sd(adv$Newspaper)
# sem = sigma/sqrt(n)
# alpha <-  0.05
# mu0 <- 0
# q <- qnorm(p = 0.005, mean = mu0, sd = sem);q
# mu <- q #assumed actual mean value
# pnorm(q, mean = mu, sd = sem, lower.tail = FALSE)

advModMultb1 <- lm(formula = Sales ~ TV + Radio, data = adv)
advModMultb1
## 
## Call:
## lm(formula = Sales ~ TV + Radio, data = adv)
## 
## Coefficients:
## (Intercept)           TV        Radio  
##     2.92110      0.04575      0.18799

advModMultb1.sum <- summary(advModMultb1)

\[ \text{Sales} = 2.9211 + 0.0458 \times \text{TV} + 0.1880 \times \text{Radio} \]
All parameters are highly significant and hence the model is deemed adequate; the model explains roughly 89.7% of the variation, which can be found by appending $r.squared to the model summary object.
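In general, dropping a term during backwards selection is conveniently done with update(), which refits an existing model with the term removed. A sketch on simulated data where x3 is deliberately irrelevant:

```r
# Backwards selection with update(): drop one term from an existing fit
set.seed(3)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 1 + 2 * d$x1 + 3 * d$x2 + rnorm(100)   # x3 plays no role in the response
full    <- lm(y ~ x1 + x2 + x3, data = d)
reduced <- update(full, . ~ . - x3)           # same formula minus x3
names(coef(reduced))                          # x3 is gone
```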
In order to assess the model, consider the anova table and derive the RMSE and \(R^2\) values:
advModMultb1.anova <- anova(advModMultb1)
advModMultb1.anova## Analysis of Variance Table
## 
## Response: Sales
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## TV          1 3314.6  3314.6 1172.50 < 2.2e-16 ***
## Radio       1 1545.6  1545.6  546.74 < 2.2e-16 ***
## Residuals 197  556.9     2.8                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From this we can see that each \(F\) statistic is associated with a very low p-value, again indicating that both predictors are highly significant.
The RMSE value is the root mean square error: it is the standard error of the residuals (recall that a standard error is the standard deviation of a model quantity), so the RMSE is essentially the standard deviation of the residuals of the model (as measured along the \(y\)-axis).
\[ \text{RMSE} = \sqrt{\frac{\sum{\varepsilon^2}}{n}} \]
rmse <- function(model){
  # root mean square error: residual sum of squares divided by n
  sqrt(sum(model$residuals^2) / length(model$residuals))
}
rse <- function(model){
  # residual standard error: divide by the residual degrees of freedom instead
  sqrt(sum(model$residuals^2) / model$df.residual)
}
data.frame("RMSE" = rmse(advModMultb1), "RSE" = rse(advModMultb1))
##     RMSE    RSE
## 1 1.6687 1.6814

So the expected error of the model is \(\pm\) 1.7 units of sale. We could use this to create an approximate 95% prediction interval by multiplying by the corresponding Student's t-statistic (roughly 1.96 here): we would expect an observed value to lie within \(1.96 \times \text{RSE}\) of the fitted value 95% of the time, provided the residuals are normally distributed (this is a statement about individual observations, so it relies on the normality assumption rather than the Central Limit Theorem).
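Rather than hand-rolling \(1.96 \times \text{RSE}\), predict() can return a prediction interval directly via its interval argument. A sketch on simulated data:

```r
# predict() with interval = "prediction" returns fit, lwr and upr columns:
# the point prediction plus a 95% prediction interval around it
set.seed(4)
d <- data.frame(x = runif(50, 0, 10))
d$y <- 2 + 0.5 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)
predict(fit, newdata = data.frame(x = 5), interval = "prediction", level = 0.95)
```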
The coefficient of determination is 89.7%; this can be returned by appending $r.squared to the model summary (i.e. use summary(lm(Y ~ X1 + X2))$r.squared).
The \(R^2\) value will be very nearly the same between the initial model and the backwards selected model, however given that the initial model had non-significant predictors it could be considered as over-parameterized, i.e. it violates Occam’s Razor.
The coefficient of determination is the proportion of the total variance of the data that is explained by the model; in this case it can be determined by:
\[ R^2 = \frac{TSS-SSE}{TSS} = \frac{3314.6+1545.6}{3314.6+1545.6+556.9} \approx 89.7\% \]
The \(R^2\) value is derived from the ANOVA table as follows:
advModMultb1.anova <- anova(advModMultb1)
TSS_Multb1 <- advModMultb1.anova$`Sum Sq` %>% sum()
RSS_Multb1 <- advModMultb1$residuals^2 %>% sum()
((TSS_Multb1 - RSS_Multb1)/TSS_Multb1) %>% signif(3) %>% percent()   # requires the `scales` package
## [1] "89.7%"

When determining the accuracy or performance of a model, always turn your mind to residual analysis (i.e. how normally the residuals are distributed); this is performed below.
The residuals and fitted values may be returned by extracting them from the model (you could always derive them from first principles, but you should use the tools at your disposal: it is quicker, less error-prone and makes for more readable code).
ResDF <- data.frame("Input"  = advModMultb1$fitted.values + advModMultb1$residuals,  # the observed Sales values
                    "Output" = advModMultb1$fitted.values,
                    "Error"  = advModMultb1$residuals)
ResDF %>% head()
##   Input   Output      Error
## 1  22.1 20.55546  1.5445354
## 2  10.4 12.34536 -1.9453623
## 3   9.3 12.33702 -3.0370177
## 4  18.5 17.61712  0.8828840
## 5  12.9 13.22391 -0.3239081
## 6   7.2 12.51208 -5.3120845

predict() will also return the fitted values if no newdata is specified.
ggplot(data = ResDF, aes(x = Output, y = Error, col = Error )) +
  geom_point() +
  theme_bw() +
  stat_smooth(col = "grey")+
  stat_smooth(method = "lm", se = FALSE) +
scale_color_gradient(low = "indianred", high = "royalblue") +
  labs(y = "Residuals", x = "Predicted Value", title = "Residuals Plotted against Output")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

plot(Error ~ Output, data = ResDF, pch = 19, col = "IndianRed", main = "Residuals against Predictions")
smooth <- loess(formula = Error ~ Output, data = ResDF, span = 0.8)
ord <- order(ResDF$Output)   # sort by x so the smoothed line draws left to right
lines(x = ResDF$Output[ord], y = predict(smooth)[ord], col = "Blue")
abline(lm(Error ~ Output, data = ResDF), col = "Purple")

From this it can be seen that the residuals are centred around zero but do not appear normally distributed; the model appears to fail the normality assumption for fitted values \(\in \left( 10, 20 \right)\).
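A formal complement to eyeballing the residual plot is the Shapiro–Wilk normality test, available in base R as shapiro.test(). A sketch on simulated residuals (small p-values reject normality):

```r
# Shapiro–Wilk test: H0 is that the sample is normally distributed
set.seed(5)
normal_resid <- rnorm(100)
shapiro.test(normal_resid)$p.value   # large: no evidence against normality
skewed_resid <- rexp(100) - 1        # strongly skewed
shapiro.test(skewed_resid)$p.value   # tiny: normality rejected
```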
ggplot(data = ResDF, aes(x = Error, col = Output)) +
  geom_histogram() +
  theme_classic() +
  labs(y = "Absolute Count", x = "Residual Value", title = "Residual Distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# overlay histogram and normal density, using the residuals' own mean and sd
ggplot(ResDF, aes(x = Error)) +
  geom_histogram(aes(y = stat(density))) +
  stat_function(
    fun = dnorm,
    args = list(mean = mean(ResDF$Error), sd = sd(ResDF$Error)),
    lwd = 2,
    col = 'IndianRed'
  ) +
  theme_classic() +
  labs(title = "Histogram of Residuals", y = "Density of Observations", x = "Residuals")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can visualise the residuals as white noise:
# Put some White Noise on the ResDF
ResDF <- cbind(ResDF, "Wnoise" = rnorm(nrow(ResDF), mean = 0, sd = sd(ResDF$Error)))
## Our Residuals
ggplot(ResDF, aes(x = Input, y = Error, col = Output)) +
  geom_line() +
  geom_abline(slope = 0, intercept = 0, lty = "twodash", col = "IndianRed") +
  theme(legend.position = "none") + 
  labs(title = "Distribution of Residuals")

## White Noise
ggplot(ResDF, aes(x = Input, y = Wnoise, col = Output)) +
  geom_line() +
  geom_abline(slope = 0, intercept = 0, lty = "twodash", col = "IndianRed") +
  theme(legend.position = "none") + 
  labs(title = "White Noise for Comparison")

Comparing the residuals with the white noise clearly shows that the residuals are non-normal.
hist(ResDF$Error
     , breaks = 30,
     prob=TRUE,
     lwd=2,
    main = "Residual Distribution",
    xlab = "Distance from the Model", border = "purple"
    ) 
  #Overlay the Normal Dist Curve
# curve() evaluates the given expression of x across the plot's x-range
curve(dnorm(x, mean(ResDF$Error), sd(ResDF$Error)), add = TRUE, col = "purple", lwd = 2) # draws the normal density function
                
lwr_conf <- qnorm(p = 0.05, mean(ResDF$Error), sd(ResDF$Error))
upr_conf <- qnorm(p = 1-0.05, mean(ResDF$Error), sd(ResDF$Error))
abline(v=upr_conf, col='pink', lwd=3)  
abline(v=lwr_conf, col='pink', lwd=3)     
abline(v=mean(ResDF$Error), lwd=2, lty='dotted')

It hence appears that the residuals are skewed left, which suggests an upper bound on the residual value; the histogram is sufficiently non-normal to reject the assumption that the residuals are normally distributed.
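The left skew can be quantified with the sample skewness (negative values indicate a left skew); a sketch computed from first principles so no extra package is needed:

```r
# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
set.seed(6)
left_skewed <- -rexp(1000)     # long tail to the left
skewness(left_skewed)          # clearly negative
skewness(rnorm(1000))          # near zero for symmetric data
```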
The residuals can also be analysed by plotting the model:
plot(advModMultb1)

This is an inappropriate model because the residuals are non-normally distributed.
newdata <- data.frame("TV" = c(3, 1, 2), "Radio" = c(4, 5, 6))
Mod_Adv_Mult_b1 <- predict(object = advModMultb1, newdata)
mypreds <- data.frame("Input" = newdata, "Output" =  Mod_Adv_Mult_b1)
mypreds   # Careful with the names; they work out as Input.TV etc. in this case.
##   Input.TV Input.Radio   Output
## 1        3           4 3.810341
## 2        1           5 3.906826
## 3        2           6 4.140575

Reconsider the correlation plots:
pairs(x = adv)
adv[, names(adv) != "Newspaper"] %>% cor() %>% corrplot(method = "ellipse", type = "lower")

From this we can determine that there is little correlation between TV and Radio (however, previously there was correlation between Newspaper and Radio); we should consider an interaction term next.
int_mod_adv <- lm(formula = Sales ~ TV * Radio + TV + Radio, data = adv)   # TV * Radio already expands to TV + Radio + TV:Radio, so the extra terms are redundant
int_mod_adv
## 
## Call:
## lm(formula = Sales ~ TV * Radio + TV + Radio, data = adv)
## 
## Coefficients:
## (Intercept)           TV        Radio     TV:Radio  
##    6.750220     0.019101     0.028860     0.001086

summary(int_mod_adv)
## 
## Call:
## lm(formula = Sales ~ TV * Radio + TV + Radio, data = adv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3366 -0.4028  0.1831  0.5948  1.5246 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
## TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
## Radio       2.886e-02  8.905e-03   3.241   0.0014 ** 
## TV:Radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9673 
## F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16

The model will be of the form:
\[ \text{Sales} = 6.7502 + 0.0191 \times \text{TV} + 0.0289 \times \text{Radio} + 0.001086 \times \text{TV} \times \text{Radio} \]
Note that all the terms of the model are significant, hence we deem the model as adequate.
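In general, whether an interaction term earns its keep can be confirmed with a partial F-test, comparing the nested models via anova(reduced, full). A sketch on simulated data with a genuine interaction:

```r
# Partial F-test between nested models: does the interaction improve the fit?
set.seed(7)
d <- data.frame(x1 = runif(200), x2 = runif(200))
d$y <- 1 + d$x1 + d$x2 + 2 * d$x1 * d$x2 + rnorm(200, sd = 0.2)
reduced <- lm(y ~ x1 + x2, data = d)
full    <- lm(y ~ x1 * x2, data = d)   # x1 * x2 expands to x1 + x2 + x1:x2
anova(reduced, full)                   # a small p-value favours keeping the interaction
```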
When creating polynomial models there are two parameterisations: raw and orthogonal polynomials.
# Orthogonal (removes the correlation between the coefficients)
polymodel_Orth <- lm(Sales ~ poly(x = TV, degree = 3, raw = FALSE), data = adv)
summary(polymodel_Orth)## 
## Call:
## lm(formula = Sales ~ poly(x = TV, degree = 3, raw = FALSE), data = adv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9734 -1.8900 -0.0897  2.0189  7.3765 
## 
## Coefficients:
##                                        Estimate Std. Error t value
## (Intercept)                             14.0225     0.2286  61.353
## poly(x = TV, degree = 3, raw = FALSE)1  57.5727     3.2322  17.812
## poly(x = TV, degree = 3, raw = FALSE)2  -6.2288     3.2322  -1.927
## poly(x = TV, degree = 3, raw = FALSE)3   4.0074     3.2322   1.240
##                                        Pr(>|t|)    
## (Intercept)                              <2e-16 ***
## poly(x = TV, degree = 3, raw = FALSE)1   <2e-16 ***
## poly(x = TV, degree = 3, raw = FALSE)2   0.0554 .  
## poly(x = TV, degree = 3, raw = FALSE)3   0.2165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.232 on 196 degrees of freedom
## Multiple R-squared:  0.622,  Adjusted R-squared:  0.6162 
## F-statistic: 107.5 on 3 and 196 DF,  p-value: < 2.2e-16

# Pure/Raw (has the advantage that you can interpret the coefficients, but the coefficients will depend on each other and hence be correlated)
polymodel <- lm(Sales ~ I(TV*TV*TV) + I(TV*TV) + (TV), data = adv)
summary(polymodel)    # I() inhibits the formula interpretation of *, so TV*TV*TV means TV^3 rather than an interaction
## 
## Call:
## lm(formula = Sales ~ I(TV * TV * TV) + I(TV * TV) + (TV), data = adv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9734 -1.8900 -0.0897  2.0189  7.3765 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.420e+00  8.641e-01   6.272 2.23e-09 ***
## I(TV * TV * TV)  5.572e-07  4.494e-07   1.240 0.216519    
## I(TV * TV)      -3.152e-04  2.022e-04  -1.559 0.120559    
## TV               9.643e-02  2.580e-02   3.738 0.000243 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.232 on 196 degrees of freedom
## Multiple R-squared:  0.622,  Adjusted R-squared:  0.6162 
## F-statistic: 107.5 on 3 and 196 DF,  p-value: < 2.2e-16

polymodel_Raw <- lm(Sales ~ poly(x = TV, degree = 3, raw = TRUE), data = adv)
summary(polymodel_Raw)## 
## Call:
## lm(formula = Sales ~ poly(x = TV, degree = 3, raw = TRUE), data = adv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9734 -1.8900 -0.0897  2.0189  7.3765 
## 
## Coefficients:
##                                         Estimate Std. Error t value
## (Intercept)                            5.420e+00  8.641e-01   6.272
## poly(x = TV, degree = 3, raw = TRUE)1  9.643e-02  2.580e-02   3.738
## poly(x = TV, degree = 3, raw = TRUE)2 -3.152e-04  2.022e-04  -1.559
## poly(x = TV, degree = 3, raw = TRUE)3  5.572e-07  4.494e-07   1.240
##                                       Pr(>|t|)    
## (Intercept)                           2.23e-09 ***
## poly(x = TV, degree = 3, raw = TRUE)1 0.000243 ***
## poly(x = TV, degree = 3, raw = TRUE)2 0.120559    
## poly(x = TV, degree = 3, raw = TRUE)3 0.216519    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.232 on 196 degrees of freedom
## Multiple R-squared:  0.622,  Adjusted R-squared:  0.6162 
## F-statistic: 107.5 on 3 and 196 DF,  p-value: < 2.2e-16

The 3rd-degree coefficient is not significant, hence we consider the 2nd degree:
quadmod <- lm(formula = Sales ~ I(TV*TV) + TV, data = adv)
summary(quadmod)## 
## Call:
## lm(formula = Sales ~ I(TV * TV) + TV, data = adv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6844 -1.7843 -0.1562  2.0088  7.5097 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.114e+00  6.592e-01   9.275  < 2e-16 ***
## I(TV * TV)  -6.847e-05  3.558e-05  -1.924   0.0557 .  
## TV           6.727e-02  1.059e-02   6.349 1.46e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.237 on 197 degrees of freedom
## Multiple R-squared:  0.619,  Adjusted R-squared:  0.6152 
## F-statistic: 160.1 on 2 and 197 DF,  p-value: < 2.2e-16

The quadratic term is not sufficiently significant, so this model is rejected.
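Another way to adjudicate between the candidate models is AIC (lower is better), via the base AIC() function. A sketch on simulated data whose true form includes the interaction:

```r
# Rank candidate model forms by AIC on simulated data
set.seed(8)
d <- data.frame(x1 = runif(200, 0, 100), x2 = runif(200, 0, 50))
d$y <- 3 + 0.05 * d$x1 + 0.19 * d$x2 + 0.001 * d$x1 * d$x2 + rnorm(200)
linear    <- lm(y ~ x1 + x2, data = d)
interact  <- lm(y ~ x1 * x2, data = d)
quadratic <- lm(y ~ poly(x1, 2, raw = TRUE) + x2, data = d)
AIC(linear, interact, quadratic)   # the interaction model should score lowest here
```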
The model selected is the multiple linear regression with the interaction from before:
\[ \text{Sales} = 6.7502 + 0.0191 \times \text{TV} + 0.0289 \times \text{Radio} + 0.001086 \times \text{TV} \times \text{Radio} \]