Material of Tue 19 March 2019, week 3
First import the data and investigate it:
library(magrittr)  # provides the %>% pipe used below
library(corrplot)  # correlation plots
library(ggplot2)   # plotting
library(scales)    # percent() formatting
adv <- read.csv(file = "Datasets/Advertising.csv", header = TRUE, sep = ",")
head(adv)
## TV Radio Newspaper Sales
## 1 230.1 37.8 69.2 22.1
## 2 44.5 39.3 45.1 10.4
## 3 17.2 45.9 69.3 9.3
## 4 151.5 41.3 58.5 18.5
## 5 180.8 10.8 58.4 12.9
## 6 8.7 48.9 75.0 7.2
print("***Dimensions***")
## [1] "***Dimensions***"
dim(adv)
## [1] 200 4
print("***Summary***")
## [1] "***Summary***"
summary(adv)
## TV Radio Newspaper Sales
## Min. : 0.70 Min. : 0.000 Min. : 0.30 Min. : 1.60
## 1st Qu.: 74.38 1st Qu.: 9.975 1st Qu.: 12.75 1st Qu.:10.38
## Median :149.75 Median :22.900 Median : 25.75 Median :12.90
## Mean :147.04 Mean :23.264 Mean : 30.55 Mean :14.02
## 3rd Qu.:218.82 3rd Qu.:36.525 3rd Qu.: 45.10 3rd Qu.:17.40
## Max. :296.40 Max. :49.600 Max. :114.00 Max. :27.00
print("***Structure***")
## [1] "***Structure***"
str(adv)
## 'data.frame': 200 obs. of 4 variables:
## $ TV : num 230.1 44.5 17.2 151.5 180.8 ...
## $ Radio : num 37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
## $ Newspaper: num 69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
## $ Sales : num 22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...
From this we can tell that there is one output variable (Sales), three input variables and 200 observations.
pairs(x = adv)
In order to use corrplot, first create a correlation matrix with cor(adv), then feed that matrix to corrplot via corrplot(cor(adv)):
corMat <- cor(x = adv)
corrplot(corr = corMat, method = "ellipse", type = "lower")
From this we can see that there is a fair amount of correlation between Radio and Newspaper (more so even than between Newspaper and Sales); we should consider this relationship between predictors when deciding upon our model.
A multiple linear regression would give the model:
advModMult <- lm(formula = Sales ~ TV + Radio + Newspaper, data = adv)
advModMult
##
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = adv)
##
## Coefficients:
## (Intercept) TV Radio Newspaper
## 2.938889 0.045765 0.188530 -0.001037
This gives the fitted linear model:
\[ \text{Sales} = 2.9389 + 0.0458 \times \text{TV} + 0.1885 \times \text{Radio} - 0.0010 \times \text{Newspaper} \]
The fact that Newspaper has a negative coefficient despite being positively correlated with Sales is indicative of the weak effect Newspaper advertising has on Sales, as well as the correlation between Newspaper and Radio.
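As a quick sanity check on this (a sketch using the same adv data frame), a simple regression of Sales on Newspaper alone gives a positive coefficient; it only turns negative once Radio enters the model:
# Marginal (single-predictor) coefficient for Newspaper
lm(Sales ~ Newspaper, data = adv)$coefficients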
### (d) Test the significance of the parameters and find the resulting model for Sales in terms of the advertising modes TV, Radio and Newspaper

First, summarise the model:
advModMult %>% summary()
##
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = adv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## Radio 0.188530 0.008611 21.893 <2e-16 ***
## Newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
A summary of the model provides, for each coefficient \(\beta_i\), the hypothesis test: \[ H_0: \enspace \beta_i = 0 \quad \text{vs.} \quad H_a: \enspace \beta_i \neq 0, \qquad i = 0, 1, 2, 3 \]
For TV and Radio there would be an extremely low probability of incorrectly rejecting the null hypothesis that the coefficient is zero, hence it is accepted that those coefficients are non-zero. For Newspaper, however (p = 0.86), there is a high probability of incorrectly rejecting the null hypothesis, so it should not be rejected, and Newspaper advertising should not be seen as a significant predictor of Sales in this linear model.
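The p-values can be pulled out of the summary object directly, for example:
# Coefficient p-values straight from the summary object
summary(advModMult)$coefficients[, "Pr(>|t|)"]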
It is hence appropriate to remove Newspaper from the model via backwards selection, which gives:
## Rough (unfinished) sketch of a power calculation for the Newspaper coefficient
# n <- nrow(adv)  # sample size; length(adv) would count columns, not rows
# sem <- summary(advModMult)$coefficients["Newspaper", "Std. Error"]  # coefficient standard error
# alpha <- 0.05
# mu0 <- 0
# q <- qnorm(p = alpha/2, mean = mu0, sd = sem); q  # lower critical value
# mu <- q  # assumed actual coefficient value
# pnorm(q, mean = mu, sd = sem)  # power against that alternative (0.5 when mu equals q)
advModMultb1 <- lm(formula = Sales ~ TV + Radio, data = adv)
advModMultb1
##
## Call:
## lm(formula = Sales ~ TV + Radio, data = adv)
##
## Coefficients:
## (Intercept) TV Radio
## 2.92110 0.04575 0.18799
advModMultb1.sum <- summary(advModMultb1)
\[ \text{Sales} = 2.9211 + 0.0458 \times \text{TV} + 0.1880 \times \text{Radio} \]
All parameters are highly significant and hence the model is deemed adequate. The model explains roughly 89.7% of the variation; this can be found by appending $r.squared to the model summary object (here advModMultb1.sum$r.squared).
In order to assess the model, consider the anova table and derive the RMSE and \(R^2\) values:
advModMultb1.anova <- anova(advModMultb1)
advModMultb1.anova
## Analysis of Variance Table
##
## Response: Sales
## Df Sum Sq Mean Sq F value Pr(>F)
## TV 1 3314.6 3314.6 1172.50 < 2.2e-16 ***
## Radio 1 1545.6 1545.6 546.74 < 2.2e-16 ***
## Residuals 197 556.9 2.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From this we can see that the \(F\) statistics are associated with very low p-values. For the last term entered (Radio) the sequential \(F\) is exactly the square of its \(t\) statistic from the summary call, so those p-values coincide; for earlier terms (here TV) the sequential test differs from the summary t-test because it is computed before the later predictors enter.
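This can be verified from the stored objects (a quick sketch; advModMultb1.sum and advModMultb1.anova were created above):
advModMultb1.anova$`F value`[2]                      # sequential F for Radio
advModMultb1.sum$coefficients["Radio", "t value"]^2  # equals the F above, since F = t^2 for the last term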
The RMSE value is the root mean square error; it is the standard error of the residuals (recall that a standard error is the standard deviation of an estimator), so the RMSE is essentially the standard deviation of the residuals of the model (as measured along the \(y\)-axis).
\[ \text{RMSE} = \sqrt{\frac{\sum{\varepsilon^2}}{n}} \]
rmse <- function(model){
  # Standard deviation of the residuals; sd() divides by n - 1 rather than the n
  # in the formula above, a negligible difference at n = 200
  sd(model$residuals)
}
rse <- function(model){
  # Proportion of the response variance left unexplained, i.e. approximately 1 - R^2
  # (residuals and fitted values are uncorrelated in OLS, so the denominator is var(y))
  var(model$residuals)/var(model$residuals + model$fitted.values)
}
data.frame("RMSE" = rmse(advModMultb1), "RSE" = rse(advModMultb1))
## RMSE RSE
## 1 1.672891 0.1028057
So the expected error of the model is \(\pm\) 1.7 units of sales. Multiplying by the corresponding Student's t quantile (\(t_{0.975,\,197} \approx 1.97\), close to the normal 1.96), we would expect an individual observed value to lie within roughly \(1.96 \times \text{RMSE}\) of the model 95% of the time, provided the residuals are normally distributed; a proper prediction interval also accounts for the uncertainty in the fitted coefficients.
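For exact intervals, predict() produces t-based confidence and prediction intervals directly; a minimal sketch, where the budgets TV = 150 and Radio = 25 are illustrative values rather than anything from the notes:
# 95% prediction interval for an individual observation
predict(advModMultb1, newdata = data.frame(TV = 150, Radio = 25),
        interval = "prediction", level = 0.95)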
The coefficient of determination is 89.7%; this can be returned by appending $r.squared to the model summary (i.e. summary(lm(Y ~ X1 + X2))$r.squared).
The \(R^2\) value will be very nearly the same between the initial model and the backwards-selected model; however, given that the initial model had non-significant predictors, it could be considered over-parameterized, i.e. it violates Occam's Razor.
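This is easy to confirm by extracting both values (each should come out at roughly 0.897, with the adjusted \(R^2\) marginally favouring the reduced model):
summary(advModMult)$r.squared      # full model, Newspaper included
summary(advModMultb1)$r.squared    # backwards-selected model
summary(advModMult)$adj.r.squared
summary(advModMultb1)$adj.r.squared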
The coefficient of determination is the proportion of the total variance of the data that is explained by the model; in this case:
\[ R^2 = \frac{TSS - SSE}{TSS} = \frac{3314.6 + 1545.6}{3314.6 + 1545.6 + 556.9} \approx 89.7\% \]
The \(R^2\) value is derived from the ANOVA table as follows:
advModMultb1.anova <- anova(advModMultb1)
TSS_Multb1 <- advModMultb1.anova$`Sum Sq` %>% sum()
RSS_Multb1 <- advModMultb1$residuals^2 %>% sum()
((TSS_Multb1- RSS_Multb1)/(TSS_Multb1)) %>% signif(3) %>% percent() #Requires `scales` package
## [1] "89.7%"
When determining the accuracy or performance of a model, always turn your mind to residual analysis (i.e. how close to normally distributed the residuals are); this is performed below.
The residuals and fitted values may be extracted from the model object (you could always derive them from first principles, but using the built-in tools is quicker, less error-prone and makes for more readable code):
ResDF <- data.frame("Input" = advModMultb1$fitted.values + advModMultb1$residuals,  # the observed Sales values
                    "Output" = advModMultb1$fitted.values,
                    "Error" = advModMultb1$residuals)
ResDF %>% head()
## Input Output Error
## 1 22.1 20.55546 1.5445354
## 2 10.4 12.34536 -1.9453623
## 3 9.3 12.33702 -3.0370177
## 4 18.5 17.61712 0.8828840
## 5 12.9 13.22391 -0.3239081
## 6 7.2 12.51208 -5.3120845
predict() will also return the fitted values if no newdata is specified.
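This can be checked directly:
# With no newdata, predict() returns the fitted values
all.equal(unname(predict(advModMultb1)), unname(advModMultb1$fitted.values))  # expect TRUE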
ggplot(data = ResDF, aes(x = Output, y = Error, col = Error)) +
  geom_point() +
  theme_bw() +
  stat_smooth(col = "grey") +
  stat_smooth(method = "lm", se = FALSE) +
  scale_color_gradient(low = "indianred", high = "royalblue") +
  labs(y = "Residuals", x = "Predicted Value", title = "Residuals Plotted against Output")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
plot(Error ~ Output, data = ResDF, pch = 19, col = "IndianRed", main = "Residuals against Predictions")
smooth <- loess(formula = Error ~ Output, data = ResDF, span = 0.8)
ord <- order(ResDF$Output)  # sort so the smoothed curve draws left to right
lines(ResDF$Output[ord], predict(smooth)[ord], col = "Blue")
abline(lm(Error ~ Output, data = ResDF), col = "Purple")
From this it can be seen that the residuals are centred around zero, but they do not look normally distributed; the model appears to violate the normality assumption for predicted values \(\in \left( 10, 20 \right)\).
ggplot(data = ResDF, aes(x = Error, col = Output)) +
geom_histogram() +
theme_classic() +
labs(y = "Absolute Count", x = "Residual Value", title = "Residual Distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Overlay histogram and normal density fitted to the residuals
ggplot(ResDF, aes(x = Error)) +
  geom_histogram(aes(y = stat(density))) +
  stat_function(
    fun = dnorm,
    args = list(mean = mean(ResDF$Error), sd = sd(ResDF$Error)),  # fit to the residuals themselves
    lwd = 2,
    col = 'IndianRed'
  ) +
  theme_classic() +
  labs(title = "Histogram of Residuals", y = "Density of Observations", x = "Residuals")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can visualise the residuals as white noise:
# Add a white-noise column with the same standard deviation as the residuals, for comparison
ResDF <- cbind(ResDF, "Wnoise" = rnorm(nrow(ResDF), mean = 0, sd = sd(ResDF$Error)))
## Our Residuals
ggplot(ResDF, aes(x = Input, y = Error, col = Output)) +
  geom_line() +
  geom_abline(slope = 0, intercept = 0, lty = "twodash", col = "IndianRed") +
  theme(legend.position = "none") +
  labs(title = "Distribution of Residuals")
## White Noise
ggplot(ResDF, aes(x = Input, y = Wnoise, col = Output)) +
  geom_line() +
  geom_abline(slope = 0, intercept = 0, lty = "twodash", col = "IndianRed") +
  theme(legend.position = "none") +
  labs(title = "Simulated White Noise")
Compared with the white noise, the residuals show visibly more structure, again suggesting that they are not normally distributed.
hist(ResDF$Error,
     breaks = 30,
     prob = TRUE,
     lwd = 2,
     main = "Residual Distribution",
     xlab = "Distance from the Model", border = "purple"
)
# Overlay the fitted normal density curve; with add = TRUE, curve() treats x as
# the plotting variable and takes its range from the existing plot
curve(dnorm(x, mean(ResDF$Error), sd(ResDF$Error)), add = TRUE, col = "purple", lwd = 2)
lwr_conf <- qnorm(p = 0.05, mean(ResDF$Error), sd(ResDF$Error))
upr_conf <- qnorm(p = 1-0.05, mean(ResDF$Error), sd(ResDF$Error))
abline(v=upr_conf, col='pink', lwd=3)
abline(v=lwr_conf, col='pink', lwd=3)
abline(v=mean(ResDF$Error), lwd=2, lty='dotted')
It hence appears that the residuals are skewed left, which suggests an upper bound on the residual values; the histogram is sufficiently non-normal to reject the assumption that the residuals are normally distributed.
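A formal test can complement the visual inspection; a brief sketch (not in the original notes) using the Shapiro-Wilk test, where a small p-value supports rejecting normality:
shapiro.test(ResDF$Error)
qqnorm(ResDF$Error)  # systematic departure from the reference line indicates non-normality
qqline(ResDF$Error)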
The residuals can also be analysed by plotting the model object, which produces R's standard diagnostic plots:
plot(advModMultb1)
This is an inappropriate model because the residuals are not normally distributed. Predictions can nevertheless be made with predict() and new data:
newdata = data.frame("TV" = c(3, 1, 2), "Radio" = c(4, 5, 6))
Mod_Adv_Mult_b1 <- predict(object = advModMultb1, newdata)
mypreds <- data.frame("Input" = newdata, "Output" = Mod_Adv_Mult_b1)
mypreds # Careful with the names; they work out as Input.<name> in this case.
## Input.TV Input.Radio Output
## 1 3 4 3.810341
## 2 1 5 3.906826
## 3 2 6 4.140575
Reconsider the correlation plots:
pairs(x = adv)
adv[, names(adv)!="Newspaper"] %>% cor() %>% corrplot(method = "ellipse", type = "lower")
From this we can see that there is essentially no correlation between TV and Radio (although previously there was some between Newspaper and Radio). Note that correlation between predictors is distinct from an interaction effect on the response, so a TV:Radio interaction is still worth considering next.
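The relevant pairwise correlations can also be checked numerically (TV and Radio should come out close to zero, Radio and Newspaper noticeably larger):
cor(adv$TV, adv$Radio)
cor(adv$Radio, adv$Newspaper)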
# Note: TV * Radio already expands to TV + Radio + TV:Radio, so the extra main-effect terms are redundant
int_mod_adv <- lm(formula = Sales ~ TV * Radio + TV + Radio, data = adv)
int_mod_adv
##
## Call:
## lm(formula = Sales ~ TV * Radio + TV + Radio, data = adv)
##
## Coefficients:
## (Intercept) TV Radio TV:Radio
## 6.750220 0.019101 0.028860 0.001086
summary(int_mod_adv)
##
## Call:
## lm(formula = Sales ~ TV * Radio + TV + Radio, data = adv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## Radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:Radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
The fitted model is of the form:
\[ \text{Sales} = 6.7502 + 0.0191 \times \text{TV} + 0.0289 \times \text{Radio} + 0.001086 \times \text{TV} \times \text{Radio} \]
Note that all the terms of the model are significant, hence we deem the model adequate.
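A nested-model F test makes the comparison with the additive model explicit (both models were fitted above):
# Does adding TV:Radio significantly improve on Sales ~ TV + Radio?
anova(advModMultb1, int_mod_adv)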
When creating polynomial models there is a choice between raw and orthogonal polynomials:
# Orthogonal (removes the correlation between the coefficient estimates)
polymodel_Orth <- lm(Sales ~ poly(x = TV, degree = 3, raw = FALSE), data = adv)
summary(polymodel_Orth)
##
## Call:
## lm(formula = Sales ~ poly(x = TV, degree = 3, raw = FALSE), data = adv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9734 -1.8900 -0.0897 2.0189 7.3765
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 14.0225 0.2286 61.353
## poly(x = TV, degree = 3, raw = FALSE)1 57.5727 3.2322 17.812
## poly(x = TV, degree = 3, raw = FALSE)2 -6.2288 3.2322 -1.927
## poly(x = TV, degree = 3, raw = FALSE)3 4.0074 3.2322 1.240
## Pr(>|t|)
## (Intercept) <2e-16 ***
## poly(x = TV, degree = 3, raw = FALSE)1 <2e-16 ***
## poly(x = TV, degree = 3, raw = FALSE)2 0.0554 .
## poly(x = TV, degree = 3, raw = FALSE)3 0.2165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.232 on 196 degrees of freedom
## Multiple R-squared: 0.622, Adjusted R-squared: 0.6162
## F-statistic: 107.5 on 3 and 196 DF, p-value: < 2.2e-16
# Pure/raw (has the advantage that you can interpret the coefficients, but the estimates depend on each other and are hence correlated)
polymodel <- lm(Sales ~ I(TV*TV*TV) + I(TV*TV) + (TV), data = adv)
summary(polymodel) # I() inhibits the formula interpretation of *, so TV*TV means TV^2 rather than an interaction
##
## Call:
## lm(formula = Sales ~ I(TV * TV * TV) + I(TV * TV) + (TV), data = adv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9734 -1.8900 -0.0897 2.0189 7.3765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.420e+00 8.641e-01 6.272 2.23e-09 ***
## I(TV * TV * TV) 5.572e-07 4.494e-07 1.240 0.216519
## I(TV * TV) -3.152e-04 2.022e-04 -1.559 0.120559
## TV 9.643e-02 2.580e-02 3.738 0.000243 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.232 on 196 degrees of freedom
## Multiple R-squared: 0.622, Adjusted R-squared: 0.6162
## F-statistic: 107.5 on 3 and 196 DF, p-value: < 2.2e-16
# poly() with raw = TRUE fits the same raw polynomial; each term represents TV^k, not an interaction
polymodel_Raw<- lm(Sales ~ poly(x = TV, degree = 3, raw = TRUE), data = adv)
summary(polymodel_Raw)
##
## Call:
## lm(formula = Sales ~ poly(x = TV, degree = 3, raw = TRUE), data = adv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9734 -1.8900 -0.0897 2.0189 7.3765
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 5.420e+00 8.641e-01 6.272
## poly(x = TV, degree = 3, raw = TRUE)1 9.643e-02 2.580e-02 3.738
## poly(x = TV, degree = 3, raw = TRUE)2 -3.152e-04 2.022e-04 -1.559
## poly(x = TV, degree = 3, raw = TRUE)3 5.572e-07 4.494e-07 1.240
## Pr(>|t|)
## (Intercept) 2.23e-09 ***
## poly(x = TV, degree = 3, raw = TRUE)1 0.000243 ***
## poly(x = TV, degree = 3, raw = TRUE)2 0.120559
## poly(x = TV, degree = 3, raw = TRUE)3 0.216519
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.232 on 196 degrees of freedom
## Multiple R-squared: 0.622, Adjusted R-squared: 0.6162
## F-statistic: 107.5 on 3 and 196 DF, p-value: < 2.2e-16
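Despite the different coefficient estimates, the orthogonal and raw parameterisations describe the same fitted curve, which can be checked directly:
all.equal(fitted(polymodel_Orth), fitted(polymodel_Raw))  # expect TRUE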
The 3rd degree coefficient is not significant, hence we consider the 2nd degree:
quadmod <- lm(formula = Sales ~ I(TV*TV) + TV, data = adv)
summary(quadmod)
##
## Call:
## lm(formula = Sales ~ I(TV * TV) + TV, data = adv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6844 -1.7843 -0.1562 2.0088 7.5097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.114e+00 6.592e-01 9.275 < 2e-16 ***
## I(TV * TV) -6.847e-05 3.558e-05 -1.924 0.0557 .
## TV 6.727e-02 1.059e-02 6.349 1.46e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.237 on 197 degrees of freedom
## Multiple R-squared: 0.619, Adjusted R-squared: 0.6152
## F-statistic: 160.1 on 2 and 197 DF, p-value: < 2.2e-16
The quadratic term is not sufficiently significant (p ≈ 0.056), so this model is also rejected.
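As a cross-check (not in the original notes), an information-criterion comparison of the candidate models points the same way; lower AIC is better, and the interaction model should win comfortably:
AIC(advModMultb1, int_mod_adv, quadmod)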
The model selected is therefore the multiple linear regression with the interaction term from before:
\[ \text{Sales} = 6.7502 + 0.0191 \times \text{TV} + 0.0289 \times \text{Radio} + 0.001086 \times \text{TV} \times \text{Radio} \]
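The selected model can then be used for prediction in the usual way; a minimal sketch, where TV = 100 and Radio = 20 are made-up budget values:
predict(int_mod_adv, newdata = data.frame(TV = 100, Radio = 20), interval = "confidence")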