Material of Tue 12 2019, week 2
adv <- read.csv(file = "Datasets/Advertising.csv", header = TRUE, sep = ",")
head(adv)
## TV Radio Newspaper Sales
## 1 230.1 37.8 69.2 22.1
## 2 44.5 39.3 45.1 10.4
## 3 17.2 45.9 69.3 9.3
## 4 151.5 41.3 58.5 18.5
## 5 180.8 10.8 58.4 12.9
## 6 8.7 48.9 75.0 7.2
str(adv)
## 'data.frame': 200 obs. of 4 variables:
## $ TV : num 230.1 44.5 17.2 151.5 180.8 ...
## $ Radio : num 37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
## $ Newspaper: num 69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
## $ Sales : num 22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...
summary(adv)
## TV Radio Newspaper Sales
## Min. : 0.70 Min. : 0.000 Min. : 0.30 Min. : 1.60
## 1st Qu.: 74.38 1st Qu.: 9.975 1st Qu.: 12.75 1st Qu.:10.38
## Median :149.75 Median :22.900 Median : 25.75 Median :12.90
## Mean :147.04 Mean :23.264 Mean : 30.55 Mean :14.02
## 3rd Qu.:218.82 3rd Qu.:36.525 3rd Qu.: 45.10 3rd Qu.:17.40
## Max. :296.40 Max. :49.600 Max. :114.00 Max. :27.00
So I’d like to do a shiny ggplot here; however, it’s probably just as easy to use a tabset by appending {.tabset} to the heading.
# par(mfrow=c(2,2))
# plot(lm(y~x))
Multiple plots may be fitted into one output using either the par() function or the layout() function. I personally prefer layout(), I think because in the past I had a bad experience with par():

- par(): call par(mfrow = c(ROWS, COLS)) or par(mfcol = c(ROWS, COLS)); mfrow and mfcol differ only in whether the grid is filled by row or by column, so they are identical in this case (a single row).
- layout(): call layout(MATRIX); for example layout(matrix(1:3, nrow = 1)) will fit the plots into the cells of that matrix in the order specified:

# Set the layout:
# Using `layout()` command:
layout(matrix(1:3, nrow = 1))
# using `par()` command:
# par(mfrow = c(1, 3)) # Specify a 1-row, 3-column grid of plots
# Set the plot Domain
pdom <- c(0, 300) #Plot Domain
#Generate the plots
plot(formula = Sales ~ TV, data = adv, xlim = pdom,
main = "Sales Given TV Advertising")
plot(formula = Sales ~ Newspaper, data = adv, xlim = pdom,
main = "Sales Given Newspaper Advertising")
plot(formula = Sales ~ Radio, data = adv, xlim = pdom,
main = "Sales Given Radio Advertising")
adv$MeanAdvertising <- rowMeans(adv[, names(adv) != "Sales"])
AdvTVPlot <- ggplot(data = adv, aes(x = TV, y = Sales, col = MeanAdvertising)) +
geom_point() +
theme_bw() +
stat_smooth(method = 'lm', formula = y ~ poly(x, 2, raw = TRUE), se = FALSE) +
##stat_smooth(method = 'lm', formula = y ~ log(x), se = FALSE) +
labs(col = "Mean Advertising", x= "TV Advertising")
if(knitr::is_latex_output()){
AdvTVPlot
} else {
AdvTVPlot %>% ggplotly()
}
AdvRadPlot <- ggplot(data = adv, aes(x = Radio, y = Sales, col = MeanAdvertising)) +
geom_point() +
theme_bw() +
labs(col = "Mean Advertising", x= "Radio Advertising") +
geom_smooth(method = 'lm')
# plotly doesn't work with knitr/LaTeX output, so test the output format and choose accordingly:
# This could be combined into an interactive graph by wrapping the plot in ggplotly()
if(knitr::is_latex_output()){
AdvRadPlot
} else {
AdvRadPlot %>% ggplotly()
}
AdvNewsPlot <- ggplot(data = adv, aes(x = Newspaper, y = Sales, col = MeanAdvertising)) +
geom_point() +
theme_bw() +
labs(col = "Mean Advertising", x= "Newspaper Advertising")
# plotly doesn't work with knitr/LaTeX output, so test the output format and choose accordingly:
# This could be combined into an interactive graph by wrapping the plot in ggplotly()
if(knitr::is_latex_output()){
AdvNewsPlot
} else {
AdvNewsPlot %>% ggplotly()
}
pdom <- c(0, 300) #Plot Domain
plot(formula = Sales ~ TV, data = adv, xlim = pdom,
main = "Sales Given TV Advertising")
plot(formula = Sales ~ Newspaper, data = adv, xlim = pdom,
main = "Sales Given Newspaper Advertising")
plot(formula = Sales ~ Radio, data = adv, xlim = pdom,
main = "Sales Given Radio Advertising")
The correlation coefficient can be found by using the cor() function; it is a measure of the strength of a linear relationship, ranging from -1 to 1, where a value of 0 represents no linear relationship.
The Pearson correlation coefficient tends to be used over other measures and its value is given by: \[ r_{xy} = \frac{\sum^{n}_{i= 1} \left( x_i - \overline{x} \right) \left( y_i - \overline{y} \right)}{\sqrt{\sum^{n}_{i= 1} \left( x_i - \overline{x} \right)^2} \sqrt{\sum^{n}_{i= 1} \left( y_i- \overline{y} \right)^2}} \] Some of the assumptions underlying the correlation coefficient are:
The correlation coefficient in this case can be found by using cor(x = adv$TV, y = adv$Sales), which gives \(r \approx\) 0.78. This might not be a meaningful value, however, because the variance of the sales appears to increase as advertising increases; if that is overlooked, the Pearson correlation coefficient indicates a reasonably strong positive linear relationship.
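As a quick sanity check, the definition above can be evaluated directly and compared against cor(); this is just a sketch using the adv columns already loaded (the intermediate names are only for illustration):

x <- adv$TV
y <- adv$Sales
# Pearson correlation computed from its definition
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
r_manual                        # roughly 0.78
cor(x = adv$TV, y = adv$Sales)  # should agree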
The parameter estimates may be returned by summarising the fitted model with summary():
lmMod <- lm(formula = Sales ~ TV, data = adv)
lmSum <- summary(lmMod)
lmSum
##
## Call:
## lm(formula = Sales ~ TV, data = adv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
lmSum$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.03259355 0.457842940 15.36028 1.40630e-35
## TV 0.04753664 0.002690607 17.66763 1.46739e-42
lmMod2 <- lm(formula = Sales ~ TV, data = adv)
In this case we have:
The standard deviation of a statistic used as an estimator of a population parameter is often referred to as the standard error of the estimator (S.E.); it is the \(\pm\) values specified above:
Where:
You may also have the standard deviation of the residuals (the vertical distance of a point from the regression line); this is known as the Residual Standard Error and is calculated via the Ordinary Least Squares method 3; it is given by:
\[\begin{align} \sigma_{\varepsilon} = S.E. & = \sqrt{\frac{\textbf{RSS}}{N}}\\ \ \\ &= \sqrt{\frac{\sum^{n}_{i= 1} \left[ \left( y_i - \hat{y}_i \right)^2 \right]}{N}} \end{align}\]
Which you’ll notice is identical to the RMSE.
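As a sketch of that calculation using the lmMod object fitted above (note that R's built-in sigma() divides by the residual degrees of freedom, n - 2, rather than N, so the two values differ very slightly):

res <- residuals(lmMod)            # the model residuals (illustrative name)
sqrt(sum(res^2) / length(res))     # RSS / N form, as in the formula above
sigma(lmMod)                       # R's residual standard error (divides by n - 2)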
So by the empirical rule, \(\pm 2\times \text{S.E.}\) around an estimate would represent roughly a 95% confidence interval (rather than a prediction interval). Drawing such confidence intervals for the parameter estimates:
paramint <- confint(object = lm(adv$Sales ~ adv$TV), level = 0.95) %>% signif(2)
paramint
## 2.5 % 97.5 %
## (Intercept) 6.100 7.900
## adv$TV 0.042 0.053
Drawing from this we could expect, with only a 5% probability of incorrectly rejecting the null hypothesis that there is no relationship, that in the absence of advertising the sales would fall between 6.1 and 7.9 (thousand units).
With the same degree and type of certainty it could also be concluded that for every $1000 increase in TV advertising, sales will increase by between 42 and 53 units.
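The same intervals can be reproduced by hand from the coefficient table, using the t critical value on n - 2 degrees of freedom (a sketch based on the lmSum object from above; est, se and tcrit are just illustrative names):

est   <- lmSum$coefficients[, "Estimate"]
se    <- lmSum$coefficients[, "Std. Error"]
tcrit <- qt(p = 0.975, df = nrow(adv) - 2)
cbind(lower = est - tcrit * se,
      upper = est + tcrit * se)   # matches confint() above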
If it is appropriate to fit a linear model to the data, then we can test for a relationship between the variables by considering whether or not the slope is non-zero, \(\beta_1 \neq 0\). A zero coefficient would mean the model predicts \(Y = C + \varepsilon\), i.e. \(X\) is not a feature/predictor of \(Y\); however, \(Y\) may still be a function of (or rather a response variable to) other factors that are ‘behind the scenes’.4
So our hypotheses would be:
\[\begin{align} H_0 : \enspace \beta_1 &= 0 \qquad ( \small {\text{the null hypothesis is that nothing is related}})\\ H_1 : \enspace \beta_1 &\neq 0 \end{align}\]
So our interest is to determine how far from zero our estimate of \(\beta_1\) needs to be for us to conclude:
“The estimated value of \(\beta_1\) is so far from zero that we can conclude it is not zero at some significance level \(^{\dagger}\).”
\(\dagger\) i.e. at some low probability of incorrectly rejecting the null hypothesis.
The problem is defining how far from zero is far enough. For this we use the expected distance from the regression line, the standard error from above: a value observed too many standard deviations from the mean is not very likely to occur.
A statistical method that relies on an underlying assumption about the statistical distribution of the data is known as a parametric method; in this case, it is a fundamental assumption of Ordinary Least Squares linear regression that the errors are normally distributed.5
This is a situation where we use Student’s t-test: because this is a sample and the population standard deviation \(\left( \sigma \right)\) is not known, the confidence interval for the mean must be made broader to account for the fact that the sample standard deviation \(s\) is being used to estimate \(\sigma\).
Because the sampled population is normally distributed, the sampling distribution of \(\bar{x}\) will be normally distributed 6 (regardless of sample size) and centred about \(\mu\) with a standard deviation of \(\frac{\sigma}{\sqrt{n}}\). If the population were non-normal, the sampling distribution would still be approximately normal for \(n\geq 30\).
Because \(\frac{\sigma}{\sqrt{n}}\) is the standard deviation of the sample mean \(\bar{x}\), it is referred to as the Standard Error of the mean 7. We could calculate the critical value along the standard normal distribution corresponding to the sampling distribution in order to determine probabilities; however, \(\sigma\) is unknown, and using \(s\) in its place does not give a normal distribution. The distribution it creates is Gosset’s Student’s t-distribution 8:
\[\begin{align} t = \frac{\overline{x}- \mu}{\frac{s}{\sqrt{n}}} \end{align}\]
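To see why the distinction matters, compare the normal and t critical values at the same tail probability; the gap closes as the degrees of freedom grow (a small sketch, the df values are arbitrary):

qnorm(p = 0.05)                  # about -1.64 under the standard normal
qt(p = 0.05, df = 5)             # about -2.02: heavier tails for a small sample
qt(p = 0.05, df = nrow(adv) - 2) # about -1.65: nearly normal at 198 degrees of freedom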
So in this case our test statistic will be:
\[\begin{align} t = \frac{\hat{\beta}_1- 0}{\text{SE}\left( \hat{\beta}_1 \right)} \end{align}\]
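For example, the t value reported by summary() for the TV slope is just the estimate divided by its standard error; a quick sketch using the lmSum object from above:

b1  <- lmSum$coefficients["TV", "Estimate"]
se1 <- lmSum$coefficients["TV", "Std. Error"]
(b1 - 0) / se1   # roughly 17.7, matching the t value in the summary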
In order to perform this test in R we can use qnorm and qt to return critical values; t.test will perform a hypothesis test directly from input data, but that’s not suitable here.
tcritval <- qt(p = 0.05, df = nrow(adv) - 2)
tcritval %>% signif(2)
## [1] -1.7
So the critical t-value is -1.7, and from the summary call before we have that the t-statistic is about 17.7, which far exceeds this; in fact, further over to the right the p-value is reported as \(< 2\times 10^{-16}\).
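The exact p-value can be reproduced from the t-distribution directly; a sketch using the t value stored in the summary object:

tstat <- lmSum$coefficients["TV", "t value"]
2 * pt(q = abs(tstat), df = nrow(adv) - 2, lower.tail = FALSE)  # two-sided, roughly 1.5e-42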
In practice you’d just read off the p-values and pick the ones with * to the right of them; the more *, the more significant.
Hence we reject the hypothesis that no relationship exists, with an extremely low probability of incorrectly doing so (i.e. a low probability of committing a Type I error).
In order to plot this with base graphics, feed the model object, i.e. lm(Y ~ X), into a call to abline() to draw the fitted line over the top of the base plot, so altogether it might look like: 9
Form <- Sales ~ TV
Lmodel <- lm(formula = Form, data = adv, na.action = na.exclude)
plot(Form, data = adv)
abline(Lmodel)
Or you could even do it like this, but I think the way above is better syntax because it behaves better with the predict() function and follows tidyverse-style syntax:
Lmodel <- lm(adv$Sales ~ adv$TV)
plot(x = adv$TV, y = adv$Sales)
abline(a = Lmodel$coefficients[1], b = Lmodel$coefficients[2])
plot(formula = Sales ~ TV, data = adv, xlim = pdom,
main = "Sales Given TV Advertising")
abline(lmMod)
AdvTVPlot <- ggplot(data = adv, aes(x = TV, y = Sales, col = MeanAdvertising)) +
geom_point() +
theme_bw() +
stat_smooth(method = 'lm', formula = y ~ x, se = FALSE)
AdvTVPlot
If we needed to feed ggplot a specific model we could do that like this; it’s rather involved and you’d probably rather not do it this way, but if you really need to:
AdvTVPlot <- ggplot(data = adv, aes(x = TV, y = Sales, col = MeanAdvertising)) +
geom_point() +
theme_bw() +
stat_smooth(
  method = "lm",
  mapping = aes(y = predict(lmMod))
)
AdvTVPlot
The model can be assessed by considering the:
The textbook makes, in my opinion, a mistake in that it refers to the Root Mean Square Error (RMSE) as the Residual Standard Error (RSE) 10. This is true in the sense that the standard error of the residuals (\(\varepsilon\)) would be the RMSE, so we would have RMSE \(= \sigma_{\varepsilon}\); that’s fine.
The issue is that another common term, the Relative Squared Error (RSE), is also often used 11, so the abbreviation is ambiguous; henceforth I will:
Recall that the model was of the form \(Y = \beta_1 X + \beta_0 + \varepsilon\); the RMSE (Root Mean Square Error) is the standard deviation of \(\varepsilon\) as measured along the \(Y\)-axis:
\[\begin{align} \sigma_{\varepsilon} = \sqrt{\frac{\sum^{n}_{i= 1} \left[ \left( y_i - \hat{y}_i \right)^2 \right]}{N}} \end{align}\]
This value can be returned from R by investigating the anova table:
anova(lmMod)
## Analysis of Variance Table
##
## Response: Sales
## Df Sum Sq Mean Sq F value Pr(>F)
## TV 1 3314.6 3314.6 312.14 < 2.2e-16 ***
## Residuals 198 2102.5 10.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table it can be seen that the average squared residual is 10.6
\[\begin{align} \text{mean}\left( \varepsilon^2\right) &= 10.6\\ \implies \frac{1}{n} \cdot \sum^{n}_{i=1} \left[ \varepsilon_i^2 \right] & =10.6\\ \implies \frac{1}{n} \cdot \sum^{n}_{i=1} \left[ \left( \hat{y}_i - y_i \right)^2 \right] & =10.6\\ \implies \sqrt{\frac{1}{n} \cdot \sum^{n}_{i=1} \left[ \left( \hat{y}_i - y_i \right)^2 \right] } & = 3.2\\ \ \\ \implies \sigma_{\varepsilon} &= 3.2 \end{align}\]
Thus we may conclude that we expect the model to predict sales to within about \(\pm\) 3.2 units, which is reasonably precise and hence useful.
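The same number can be pulled straight from the ANOVA table programmatically rather than read off by eye (a sketch; aovTab is just an illustrative name):

aovTab <- anova(lmMod)
sqrt(aovTab["Residuals", "Mean Sq"])  # about 3.26, the residual standard error
sigma(lmMod)                          # agrees (both divide by the residual degrees of freedom)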
The coefficient of determination is the proportion of the variation in the response that is explained by the model:
\[\begin{align} R^2 &= \frac{TSS-RSS}{TSS}\\ &= \frac{3315}{3315+2103}\\ \ \\ &= 0.612 \end{align}\]
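Those sums of squares can likewise be taken from the ANOVA table to reproduce \(R^2\) by hand (a sketch; the row and column names follow the anova(lmMod) output above):

aovTab <- anova(lmMod)
RSS <- aovTab["Residuals", "Sum Sq"]
TSS <- sum(aovTab[, "Sum Sq"])  # regression SS + residual SS
(TSS - RSS) / TSS               # roughly 0.612, matching lmSum$r.squared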
In practice we would simply extract the coefficient of determination (\(R^2\)) from the model-summary:
lmSum$r.squared %>% round(3) %>% percent()
## [1] "61.2%"
This value suggests that a reasonable amount of the variation is explained by the model, but perhaps a non-linear model could explain more of the variance. (Be careful: a large coefficient of determination doesn’t necessarily mean that the slope is significantly different from 0.)
layout(matrix(1:4, nrow = 2))
plot(lmMod)
When making predictions it is important to ensure that the names of the data frame are syntactically valid, otherwise you will have a bad day trying to get predict() and ggplot2 to work, because specifying the data frame names in a formula will be difficult. Make sure that names are always syntactically valid.
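If a data frame arrives with awkward names (spaces, leading digits and so on), make.names() will coerce them into syntactically valid ones; a minimal sketch with a made-up data frame:

messy <- data.frame(`TV Spend` = c(10, 20), `2019 Sales` = c(1, 2), check.names = FALSE)
names(messy)                       # "TV Spend"  "2019 Sales"  -- awkward in a formula
names(messy) <- make.names(names(messy))
names(messy)                       # "TV.Spend"  "X2019.Sales" -- safe to use with lm()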
What is important is that you create your model with the correct syntax. If you create your model like this:
mymodelWRONG <- lm(adv$Sales ~ adv$TV)
you won’t be able to predict data like this:
predict(object = lmMod, newdata = data.frame("TV" = 300))
you’ll just get an error that says 'newdata' had 1 row but variables found have 200 rows. You have to give the variables corresponding names so that the model object can save them for later and make the connection; for instance, if you inspect the terms from above you will get:
mymodelWRONG[["terms"]]
which outputs, at the tail end:
adv$Sales adv$TV
"numeric" "numeric"
whereas if you create the model like this:
lmModCORRECT <- lm(formula = Sales ~ TV, data = adv)
predict(object = lmModCORRECT, newdata = data.frame("TV" = 300))
and inspect the terms with:
lmModCORRECT[["terms"]]
you will get this as output
Sales TV
"numeric" "numeric"
where Sales and TV are the outputs of names(adv), and so I can use those names when I call predict(). You should not use attach(); it will cause problems later. However, it can be nice to use attach() just before a predict call to get auto-completed names, then remove the attach() and re-execute the script.
So always use lm(formula = Y ~ X, data = myDF), because it works the best; you have to use the same syntax/format when using predict or ggplot anyway, so there’s no reason not to use it throughout.
Also, the lecturer said to use lists; I reckon use data frames, because that way your newdata matches the input data one-to-one. Moreover:
input <- 3
output <- predict(object = lmMod, newdata = data.frame("TV" = input))
predDatasingle <- data.frame(input, output)
names(predDatasingle) <- names(adv[c(1,4)])
print(predDatasingle)
## TV Sales
## 1 3 7.175203
input <- seq(from = 100, to = 900, by = 100)
output <- predict(object = lmMod, newdata = data.frame("TV" = input))
predDF <- data.frame(input, output)
names(predDF) <- names(adv[c(1,4)])
predDF
## TV Sales
## 1 100 11.78626
## 2 200 16.53992
## 3 300 21.29359
## 4 400 26.04725
## 5 500 30.80091
## 6 600 35.55458
## 7 700 40.30824
## 8 800 45.06191
## 9 900 49.81557
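predict() will also return the interval estimates discussed earlier; a short sketch contrasting the confidence interval for the mean response with the (wider) prediction interval for a single new observation:

predict(object = lmMod, newdata = data.frame("TV" = 300), interval = "confidence")
predict(object = lmMod, newdata = data.frame("TV" = 300), interval = "prediction")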
i.e. choosing \(\beta_0\) and \(\beta_1\) to minimise \(\left(\textbf{RSS} = \sum^{n}_{i= 1} \left[ \left( y_i - \hat{y_i} \right)^2 \right] \right)\)↩︎
Refer to page 67 of the text book, section [3.1.2]↩︎
Mendenhall, Introduction to Probability & Statistics p. 254 [7.4]↩︎
By the Central Limit Theorem↩︎
Mendenhall, Introduction to Probability & Statistics p. 254 [7.4]↩︎
Mendenhall, Introduction to Probability & Statistics p. 254 [7.4]↩︎
na.exclude will pad the extracted values so lengths are the same; na.omit will not↩︎
Refer to page 69 of the TB for the RMSE definition; the TB divides by the degrees of freedom, which is probably more correct than dividing by the sample size.↩︎
An Introduction to Data Science : Model Evaluation - Regression↩︎
Refer to page 96 of the TB; log or exp transforming may be appropriate here, as the data is not homoscedastic and is hence said to be heteroscedastic.↩︎