Simple Linear Regression

Material of Tue 12 2019, week 2

Question 1

(a) Import the Data

adv <- read.csv(file = "Datasets/Advertising.csv", header = TRUE, sep = ",")

Inspect the structure of the Data Set

head(adv)
##      TV Radio Newspaper Sales
## 1 230.1  37.8      69.2  22.1
## 2  44.5  39.3      45.1  10.4
## 3  17.2  45.9      69.3   9.3
## 4 151.5  41.3      58.5  18.5
## 5 180.8  10.8      58.4  12.9
## 6   8.7  48.9      75.0   7.2
str(adv)
## 'data.frame':    200 obs. of  4 variables:
##  $ TV       : num  230.1 44.5 17.2 151.5 180.8 ...
##  $ Radio    : num  37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
##  $ Newspaper: num  69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
##  $ Sales    : num  22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...
summary(adv)
##        TV             Radio          Newspaper          Sales      
##  Min.   :  0.70   Min.   : 0.000   Min.   :  0.30   Min.   : 1.60  
##  1st Qu.: 74.38   1st Qu.: 9.975   1st Qu.: 12.75   1st Qu.:10.38  
##  Median :149.75   Median :22.900   Median : 25.75   Median :12.90  
##  Mean   :147.04   Mean   :23.264   Mean   : 30.55   Mean   :14.02  
##  3rd Qu.:218.82   3rd Qu.:36.525   3rd Qu.: 45.10   3rd Qu.:17.40  
##  Max.   :296.40   Max.   :49.600   Max.   :114.00   Max.   :27.00

(b) Construct Scatter Plots

I’d like to do a shiny ggplot here; however, it’s probably just as easy to use a tabset by appending {.tabset} to the heading.

# par(mfrow=c(2,2))
# plot(lm(y~x))

Multiple plots may be fitted into one output using either the par() function or the layout() function. I personally prefer layout(), I think because in the past I had a bad experience with par():

  • In order to use par:
    • par(mfrow = c(ROWS, COLS))
    • par(mfcol = c(ROWS, COLS))
      • mfrow fills the grid by row and mfcol fills it by column; with a single row (as below) the two are identical
  • In order to use layout:
    • layout(MATRIX)
      • The matrix defines a grid and the plots are fed into that grid in numerical order, so for example:
        • layout(matrix(1:3, nrow = 1)) will fit the plots to the following matrix in the order specified:
        \[ \begin{bmatrix} 1 & 2 & 3 \end{bmatrix} \]
  • In order to use grid.arrange() (from the gridExtra package):
    • grid.arrange(plot1, plot2, ncol = 2)
      • This is the only one of the three that works with ggplot2 (see the sketch after this list)
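A minimal sketch of the grid.arrange() approach, assuming the gridExtra package is available and using the ggplot objects AdvTVPlot and AdvRadPlot that are created further below:

library(gridExtra)
# Arrange two ggplot objects side by side (1 row, 2 columns)
grid.arrange(AdvTVPlot, AdvRadPlot, ncol = 2)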

Multi Fit Base Plots

# Set the layout:
# Using the `layout()` command:
layout(matrix(1:3, nrow = 1))

# Equivalently, using the `par()` command:
# par(mfrow = c(1, 3)) # Specify the number of rows and columns

# Set the plot Domain
pdom <- c(0, 300) #Plot Domain

#Generate the plots
plot(formula = Sales ~ TV, data = adv, xlim = pdom,
     main = "Sales Given TV Advertising")
plot(formula = Sales ~ Newspaper, data = adv, xlim = pdom,
     main = "Sales Given Newspaper Advertising")
plot(formula = Sales ~ Radio, data = adv, xlim = pdom,
     main = "Sales Given Radio Advertising")

GGPlot
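The chunks below assume that ggplot2, plotly (for ggplotly()) and magrittr (for the %>% pipe) have been loaded, e.g. in a setup chunk:

library(ggplot2)  # plotting
library(plotly)   # ggplotly() for interactive output
library(magrittr) # the %>% pipe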

Television Advertising
adv$MeanAdvertising <- rowMeans(adv[, names(adv) != "Sales"])

AdvTVPlot <- ggplot(data = adv, aes(x = TV, y = Sales, col = MeanAdvertising)) +
  geom_point() + 
  theme_bw() +
  stat_smooth(method = 'lm', formula = y ~ poly(x, 2, raw = TRUE), se = FALSE) +
 ##stat_smooth(method = 'lm', formula = y ~ log(x), se = FALSE) +
  labs(col = "Mean Advertising", x= "TV Advertising") 

if(knitr::is_latex_output()){
 AdvTVPlot 
} else {
  AdvTVPlot%>% ggplotly()
}
Radio Advertising
AdvRadPlot <- ggplot(data = adv, aes(x = Radio, y = Sales, col = MeanAdvertising)) +
  geom_point() + 
  theme_bw() +
  labs(col = "Mean Advertising", x = "Radio Advertising") + 
  geom_smooth(method = 'lm')


# AdvRadPlot %>% ggplotly() -- plotly doesn't work with knitr/LaTeX output, so test the output format and choose accordingly.
# These could be combined into an interactive graph by wrapping the plot in ggplotly().

if(knitr::is_latex_output()){
  AdvRadPlot 
} else {
 AdvRadPlot %>% ggplotly()
}
Newspaper Advertising
AdvNewsPlot <- ggplot(data = adv, aes(x = Newspaper, y = Sales, col = MeanAdvertising)) +
  geom_point() + 
  theme_bw() +
  labs(col = "Mean Advertising", x= "Newspaper Advertising")

# AdvNewsPlot %>% ggplotly() -- plotly doesn't work with knitr/LaTeX output, so test the output format and choose accordingly.
# These could be combined into an interactive graph by wrapping the plot in ggplotly().

if(knitr::is_latex_output()){
  AdvNewsPlot
} else {
AdvNewsPlot %>% ggplotly()
}

Base Plot

pdom <- c(0, 300) #Plot Domain
plot(formula = Sales ~ TV, data = adv, xlim = pdom,
     main = "Sales Given TV Advertising")

plot(formula = Sales ~ Newspaper, data = adv, xlim = pdom,
     main = "Sales Given Newspaper Advertising")

plot(formula = Sales ~ Radio, data = adv, xlim = pdom,
     main = "Sales Given Radio Advertising")

(c) Find the Correlation Coefficient

The correlation coefficient can be found using the cor function. It is a measure of the strength of a linear relationship, ranging from -1 to 1, where a value of 0 represents no linear relationship.

The Pearson correlation coefficient tends to be used over other measures and its value is given by: \[ r_{xy} = \frac{\sum^{n}_{i= 1} \left( x_i - \overline{x} \right) \left( y_i - \overline{y} \right)}{\sqrt{\sum^{n}_{i= 1} \left( x_i - \overline{x} \right)^2 } \sqrt{\sum^{n}_{i= 1} \left( y_i- \overline{y} \right)^2 }} \] Some of the assumptions underlying the correlation coefficient are: 1

  • Independent observations
  • Normally distributed observations (i.e. they follow a bell curve)
  • Homoscedasticity 2
    • This means equal variance of the observations
      • i.e. there is no pattern in the spread of the points; the scatter should look like a rectangle, not a triangle (a fan shape)
  • Linearity
    • the points should follow a straight line, not a curve

The correlation coefficient in this case can be found using cor(x = adv$TV, y = adv$Sales), which gives \(r \approx\) 0.78. This might not be a meaningful value, however, because the variance of Sales appears to increase as advertising increases; if that is overlooked, the Pearson correlation coefficient suggests a reasonably strong positive linear relationship.
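For reference, the call itself (its output is the value quoted above):

cor(x = adv$TV, y = adv$Sales)   # ~ 0.78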

(d) Assess the accuracy of the parameter estimates

The parameter estimates can be returned by summarising the model object with summary():

lmMod <- lm(formula = Sales ~ TV, data = adv)
lmSum <- summary(lmMod)
lmSum
## 
## Call:
## lm(formula = Sales ~ TV, data = adv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
lmSum$coefficients
##               Estimate  Std. Error  t value    Pr(>|t|)
## (Intercept) 7.03259355 0.457842940 15.36028 1.40630e-35
## TV          0.04753664 0.002690607 17.66763 1.46739e-42
lmMod2 <- lm(formula = Sales ~ TV, data = adv)

In this case we have:

  • a slope of \(\beta_1 \approx\) 0.048 \(\pm\) 0.0027
  • an Intercept of \(\beta_0 \approx\) 7 \(\pm\) 0.46

The standard deviation of a statistic used as an estimator of a population parameter is often referred to as the standard error of the estimator (S.E.); these are the \(\pm\) values specified above:

  • Standard Error of the Slope Coefficient \(\sigma_{\beta_1} = \frac{s}{\sqrt{SS_x}} = 0.0027\)
  • Standard Error of the Intercept Coefficient \(\sigma_{\beta_0} = s\sqrt{\frac{1}{n}+ \frac{\overline{x}^2}{SS_x}} = 0.46\)

Where:

  • \(s\) is the residual standard error, i.e. the sample standard deviation of the residuals, \(s = \sqrt{\frac{\textbf{RSS}}{n-2}}\)
  • \(SS_x = \sum^{n}_{i= 1} \left[ x^2_i \right] - n\cdot \left( \overline{x} \right)^2\) is the sum of squared deviations of \(x\) about its mean
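As a sanity check, these quantities can be computed by hand and compared with the Std. Error column of the summary above; a minimal sketch (the variable names are mine):

x <- adv$TV
n <- length(x)
s   <- sqrt(sum(residuals(lmMod)^2) / (n - 2))   # residual standard error, ~ 3.26
SSx <- sum(x^2) - n * mean(x)^2                  # sum of squared deviations of x
s / sqrt(SSx)                                    # SE of the slope,     ~ 0.0027
s * sqrt(1/n + mean(x)^2 / SSx)                  # SE of the intercept, ~ 0.46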

You may also want the standard deviation of the residuals (the distance along the y-axis of a point from the regression line); this is known as the Residual Standard Error and is calculated via the Ordinary Least Squares method 3. It is given by:

\[\begin{align} \sigma_{\varepsilon} = S.E. & = \sqrt{\frac{\textbf{RSS}}{N}}\\ \ \\ &= \sqrt{\frac{\sum^{n}_{i= 1} \left[ \left( y_i - \hat{y}_i \right)^2 \right]}{N}} \end{align}\]

Which you’ll notice is identical to the RMSE.

So, by the empirical rule, \(2\times \text{S.E.}\) would represent an approximate 95% confidence interval (rather than a prediction interval) for the parameter estimates. Drawing such a confidence interval:

paramint <- confint(object = lm(adv$Sales ~ adv$TV), level = 0.95) %>% signif(2)
paramint
##             2.5 % 97.5 %
## (Intercept) 6.100  7.900
## adv$TV      0.042  0.053

So, drawing from this, we could expect (with only a 5% probability of incorrectly rejecting the null hypothesis that there is no relationship) that, in the absence of advertising, sales would fall between 6.1 and 7.9.

With the same degree and type of certainty, it could also be concluded that for every $1000 increase in TV advertising, sales will increase by between 42 and 53 units.

(f) Test the significance of the slope of the linear model

If it is appropriate to fit a linear model to the data, then we can test for a relationship between the variables by considering whether or not the slope is non-zero, \(\beta_1 \neq 0\). This is because a zero coefficient would mean the model predicts \(Y = C + \varepsilon\), i.e. \(X\) is not a feature/predictor of \(Y\); however, \(Y\) may still be a function of (or rather a response variable of) other factors that are ‘behind the scenes’.4

So our hypotheses would be:

\[\begin{align} H_0 : \enspace \beta_1 &= 0 \qquad ( \small {\text{ The null hypothesis is that nothing is related}})\\ H_1 : \enspace \beta_1 &\neq 0 \end{align}\]

So our interest is to determine how far from 0 our estimated \(\beta_1\) value needs to be before we can conclude:

The estimated value of \(\beta_1\) is so far from zero that we can conclude it is not zero at some significance level \(^{\dagger}\).

\(\dagger\) at some low probability of incorrectly rejecting the null hypothesis

The problem is defining how far from zero is far enough. For this we use the standard error of the estimate from above: an estimate observed too many standard errors away from zero would be very unlikely to occur if the true slope really were zero.

Choosing a Parametric method

A statistical method that relies on an underlying assumption about the statistical distribution of the data is known as a parametric method; in this case, it is a fundamental assumption of Ordinary Least Squares linear regression that the errors are normally distributed.5

This is a situation where we use Student’s t-test: because this is a sample and the population standard deviation \(\left( \sigma \right)\) is not known, the confidence interval for the mean must be made broader to account for the fact that the sample standard deviation \(s\) is being used to estimate \(\sigma\).
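The widening can be seen by comparing critical values; a quick check (with 198 degrees of freedom the difference is small):

qnorm(p = 0.975)                   # ~ 1.96, when sigma is known
qt(p = 0.975, df = nrow(adv) - 2)  # ~ 1.97, when sigma is estimated by s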

Because the sampled population is normally distributed, the sampling distribution of \(\bar{x}\) will be normally distributed 6 (regardless of sample size) and centred about \(\mu\) with a standard deviation of \(\frac{\sigma}{\sqrt{n}}\). If the population were non-normal, the sampling distribution would still be approximately normal for \(n\geq 30\).

Because \(\frac{\sigma}{\sqrt{n}}\) is the standard deviation of the sample mean \(\bar{x}\), it is referred to as the Standard Error of the mean 7. We could calculate the critical value along the standard normal distribution corresponding to the sampling distribution in order to determine probabilities; however, \(\sigma\) is unknown, and using \(s\) instead does not give a normal distribution: the distribution it gives is Gosset’s Student’s t-distribution 8:

\[\begin{align} t = \frac{\overline{x}- \mu}{\frac{s}{\sqrt{n}}} \end{align}\]

So in this case our test statistic will be:

\[\begin{align} t = \frac{\hat{\beta}_1- 0}{\text{SE}\left( \hat{\beta}_1 \right)} \end{align}\]
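Computed directly from the coefficient table (a quick check; the variable names are mine, and the result should match the t value of roughly 17.7 reported by summary() above):

b1    <- lmSum$coefficients["TV", "Estimate"]
se_b1 <- lmSum$coefficients["TV", "Std. Error"]
tstat <- (b1 - 0) / se_b1                                    # ~ 17.7
2 * pt(abs(tstat), df = nrow(adv) - 2, lower.tail = FALSE)   # two-sided p-value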

In order to perform this test in R we can use qnorm and qt to return critical values; t.test will perform a hypothesis test directly from input data, but that’s not suitable here.

tcritval <- qt(p = 0.05, df = nrow(adv) - 2)
tcritval %>% signif(2)
## [1] -1.7

So the critical t-value is -1.7 and, from the summary call before, the t-statistic is about 17.7, which far exceeds this; as a matter of fact, further over to the right the p-value is reported as \(< 2\times 10^{-16}\).

In practice you’d just read off the p-values and pick the ones with * to the right of them; the more *, the more significant.

Hence we reject the hypothesis that no relationship exists, with an extremely low probability of incorrectly doing so (i.e. a low probability of committing a Type I error).

(g) Plot the straight line within the scatter plot and comment

Base Plot

In order to plot this using base graphics, feed the model object, i.e. lm(Y ~ X), to a call to abline() in order to plot the fitted line over the top of the base plot, so all together it might look like: 9

Form <- Sales ~ TV
Lmodel <- lm(formula = Form, data = adv, na.action = na.exclude)
plot(Form, data = adv)
abline(Lmodel)

Or you could do it like this, but I think the way above is better syntax because it behaves better with the predict function and follows a tidyverse-like style:

Lmodel <- lm(adv$Sales ~ adv$TV)
plot(x = adv$TV, y = adv$Sales)
abline(a = Lmodel$coefficients[1], b = Lmodel$coefficients[2])
plot(formula = Sales ~ TV, data = adv, xlim = pdom,
     main = "Sales Given TV Advertising")

abline(lmMod)

GGplot

AdvTVPlot <- ggplot(data = adv, aes(x = TV, y = Sales, col = MeanAdvertising)) +
  geom_point() + 
  theme_bw() +
  stat_smooth(method = 'lm', formula = y ~ x, se = FALSE) 

AdvTVPlot

If we needed to feed ggplot a specific model, we could do that like this; it’s a bit involved and you’d probably rather not do it this way, but if you really need to:

AdvTVPlot <- ggplot(data = adv, aes(x = TV, y = Sales, col = MeanAdvertising)) +
  geom_point() + 
  theme_bw() +
  stat_smooth(
      method = "lm",
      mapping = aes( y = predict(lmMod)
                     )
      )

AdvTVPlot

(h) Assess the overall accuracy of the model

The model can be assessed by considering the:

  • Coefficient of determination \(R^2\) which is the proportion of variance in the data that is explained by the model
  • Only in the case of simple linear regression is \(R^2 = (r)^2\)
  • The Residual Standard Error is the standard deviation of the residuals, i.e. it is the expected distance between each point to the regression line, taken along the \(y\)-axis.

Terminology

The textbook makes, in my opinion, a mistake in that it refers to the Root Mean Square Error (RMSE) as the Residual Standard Error (RSE) 10. This is true as far as it goes: the standard error of the residuals (\(\varepsilon\)) would be the RMSE, so we would have RMSE \(= \sigma_{\varepsilon}\); that’s fine.

The issue is that another common term, the Relative Squared Error (RSE), is also often used 11, so the abbreviation is ambiguous. Henceforth I will:

  • Refer to the standard error of the residuals (\(\sigma_{\varepsilon}\)) as RMSE:
    • \(\text{RMSE} = \sqrt{\frac{\sum{\varepsilon ^2}}{n}}\)
  • Refer to the Relative Squared Error as RSE:
    • \(\text{RSE} = \frac{\sigma_{\varepsilon} ^2}{\sigma_y ^2} = \frac{\sum{\left( y-\hat{y} \right)^2 }}{ \sum{ \left( y-\bar{y} \right)^2 } }\)
      • The advantage of the RSE is that it can be compared between models with different units, whereas the RMSE cannot; just another tool in the belt I suppose (a quick computation follows this list).
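The RSE needs no package; a quick computation for the fitted TV model (it is equivalently \(1 - R^2\)):

# Relative Squared Error = RSS / TSS = 1 - R^2
sum(residuals(lmMod)^2) / sum((adv$Sales - mean(adv$Sales))^2)   # ~ 0.39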

Root Mean Square Error

Recall that the model was of the form \(Y = \beta_1 X + \beta_0 + \varepsilon\), the RMSE (Root Mean Square Error) is the standard deviation of \(\varepsilon\) as measured along the \(Y\)-axis:

\[\begin{align} \sigma_{\varepsilon} = \sqrt{\frac{\sum^{n}_{i= 1} \left[ \left( y_i - \hat{y}_i \right)^2 \right]}{N}} \end{align}\]

This value can be returned from R by investigating the anova table:

anova(lmMod)
## Analysis of Variance Table
## 
## Response: Sales
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## TV          1 3314.6  3314.6  312.14 < 2.2e-16 ***
## Residuals 198 2102.5    10.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table it can be seen that the average squared residual is 10.6

\[\begin{align} \text{mean}\left( \varepsilon^2\right) &= 10.6\\ \implies \frac{1}{n} \cdot \sum^{n}_{i=1} \left[ \varepsilon_i^2 \right] & =10.6\\ \implies \frac{1}{n} \cdot \sum^{n}_{i=1} \left[ \left( \hat{y}_i - y_i \right)^2 \right] & =10.6\\ \implies \sqrt{\frac{1}{n} \cdot \sum^{n}_{i=1} \left[ \left( \hat{y}_i - y_i \right)^2 \right] } & = 3.2\\ \ \\ \implies \sigma_{\varepsilon} &= 3.2 \end{align}\]

Thus we may conclude that we expect the model to predict sales to within about \(\pm\) 3.2 units, which is reasonably accurate and hence useful.
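Note that the ANOVA mean square already divides by the degrees of freedom (198), so the 3.2 above is effectively the quantity that summary() reports as the Residual standard error; a quick check of both conventions:

sqrt(mean(residuals(lmMod)^2))   # divides by n = 200,     ~ 3.24
sigma(lmMod)                     # divides by n - 2 = 198, ~ 3.26 (matches summary())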

Coefficient of Determination

The coefficient of determination is the proportion of the variation in the response that is explained by the model:

\[\begin{align} R^2 &= \frac{TSS-RSS}{TSS}\\ &= \frac{3315}{3315+2103}\\ \ \\ &= 0.612 \end{align}\]
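A quick check of this arithmetic using the ANOVA table above (variable names are mine):

aovTab <- anova(lmMod)
TSS <- sum(aovTab$`Sum Sq`)   # 3314.6 + 2102.5
RSS <- aovTab$`Sum Sq`[2]
(TSS - RSS) / TSS             # ~ 0.612, matches lmSum$r.squared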

In practice we would simply extract the coefficient of determination (\(R^2\)) from the model-summary:

lmSum$r.squared %>% round(3) %>% scales::percent()
## [1] "61.2%"

This value suggests that a reasonable amount of the variation is explained by the model, but perhaps a non-linear model could explain more of the variance. (Be careful: a large coefficient of determination doesn’t necessarily mean that the slope is significantly different from 0.)

Residual Analysis

layout(matrix(1:4, nrow = 2))
plot(lmMod)

  • The residual plot does not show randomly scattered residuals; there is a slight logarithmic trend, which violates the assumptions of the linear model and undermines the predictive capacity of the model in this case.
  • The variance is also non-constant; for a linear model to be used the errors must be homoscedastic (i.e. have constant variance). This is not the case here, implying that the assumptions of the linear model have been violated and hence this model may not be appropriate 12
  • The standardised residuals should be normally distributed with a mean of 0 and a standard deviation of 1; whilst the standard deviation appears acceptable, the standardised residuals are centred around \(\approx 3/4\) with a positive upward slope, again suggesting that the constant-variance assumption is violated.
  • The normal Q-Q plot is close to a straight line, so the residuals are probably normally distributed; the main issue is the heteroscedasticity of the data.
  • The Cook’s Distance plot suggests that there are some points with a high amount of leverage, so perhaps there are some outliers, or perhaps the increasing variance is undermining the appropriateness of the model.

(i) Use the model to make predictions

When making predictions it is important to ensure that the names of the data frame are syntactically valid; otherwise you will have a bad day trying to get predict and ggplot2 to work, because specifying the data frame names in a formula will be difficult. Make sure that names are always syntactically valid.

What is also important is that you create your model with the correct syntax. If you create your model like this:

mymodelWRONG <- lm(adv$Sales ~ adv$TV)

you won’t be able to predict data like this:

predict(object = mymodelWRONG, newdata = data.frame("TV" = 300))

you’ll just get an error that says 'newdata' had 1 row but variables found have 200 rows. You have to give the variables corresponding names so that the model object can save them for later and make the connection. For instance, if you inspect the terms from the model above you will get:

mymodelWRONG[["terms"]]

which outputs, at the tail end:

adv$Sales    adv$TV 
"numeric" "numeric" 

whereas if you create the model like this:

lmModCORRECT <- lm(formula = Sales ~ TV, data = adv)
predict(object = lmModCORRECT, newdata = data.frame("TV" = 300))

and inspect the terms with:

lmModCORRECT[["terms"]]

you will get this as output

  Sales        TV 
"numeric" "numeric" 

where Sales and TV match the output of names(adv), and so they can be used when calling predict. You should not use attach, it will cause problems later; however, it can be nice to use attach just before a predict call to get auto-completed names, and then remove the attach and re-execute the script.

So always use lm(formula = Y ~ X, data = myDF), because it works the best; you have to use the same syntax/format when using predict or ggplot anyway, so there’s no reason not to use the same syntax throughout.

Also, the lecturer said to use lists; I reckon use data frames, because that way your newdata matches the input data one-to-one. Moreover:

  • It makes it far simpler to assign names because, again, the input/output data will all be in the same format
  • When creating Lasso regression models you have to use matrices as input data, and it’s easier to set your workflow up to go from data frame to matrix (you have to do this in predictive modelling)

Predict the Data

One Point
input <- 3
output <- predict(object = lmMod, newdata = data.frame("TV" = 3))
predDatasingle <- data.frame(input, output)
names(predDatasingle) <- names(adv[c(1,4)])

print(predDatasingle)
##   TV    Sales
## 1  3 7.175203
Multiple points
input <- seq(from = 100, to = 900, by = 100)
output <- predict(object = lmMod, newdata = data.frame("TV" = input))

predDF <- data.frame(input, output)
names(predDF) <- names(adv[c(1,4)])
predDF
##    TV    Sales
## 1 100 11.78626
## 2 200 16.53992
## 3 300 21.29359
## 4 400 26.04725
## 5 500 30.80091
## 6 600 35.55458
## 7 700 40.30824
## 8 800 45.06191
## 9 900 49.81557
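predict() can also return confidence or prediction intervals around these point estimates; a minimal sketch:

predict(object   = lmMod,
        newdata  = data.frame("TV" = c(100, 200, 300)),
        interval = "prediction",
        level    = 0.95)

Note that most of the inputs above (400 to 900) lie well outside the observed range of TV (max ≈ 296), so those predictions are extrapolations and should be treated with caution.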

Question 02

(a) Upload the Auto Dataset and explore it.

(b) Construct scatter plots to visualize the relationship between mpg and displacement, weight and acceleration:

Repeat the analysis in Q1 (c) to (i) using mpg and weight.


  1. Correlation Coefficient↩︎

  2. PennState University↩︎

  3. i.e. chosing \(\beta_0\) and \(\beta_1\) to minimise \(\left(\textbf{RSS} = \sum^{n}_{i= 1} \left[ \left( y_i - \hat{y_i} \right)^2 \right] \right)\)↩︎

  4. Refer to page 67 of the text book, section [3.1.2]↩︎

  5. Mendenhall, Introduction to Probability & Statistics p. 254 [7.4]↩︎

  6. By the Central Limit Theorem↩︎

  7. Mendenhall, Introduction to Probability & Statistics p. 254 [7.4]↩︎

  8. Mendenhall, Introduction to Probability & Statistics p. 254 [7.4]↩︎

  9. na.exclude will pad values extracted so lengths are the same, na.omit will not↩︎

  10. Refer to Page 69 of the TB for the RMSE definition; the TB divides by DF, which is probably more correct than dividing by sample size.↩︎

  11. An Introduction to Data Science : Model Evaluation - Regression↩︎

  12. refer to page 96 of the TB; log or exp transforming may be appropriate here. The data is not homoscedastic and is hence said to be heteroscedastic.↩︎