Multiple Linear Regression

Material of Tue 19 March 2019, week 3

Question 01 - Multiple Linear Regression

(a) Upload the data “Advertising.csv” and explore it.

First import the data and investigate it:

library(magrittr)  # provides the %>% pipe used throughout
library(corrplot)  # correlation plots (used in part (b))
library(ggplot2)   # plotting
library(scales)    # percent() formatting

adv <- read.csv(file = "Datasets/Advertising.csv", header = TRUE, sep = ",")
head(adv)
##      TV Radio Newspaper Sales
## 1 230.1  37.8      69.2  22.1
## 2  44.5  39.3      45.1  10.4
## 3  17.2  45.9      69.3   9.3
## 4 151.5  41.3      58.5  18.5
## 5 180.8  10.8      58.4  12.9
## 6   8.7  48.9      75.0   7.2
    writeLines("\n")
    print("***Dimensions***")
## [1] "***Dimensions***"
    writeLines("\n")
dim(adv)
## [1] 200   4
    writeLines("\n")
    print("***Summary***")
## [1] "***Summary***"
    writeLines("\n")
summary(adv)
##        TV             Radio          Newspaper          Sales      
##  Min.   :  0.70   Min.   : 0.000   Min.   :  0.30   Min.   : 1.60  
##  1st Qu.: 74.38   1st Qu.: 9.975   1st Qu.: 12.75   1st Qu.:10.38  
##  Median :149.75   Median :22.900   Median : 25.75   Median :12.90  
##  Mean   :147.04   Mean   :23.264   Mean   : 30.55   Mean   :14.02  
##  3rd Qu.:218.82   3rd Qu.:36.525   3rd Qu.: 45.10   3rd Qu.:17.40  
##  Max.   :296.40   Max.   :49.600   Max.   :114.00   Max.   :27.00
    writeLines("\n")
    print("***Structure***")
## [1] "***Structure***"
    writeLines("\n")
str(adv)
## 'data.frame':    200 obs. of  4 variables:
##  $ TV       : num  230.1 44.5 17.2 151.5 180.8 ...
##  $ Radio    : num  37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
##  $ Newspaper: num  69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
##  $ Sales    : num  22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...
    writeLines("\n")

From this we can tell that there is one output variable (Sales), three input variables and 200 observations.

(b) Find the Covariance and Correlation Matrix of Sales, TV, Radio and Newspaper.

Base Packages

pairs(x = adv)
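pairs() gives the scatterplot matrix. The covariance matrix that the question also asks for comes from cov(); cor() is simply this matrix rescaled so that every entry lies in \([-1, 1]\):

cov(x = adv)  # covariance matrix; units are products of the variables' units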

corrplot

In order to use corrplot, first create a correlation matrix using cor(adv), then feed that matrix to corrplot with the command corrplot(cor(adv)):


corMat <- cor(x = adv)
corrplot(corr = corMat, method = "ellipse", type = "lower")

From this we can see that there is a significant amount of correlation between Radio and Newspaper (more so even than between Newspaper and Sales); we should consider this relationship between predictors when deciding upon our model.

(c) Construct the multiple linear regression model and find the least square estimates of the model

A multiple linear regression would give the model:

advModMult <- lm(formula = Sales ~ TV + Radio + Newspaper, data = adv)
advModMult
## 
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = adv)
## 
## Coefficients:
## (Intercept)           TV        Radio    Newspaper  
##    2.938889     0.045765     0.188530    -0.001037

This gives that the appropriate linear model is:

\[ Y_{Sales} = 2.9389 + 0.0458 \times \text{TV} + 0.1885 \times \text{Radio} - 0.0010 \times \text{Newspaper} \] The fact that Newspaper has a negative coefficient despite being positively correlated with Sales is indicative of the weak effect newspaper advertising has on sales, as well as of the correlation between Newspaper and Radio.
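That correlation between the predictors can be confirmed directly; this simply reads off one entry of the correlation matrix from part (b):

cor(adv$Newspaper, adv$Radio)  # Newspaper's apparent link to Sales largely travels through Radio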

(d) Test the significance of the parameters and find the resulting model for Sales in terms of the advertising modes TV, Radio and Newspaper

First, summarise the model:

advModMult %>% summary()
## 
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = adv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## Radio        0.188530   0.008611  21.893   <2e-16 ***
## Newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

A summary of the model provides, for each coefficient \(\beta_i\), the hypothesis test: \[ H_0: \enspace \beta_i = 0 \qquad \text{vs.} \qquad H_a: \enspace \beta_i \neq 0 \]

For TV and Radio there would be an extremely low probability of incorrectly rejecting the null hypothesis, hence it is accepted that those coefficients are non-zero.

For Newspaper, by contrast, there is a high probability (0.86) of incorrectly rejecting the null hypothesis; it should therefore not be rejected, and newspaper advertising should not be seen as a significant predictor of sales in this linear model.
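These p-values can also be extracted programmatically from the coefficient matrix of the summary object:

summary(advModMult)$coefficients[, "Pr(>|t|)"]  # t-test p-value for each term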

It is hence appropriate to remove Newspaper from the model via backwards selection, which gives:

advModMultb1 <- lm(formula = Sales ~ TV + Radio, data = adv)
advModMultb1
## 
## Call:
## lm(formula = Sales ~ TV + Radio, data = adv)
## 
## Coefficients:
## (Intercept)           TV        Radio  
##     2.92110      0.04575      0.18799
advModMultb1.sum <- summary(advModMultb1)
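Equivalently, the reduced model could be produced from the full fit with update() (a minimal sketch):

update(advModMult, . ~ . - Newspaper)  # drop Newspaper from the formula and refit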

\[ \text{Sales} = 2.9211 + 0.0458 \times \text{TV} + 0.1880 \times \text{Radio} \]

All parameters are highly significant and hence the model is deemed adequate. The model explains roughly 90% of the variation in sales; this value can be found by appending $r.squared to the model summary.

(e) Assess the overall accuracy of the model.

In order to assess the model, consider the anova table and derive the RMSE and \(R^2\) values:

advModMultb1.anova <- anova(advModMultb1)
advModMultb1.anova
## Analysis of Variance Table
## 
## Response: Sales
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## TV          1 3314.6  3314.6 1172.50 < 2.2e-16 ***
## Radio       1 1545.6  1545.6  546.74 < 2.2e-16 ***
## Residuals 197  556.9     2.8                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From this we can see that each \(F\) statistic is associated with a very small p-value, consistent with the significance tests reported by the summary call.

RMSE

The RMSE is the root mean square error: the standard deviation of the residuals of the model as measured along the \(y\)-axis (recall that a standard error is the standard deviation of an estimated model parameter).

\[ \text{RMSE} = \sqrt{\frac{\sum{\varepsilon^2}}{n}} \]

rmse <- function(model){
  # Textbook RMSE: sqrt(sum(model$residuals^2)/length(model$residuals)).
  # Because OLS residuals have mean zero, sd() gives nearly the same value
  # (it divides by n - 1 rather than n).
  sd(model$residuals)
}

rse <- function(model){
  # Despite the name, this returns the fraction of the response variance left
  # unexplained, i.e. 1 - R^2 (not the residual standard error of 1.686).
  # Fitted values and residuals are uncorrelated in OLS, so
  # var(residuals - fitted) = var(residuals + fitted) = var(y).
  var(model$residuals)/var(model$residuals - model$fitted.values)
}

data.frame("RMSE" = rmse(advModMultb1) , "RSE" = rse(advModMultb1) )
##       RMSE       RSE
## 1 1.672891 0.1028057

So the typical error of the model is \(\pm\) 1.7 units of sales. If the residuals were approximately normal, we would expect an observed value to lie within about \(1.96 \times \text{RMSE}\) of the fitted value 95% of the time; strictly speaking this is a prediction interval for a new observation, not a confidence interval for the mean response, and it ignores the uncertainty in the estimated coefficients.
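R's predict() computes such an interval properly, accounting for the coefficient uncertainty as well; a minimal sketch (the spend levels here are arbitrary illustrative values):

predict(advModMultb1,
        newdata = data.frame(TV = 150, Radio = 25),  # hypothetical spends
        interval = "prediction", level = 0.95)  # 95% prediction interval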

Coefficient of Determination

The coefficient of determination is 89.7%; it can be extracted by appending $r.squared to the model summary (i.e. summary(lm(Y ~ X1 + X2))$r.squared).
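For example, using the summary object stored earlier:

advModMultb1.sum$r.squared      # coefficient of determination
advModMultb1.sum$adj.r.squared  # adjusted for the number of predictors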

The \(R^2\) value will be very nearly the same between the initial model and the backwards selected model, however given that the initial model had non-significant predictors it could be considered as over-parameterized, i.e. it violates Occam’s Razor.

The coefficient of determination is the proportion of the total variation in the data that is explained by the model; in this case, using the sums of squares from the ANOVA table above:

\[ R^2 = \frac{TSS-SSE}{TSS} = \frac{3314.6 + 1545.6}{3314.6 + 1545.6 + 556.9} \approx 89.7\% \]

Coefficient of Determination from ANOVA

The \(R^2\) value is derived from the ANOVA table as follows:

advModMultb1.anova <- anova(advModMultb1)
TSS_Multb1 <- advModMultb1.anova$`Sum Sq` %>% sum()
RSS_Multb1 <- advModMultb1$residuals^2 %>% sum()

((TSS_Multb1- RSS_Multb1)/(TSS_Multb1)) %>% signif(3) %>% percent()   #Requires `scales` package
## [1] "89.7%"

Residual analysis

When determining the accuracy or performance of a model, always turn your mind to residual analysis (i.e. how close to normally distributed the residuals are); this is performed below.

(f) Calculate the predicted values and residuals

The residuals and fitted values may be extracted directly from the model object (you could always derive them from first principles, but using the tools at your disposal is quicker, less error-prone and makes for more readable code):

ResDF <- data.frame("Input" = advModMultb1$fitted.values + advModMultb1$residuals,  # observed Sales values
                    "Output" = advModMultb1$fitted.values,
                    "Error" = advModMultb1$residuals)
ResDF %>% head()
##   Input   Output      Error
## 1  22.1 20.55546  1.5445354
## 2  10.4 12.34536 -1.9453623
## 3   9.3 12.33702 -3.0370177
## 4  18.5 17.61712  0.8828840
## 5  12.9 13.22391 -0.3239081
## 6   7.2 12.51208 -5.3120845

predict() will also return the fitted values if no newdata is specified.
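A quick sanity check of that claim:

all.equal(predict(advModMultb1), advModMultb1$fitted.values)  # should be TRUE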

(g) Plot the residuals against the predicted values

ggplot

ggplot(data = ResDF, aes(x = Output, y = Error, col = Error )) +
  geom_point() +
  theme_bw() +
  stat_smooth(col = "grey")+
  stat_smooth(method = "lm", se = FALSE) +
scale_color_gradient(low = "indianred", high = "royalblue") +
  labs(y = "Residuals", x = "Predicted Value", title = "Residuals Plotted against Output")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Base

plot(Error ~ Output, data = ResDF, pch = 19, col = "IndianRed", main = "Residuals against Predictions")
smooth <- loess(formula = Error ~ Output, data = ResDF, span = 0.8)
ord <- order(ResDF$Output)  # lines() needs the x values in increasing order
lines(ResDF$Output[ord], predict(smooth)[ord], col = "Blue")
abline(lm(Error ~ Output, data = ResDF), col = "Purple")

From this it can be seen that the residuals are centred around zero, but they are not normally distributed; the model appears to fail its error assumptions most clearly for predicted values \(\in \left( 10, 20 \right)\).

(h) Plot the histogram of the residuals

ggplot

Absolute Count
ggplot(data = ResDF, aes(x = Error, col = Output)) +
  geom_histogram() +
  theme_classic() +
  labs(y = "Absolute Count", x = "Residual Value", title = "Residual Distribution") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Density
# overlay the histogram with a normal density using the residuals' own mean and sd
ggplot(ResDF, aes(x = Error)) +
  geom_histogram(aes(y = stat(density))) +
  stat_function(
    fun = dnorm, 
    args = list(mean = mean(ResDF$Error), sd = sd(ResDF$Error)), 
    lwd = 2, 
    col = 'IndianRed'
  ) +
  theme_classic() +
  labs(title = "Histogram of Residuals", y = "Density of Observations", x = "Residuals")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

White Noise

We can visualise the residuals as white noise:

Our Residuals
# Put some White Noise on the ResDF
ResDF <- cbind(ResDF, "Wnoise" = rnorm(nrow(ResDF), mean = 0, sd = sd(ResDF$Error)))


## Our Residuals
ggplot(ResDF, aes(x = Input, y = Error, col = Output)) +
  geom_line() +
  geom_abline(slope = 0, intercept = 0, lty = "twodash", col = "IndianRed") +
  theme(legend.position = "none") + 
  labs(title = "Distribution of Residuals")

## White Noise

ggplot(ResDF, aes(x = Input, y = Wnoise, col = Output)) +
  geom_line() +
  geom_abline(slope = 0, intercept = 0, lty = "twodash", col = "IndianRed") +
  theme(legend.position = "none") + 
  labs(title = "Distribution of Residuals")

Comparing the two plots clearly shows that the residuals are not simple white noise: unlike the simulated series, they contain visible structure.

Base

hist(ResDF$Error
     , breaks = 30,
     prob=TRUE,
     lwd=2,
    main = "Residual Distribution",
    xlab = "Distance from the Model", border = "purple"
    ) 

# Overlay the normal density curve (curve() supplies its own x values)
curve(dnorm(x, mean(ResDF$Error), sd(ResDF$Error)), add=TRUE, col="purple", lwd=2)

# Mark the central 90% of the fitted normal distribution and the mean
lwr_conf <- qnorm(p = 0.05, mean(ResDF$Error), sd(ResDF$Error))
upr_conf <- qnorm(p = 1-0.05, mean(ResDF$Error), sd(ResDF$Error))
abline(v=upr_conf, col='pink', lwd=3)
abline(v=lwr_conf, col='pink', lwd=3)
abline(v=mean(ResDF$Error), lwd=2, lty='dotted')

It hence appears that the residuals are skewed left, which suggests an upper bound on the residual values; the histogram is sufficiently non-normal to reject the assumption that the residuals are normally distributed.
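To supplement the visual assessment, a formal normality test can be applied to the residuals; a minimal sketch using the Shapiro-Wilk test from base R:

shapiro.test(ResDF$Error)  # a small p-value is evidence against normality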

(i) Comment on the residual plots

The residuals can be analysed by plotting the model, which produces R's standard diagnostic plots:

plot(advModMultb1)

The diagnostic plots confirm that this is an inappropriate model because the residuals are non-normally distributed.

(j) Use the multivariate model for predictions

newdata =  data.frame("TV" = c(3, 1, 2), "Radio" = c(4, 5, 6))
Mod_Adv_Mult_b1 <- predict(object = advModMultb1, newdata)

mypreds <- data.frame("Input" = newdata, "Output" =  Mod_Adv_Mult_b1)
mypreds # Careful with the names; they work out as Input.name in this case.
##   Input.TV Input.Radio   Output
## 1        3           4 3.810341
## 2        1           5 3.906826
## 3        2           6 4.140575

Question 02 - Non-Linear Models: Use the Advertising data set

(a) Add the Interaction Term TV*Radio and test the significance of the interaction term

Reconsider the correlation plots:

Base

pairs(x = adv)

corrplot

adv[, names(adv)!="Newspaper"] %>% cor() %>% corrplot(method = "ellipse", type = "lower")

From this we can see that TV and Radio are essentially uncorrelated (whereas previously we saw correlation between Newspaper and Radio). Note, however, that low correlation between two predictors does not rule out an interaction effect on Sales; that is tested by fitting the interaction term below.

Create the multiple linear regression with interaction

In R's formula syntax TV * Radio already expands to TV + Radio + TV:Radio, so repeating the main effects in the call below is redundant (but harmless).

int_mod_adv <- lm(formula = Sales ~ TV * Radio + TV + Radio, data = adv)
int_mod_adv
## 
## Call:
## lm(formula = Sales ~ TV * Radio + TV + Radio, data = adv)
## 
## Coefficients:
## (Intercept)           TV        Radio     TV:Radio  
##    6.750220     0.019101     0.028860     0.001086
summary(int_mod_adv)
## 
## Call:
## lm(formula = Sales ~ TV * Radio + TV + Radio, data = adv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3366 -0.4028  0.1831  0.5948  1.5246 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
## TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
## Radio       2.886e-02  8.905e-03   3.241   0.0014 ** 
## TV:Radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9673 
## F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16
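Since the additive model Sales ~ TV + Radio is nested in the interaction model, the significance of the interaction can equivalently be checked with a partial F-test comparing the two fits:

anova(advModMultb1, int_mod_adv)  # does adding TV:Radio significantly improve the fit?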

(b) Give the resulting model after considering this interaction term.

The model will be of the form:

\[ \text{Sales} = 6.7502 + 0.0191 \times \text{TV} + 0.0289 \times \text{Radio} + 0.0011 \times \text{TV} \times \text{Radio} \]

Note that all the terms of the model are significant; hence we deem the model adequate.

(c) Construct the Polynomial Regression Model of order 3 and test the model significance

When creating polynomial models there are two parameterisations to choose from: raw and orthogonal polynomials.

# Orthogonal (removes the correlation between the coefficient estimates)

polymodel_Orth <- lm(Sales ~ poly(x = TV, degree = 3, raw = FALSE), data = adv)
summary(polymodel_Orth)
## 
## Call:
## lm(formula = Sales ~ poly(x = TV, degree = 3, raw = FALSE), data = adv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9734 -1.8900 -0.0897  2.0189  7.3765 
## 
## Coefficients:
##                                        Estimate Std. Error t value
## (Intercept)                             14.0225     0.2286  61.353
## poly(x = TV, degree = 3, raw = FALSE)1  57.5727     3.2322  17.812
## poly(x = TV, degree = 3, raw = FALSE)2  -6.2288     3.2322  -1.927
## poly(x = TV, degree = 3, raw = FALSE)3   4.0074     3.2322   1.240
##                                        Pr(>|t|)    
## (Intercept)                              <2e-16 ***
## poly(x = TV, degree = 3, raw = FALSE)1   <2e-16 ***
## poly(x = TV, degree = 3, raw = FALSE)2   0.0554 .  
## poly(x = TV, degree = 3, raw = FALSE)3   0.2165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.232 on 196 degrees of freedom
## Multiple R-squared:  0.622,  Adjusted R-squared:  0.6162 
## F-statistic: 107.5 on 3 and 196 DF,  p-value: < 2.2e-16
# Pure/Raw (has the advantage that you can interpret the coefficients, but the coefficients depend on each other and are hence correlated.)

polymodel <- lm(Sales ~ I(TV*TV*TV) + I(TV*TV) + (TV), data = adv)
summary(polymodel)    # I() inhibits the formula interpretation of *,
## 
## Call:
## lm(formula = Sales ~ I(TV * TV * TV) + I(TV * TV) + (TV), data = adv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9734 -1.8900 -0.0897  2.0189  7.3765 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.420e+00  8.641e-01   6.272 2.23e-09 ***
## I(TV * TV * TV)  5.572e-07  4.494e-07   1.240 0.216519    
## I(TV * TV)      -3.152e-04  2.022e-04  -1.559 0.120559    
## TV               9.643e-02  2.580e-02   3.738 0.000243 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.232 on 196 degrees of freedom
## Multiple R-squared:  0.622,  Adjusted R-squared:  0.6162 
## F-statistic: 107.5 on 3 and 196 DF,  p-value: < 2.2e-16
                            # so TV*TV*TV represents TV^3 rather than an interaction

polymodel_Raw<- lm(Sales ~ poly(x = TV, degree = 3, raw = TRUE), data = adv)
summary(polymodel_Raw)
## 
## Call:
## lm(formula = Sales ~ poly(x = TV, degree = 3, raw = TRUE), data = adv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9734 -1.8900 -0.0897  2.0189  7.3765 
## 
## Coefficients:
##                                         Estimate Std. Error t value
## (Intercept)                            5.420e+00  8.641e-01   6.272
## poly(x = TV, degree = 3, raw = TRUE)1  9.643e-02  2.580e-02   3.738
## poly(x = TV, degree = 3, raw = TRUE)2 -3.152e-04  2.022e-04  -1.559
## poly(x = TV, degree = 3, raw = TRUE)3  5.572e-07  4.494e-07   1.240
##                                       Pr(>|t|)    
## (Intercept)                           2.23e-09 ***
## poly(x = TV, degree = 3, raw = TRUE)1 0.000243 ***
## poly(x = TV, degree = 3, raw = TRUE)2 0.120559    
## poly(x = TV, degree = 3, raw = TRUE)3 0.216519    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.232 on 196 degrees of freedom
## Multiple R-squared:  0.622,  Adjusted R-squared:  0.6162 
## F-statistic: 107.5 on 3 and 196 DF,  p-value: < 2.2e-16

The 3rd degree coefficient is not significant, hence we consider the 2nd degree:

quadmod <- lm(formula = Sales ~ I(TV*TV) + TV, data = adv)
summary(quadmod)
## 
## Call:
## lm(formula = Sales ~ I(TV * TV) + TV, data = adv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6844 -1.7843 -0.1562  2.0088  7.5097 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.114e+00  6.592e-01   9.275  < 2e-16 ***
## I(TV * TV)  -6.847e-05  3.558e-05  -1.924   0.0557 .  
## TV           6.727e-02  1.059e-02   6.349 1.46e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.237 on 197 degrees of freedom
## Multiple R-squared:  0.619,  Adjusted R-squared:  0.6152 
## F-statistic: 160.1 on 2 and 197 DF,  p-value: < 2.2e-16

The quadratic term is not sufficiently significant (p \(\approx\) 0.056), so this model is rejected as well.

(d) Give the resulting selected model

The model selected is the multiple linear regression with the interaction from before:

\[ \text{Sales} = 6.7502 + 0.0191 \times \text{TV} + 0.0289 \times \text{Radio} + 0.0011 \times \text{TV} \times \text{Radio} \]
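As a rough cross-check of this selection, the candidate models fitted above can be compared on AIC (a sketch; lower is better):

AIC(advModMultb1, int_mod_adv, quadmod)  # the interaction model should win comfortably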