Material of Tue 26 2019, week 4
It’s worth bearing in mind that logistic regression is essentially linear regression adapted to a categorical output: rather than modelling \(Y = f(X) + \varepsilon\) directly, the probability of a class is modelled with the logistic function, \(\Pr(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\).
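Equivalently (a standard identity, stated here for reference), taking the log of the odds linearises the model, which is what makes it a generalised linear model:
\[
\log\left( \frac{\Pr(Y = 1 \mid X)}{1 - \Pr(Y = 1 \mid X)} \right) = \beta_0 + \beta_1 X
\]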
This is a parametric method because all that remains is to estimate the parameter coefficients, and it is supervised learning because there are known output values.
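The code below calls functions from several packages that are never explicitly loaded; this is a hedged reconstruction of the setup, inferred from the functions used in these notes:
library(ggplot2)    # ggplot()
library(latex2exp)  # TeX() for LaTeX axis labels
library(gridExtra)  # grid.arrange()
library(ggpubr)     # theme_classic2()
library(caret)      # confusionMatrix()
library(scales)     # percent()
library(ISLR)       # the Smarket dataset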
def <- read.csv(file = "Datasets/Default.csv")
head(def)
## X default student balance income
## 1 1 No No 729.5265 44361.625
## 2 2 No Yes 817.1804 12106.135
## 3 3 No No 1073.5492 31767.139
## 4 4 No No 529.2506 35704.494
## 5 5 No No 785.6559 38463.496
## 6 6 No Yes 919.5885 7491.559
dim(def)
## [1] 10000 5
summary(def)
## X default student balance income
## Min. : 1 No :9667 No :7056 Min. : 0.0 Min. : 772
## 1st Qu.: 2501 Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 5000 Median : 823.6 Median :34553
## Mean : 5000 Mean : 835.4 Mean :33517
## 3rd Qu.: 7500 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :10000 Max. :2654.3 Max. :73554
str(def)
## 'data.frame': 10000 obs. of 5 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num 730 817 1074 529 786 ...
## $ income : num 44362 12106 31767 35704 38463 ...
When glm is used on a categorical variable that is stored as the factor class in R, the categorical variable will be encoded as dummy variables (i.e. its levels will be represented internally as 0/1 indicators). I thought it might be better to do this encoding yourself, because otherwise you have to figure out which dummy variable corresponds to what; however, at p. 157 of the textbook a different approach is taken: glm is used and then contrasts(def$student) is used to work out which level is which. That is probably the better way to do it.
You shouldn’t work with dummy variables by hand at all; that is what contrasts are for. Work with contrasts[1] instead[2] of doing it all manually, as it will prevent you from making a mistake.
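For instance (my own sketch, not from the lecture): if a different baseline level is ever wanted, relevel() changes it without any hand-built dummies, applied here to a copy so the rest of the notes is unaffected.
stu <- relevel(def$student, ref = "Yes")  # work on a copy of the column
contrasts(stu)  # the dummy now codes "No" as 1, with "Yes" as the baseline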
If values are already numeric but represent categorical variables, it is necessary to ensure that they are of the factor class, not numeric/integer/double etc., because otherwise the model will perform sub-optimally. This can be tested by using class().[3]
If values are integers but need to be treated as discrete values, it is usually better to convert them into factors as well; otherwise models will simply assume they are continuous variables and perform sub-optimally. (It may be worth preserving two columns in a data frame, one factorised and one original: occasionally plots will complain about discrete data, and it is then easier to feed them the numeric version.)
First convert the data from factors to something else, just so we can see how to do that; the easiest in this case is character. (Be mindful that R has no string class, only the character class, which is distinct from C-like languages such as Java.) Again, this can be tested using class(). There is also typeof(), which returns the type of the object from R's own internal perspective, whereas class() reports it from the perspective of object-oriented programming.
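The conversion code itself does not appear in these notes; a sketch of what it presumably looked like (the columns are converted back to factors immediately below):
def$default <- as.character(def$default)  # factor -> character
def$student <- as.character(def$student)
class(def$default)   # "character"
typeof(def$default)  # also "character"
# for a factor, class() gives "factor" but typeof() gives "integer",
# because factors are stored internally as integer codes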
def$default <- as.factor(def$default)
def$student <- as.factor(def$student)
glm(formula = default ~ balance, data = def, family = "binomial")
##
## Call: glm(formula = default ~ balance, family = "binomial", data = def)
##
## Coefficients:
## (Intercept) balance
## -10.651331 0.005499
##
## Degrees of Freedom: 9999 Total (i.e. Null); 9998 Residual
## Null Deviance: 2921
## Residual Deviance: 1596 AIC: 1600
contrasts(def$default)
## Yes
## No 0
## Yes 1
contrasts(def$student)
## Yes
## No 0
## Yes 1
Generally it is more appropriate to map the output variable to colour, because colour is easier to see than shape and it is the output variable we are most concerned with. Using the structure from above, we map colour and shape to the default and student variables respectively.
predplot <- ggplot(data = def, aes(y = balance, x = income, col = default, shape = student)) +
geom_point(size = 2, alpha = 0.7) +
labs(y = TeX("Account Balance $(X_2)$"), x = TeX("Income $(X_3)$"), shape = TeX("Student $(X_1)$"), col = "Default (Y)",
title = "Account Balance against Income and Default") +
theme_classic()
predplot
#ggplotly(predplot) # This does not support latex!
It’s hard to see whether there are any defaults in the centre of the plot, so map the alpha level to default; let's also flip the plot so that the split occurs left/right, making it resemble our box plots. (This is done simply to help visualise the data: both axes are predictive features, so the orientation doesn't matter.)
ggplot(data = def, aes(x = balance, y = income, col = default, shape = student)) +
geom_point(size = 5, aes(alpha = default)) +
labs(x = TeX("Account Balance $(X_2)$"), y = TeX("Income $(X_3)$"), shape = TeX("Student $(X_1)$"), col = "Default (Y)",
title = "Account Balance against Income and Default") +
theme_classic()
## Warning: Using alpha for a discrete variable is not advised.
Now it can be clearly seen that defaults occur at a higher frequency when there is a higher account balance, and perhaps slightly more often for students than non-students (there are maybe very slightly fewer triangles on the right side than the left).
It seems to be totally independent of income.
It is necessary to observe the behaviour of the output variable in response to all the predictors; however, a scatterplot is inappropriate for this because it comes out almost useless.
A boxplot is the correct way to do this,[4] putting a categorical variable along the \(x\)-axis and the predictor along the \(y\)-axis; we can show two categorical variables at once in ggplot by mapping x = student and fill = default.
ggplot(data = def, aes(x = student, y = balance, fill = default)) + # predictor on the x-axis, output as the fill
geom_boxplot() +
theme_classic() +
labs(y = TeX("Balance ($X_2$)"), x = TeX("Student ($X_1$)") , fill = TeX("Default ($Y$)"),
title = TeX("Default given Status as student and Balance")) +
scale_fill_brewer(palette = "Pastel1")
From here it can clearly be seen that there are some outliers for balance, but that balance is a significant predictor of defaulting; it can also be seen that students are generally somewhat more likely to default, although that observation is not significant.
The textbook chose to visualise the boxplots as below. In my opinion this is less desirable, because the (lack of) effect of income is already clear from the preceding plots, and because student has a more significant impact than income, so it is the variable that should be considered.
balplot <- ggplot(data = def, aes(x = default, y = balance, fill = default)) +
geom_boxplot() +
labs(x = TeX("Default $(Y)$"), y = TeX("Balance $(X_2)$"), title = "Default given Account Balance") +
theme_classic2() +
guides(fill = FALSE) +
scale_fill_brewer(palette = "Pastel1")
incplot <- ggplot(data = def, aes(x = default, y = income, fill = default)) +
geom_boxplot() +
labs(x = TeX("Default $(Y)$"), y = TeX("Income $(X_3)$"), title = "Default given Income") +
theme_classic2() +
guides(fill = FALSE) +
scale_fill_brewer(palette = "Pastel1")
grid.arrange(balplot, incplot, nrow = 1)
plot(income ~ balance, data = def,
     main = "Credit Defaults",
     xlab = "Balance",
     ylab = "Income",
     pch = c(1, 2)[as.numeric(student)],   # circle/triangle by student
     col = c("IndianRed", "Royalblue")[as.numeric(default)],  # colour by default
     cex = 0.7)
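One possible addition (my own, not in the original notes): base graphics needs explicit legend() calls for the two mappings.
legend("topright", legend = levels(def$default), title = "Default",
       col = c("IndianRed", "Royalblue"), pch = 16)
legend("bottomright", legend = levels(def$student), title = "Student", pch = c(1, 2))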
In order to create a logistic model, use the glm function, which fits generalised linear models (logistic regression is just a linear function passed through the logistic link). Specify family = binomial with link = "logit"; a "probit" link is an alternative for when the separation of the data is distinct, where it can perform slightly better.
ModDef <- glm(formula = default ~ student + balance + income, family = binomial(link = "logit"), data = def)
contrasts(def$default)
## Yes
## No 0
## Yes 1
To find the probability, just use the predict command. Use a data frame rather than a list for the new observations, so that the format of your input, output and predicted data is all equivalent:
newdata <- data.frame(
  # "student" = as.factor(c("Yes", "Yes", "No", "Yes")), # fixed values would look like this
  # instead, create some random points:
  "student" = ifelse(sample(c(0, 1), size = 4, replace = TRUE) == 0, "No", "Yes"),
  "balance" = rnorm(n = 4, mean = 1500, sd = 500),
  "income" = rnorm(n = 4, mean = 50000, sd = 15000)
)
Now call the predict command and use cbind to combine the newdata with the predicted values. By default the predict command will output the value of the logit, i.e. the log-odds:[5]
\[
\verb|output| = \log\left( \frac{ \text{Pr}\left(X\right) }{1 - \text{Pr}\left(X\right) } \right)
\] Clearly the raw probability would be considerably more useful to us, hence specify the parameter type = "response" (lowercase; predict() will not match "Response"):
predVals <- predict(object = ModDef, newdata, type = "response")
cbind("default_prob" = predVals, newdata)
## default_prob student balance income
## 1 0.003597183 Yes 996.0236 58720.54
## 2 0.718389588 Yes 2131.6236 73921.83
## 3 0.106698601 Yes 1611.5783 48155.68
## 4 0.934770901 No 2334.8403 45366.55
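As a quick sanity check (a small sketch using base R's plogis(), not from the lecture), applying the logistic function to the default link-scale predictions recovers exactly these probabilities:
logodds <- predict(ModDef, newdata)  # default scale: log-odds
plogis(logodds)                      # exp(x) / (1 + exp(x)), same as type = "response"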
All of the fitted probabilities for the training data can be returned by not specifying the newdata parameter in the predict command; alternatively, the fitted.values can be taken straight from the model object. The fitted values represent the probabilities because, in a logistic model, the fitted position along the \(y\)-axis is the probability.
ModelPreds <- cbind(def, "Probability" = ModDef$fitted.values)
#ModelPreds <- cbind(def, "Probability" = predict(ModDef, type = 'response')) #This would do the same thing
To move from a probability to a decision, a threshold has to be specified:
threshold <- 0.5
ModelPreds$prediction <- rep_len(x ="No", length.out = nrow(ModelPreds))
ModelPreds$prediction[ModelPreds$Probability > threshold] <- "Yes"
ModelPreds$prediction <- as.factor(ModelPreds$prediction)
ModelPreds <- ModelPreds[,c(2,6,7,3,4,5)]
head(ModelPreds)
## default Probability prediction student balance income
## 1 No 0.0014287239 No No 729.5265 44361.625
## 2 No 0.0011222039 No Yes 817.1804 12106.135
## 3 No 0.0098122716 No No 1073.5492 31767.139
## 4 No 0.0004415893 No No 529.2506 35704.494
## 5 No 0.0019355062 No No 785.6559 38463.496
## 6 No 0.0019895182 No Yes 919.5885 7491.559
#Alternative approach
#threshold <- 0.5
#Predictions <- ifelse(ModelPreds$Probability < threshold, 0, 1)
#ModelPreds <- cbind(ModelPreds, "Prediction" = Predictions)
The misclassification matrix (or confusion matrix) tabulates the true and false positives and negatives. It can be a good idea to wrap this in a function, because later on it will be necessary to loop over thresholds to produce a ROC curve.
This is the harder way to do it, and the method Oliver taught us: count each cell of the matrix directly.
TruePos  <- sum(ModelPreds$prediction == "Yes" & ModelPreds$default == "Yes")
TrueNeg  <- sum(ModelPreds$prediction == "No"  & ModelPreds$default == "No")
FalsePos <- sum(ModelPreds$prediction == "Yes" & ModelPreds$default == "No")
FalseNeg <- sum(ModelPreds$prediction == "No"  & ModelPreds$default == "Yes")
c(TruePos, TrueNeg, FalsePos, FalseNeg)
A more elegant strategy is table(predictions, observations); that argument order is the convention I have seen in the textbook and online, and it also matches the confusionMatrix() function in the caret package. By convention the rows represent the predicted values and the columns represent the observed values (one page of the textbook prints it the other way around while generating it in code this way, which is a little confusing).
conf.mat <- table(ModelPreds$prediction, ModelPreds$default, dnn = c("Predicted", "Observed"))
print(conf.mat)
## Observed
## Predicted No Yes
## No 9627 228
## Yes 40 105
# It can be a little difficult to tell which dimension of `table` is the
# prediction if the (prediction, observation) order is mixed up, so
# `confusionMatrix` from the `caret` package is a useful cross-check.
conf.mat2 <- confusionMatrix(ModelPreds$prediction, ModelPreds$default)
print(conf.mat2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 9627 228
## Yes 40 105
##
## Accuracy : 0.9732
## 95% CI : (0.9698, 0.9763)
## No Information Rate : 0.9667
## P-Value [Acc > NIR] : 0.0001044
##
## Kappa : 0.4278
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9959
## Specificity : 0.3153
## Pos Pred Value : 0.9769
## Neg Pred Value : 0.7241
## Prevalence : 0.9667
## Detection Rate : 0.9627
## Detection Prevalence : 0.9855
## Balanced Accuracy : 0.6556
##
## 'Positive' Class : No
##
To determine the misclassification rate, note that in this case the diagonal entries are the correct predictions and the off-diagonal entries are the misclassifications:
misclass <- (40+228)/(9627+228+40+105)
print(paste("The misclassification rate is: ", percent(misclass)))
## [1] "The misclassification rate is: 2.68%"
In practice we would just take the mean of the misclassification indicator, because under the hood TRUE and FALSE correspond to 1/0 in R:
misclass <- mean(ModelPreds$default != ModelPreds$prediction)
print(paste("The misclassification rate is: ", percent(misclass)))
## [1] "The misclassification rate is: 2.68%"
Incorrect predictions lie on the off-diagonals. The false positives are the cases where the prediction is positive but the observation is negative, i.e. the second row, first column of the confusion matrix; the false negatives are the cases where the prediction is negative but the observation is positive, i.e. the first row, second column:
conf.mat
## Observed
## Predicted No Yes
## No 9627 228
## Yes 40 105
FPR <- conf.mat[2,1] / nrow(ModelPreds)  # false positives as a share of all observations
FNR <- conf.mat[1,2] / nrow(ModelPreds)  # false negatives as a share of all observations
Computed this way, the false positive share is 0.004 and the false negative share is 0.0228. Note that dividing by the total number of observations gives shares of the whole dataset, not the conventional rates, which divide by the observed column totals.
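For comparison, a quick sketch of the conventional rates (my own addition; these match the values the tpr() function below returns):
FPR_conv <- conf.mat[2, 1] / sum(conf.mat[, 1])  # FP / (FP + TN), about 0.0041
FNR_conv <- conf.mat[1, 2] / sum(conf.mat[, 2])  # FN / (FN + TP), about 0.6847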
The true positive rate is the number of true positives divided by the number of observed positives:
\[\begin{align} \textsf{TPR} &= \frac{TP}{Pos}\\ &= \frac{TP}{TP + FN}\\ &= 1 - FNR \end{align}\]
\[ \textit{Sensitivity} \text{ is } \textbf{TruePosRate}\\ \textit{Specificity} \text{ is } \textbf{TrueNegRate} \]
A ROC curve plots the TruePosRate (i.e. the Sensitivity) on the \(y\)-axis against the FalsePosRate (i.e. 1 - Specificity) on the \(x\)-axis.
First create a tpr function:
tpr <- function(model, observation, threshold, silent = TRUE){
  probability <- model$fitted.values
  prediction <- rep_len(x = "No", length.out = length(observation))
  prediction[probability > threshold] <- "Yes"
  # fix the levels so the table stays 2x2 even if only one class is predicted
  prediction <- factor(prediction, levels = levels(observation))
  conf.mat2 <- confusionMatrix(prediction, observation)
  if (!silent) {
    print(conf.mat2)
  }
  # caret treats the first factor level ("No") as the positive class
  tpr <- conf.mat2[["byClass"]][["Sensitivity"]]
  fpr <- 1 - conf.mat2[["byClass"]][["Specificity"]]  # 1 - TNR is the FPR
  return(c(tpr, fpr))
}
tpr(ModDef, def$default, threshold = 0.5, silent = FALSE)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 9627 228
## Yes 40 105
##
## Accuracy : 0.9732
## 95% CI : (0.9698, 0.9763)
## No Information Rate : 0.9667
## P-Value [Acc > NIR] : 0.0001044
##
## Kappa : 0.4278
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9959
## Specificity : 0.3153
## Pos Pred Value : 0.9769
## Neg Pred Value : 0.7241
## Prevalence : 0.9667
## Detection Rate : 0.9627
## Detection Prevalence : 0.9855
## Balanced Accuracy : 0.6556
##
## 'Positive' Class : No
##
## [1] 0.9958622 0.6846847
Now run a loop to try every threshold.
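A minimal sketch of that loop (my own, assuming the tpr() function above): sweep a grid of thresholds, collect the (TPR, FPR) pairs, then plot TPR against FPR.
thresholds <- seq(from = 0.01, to = 0.99, by = 0.01)
# each row of roc holds c(TPR, FPR) for one threshold
roc <- t(sapply(thresholds, function(th) tpr(ModDef, def$default, threshold = th, silent = TRUE)))
plot(roc[, 2], roc[, 1], type = "l",
     xlab = "False Positive Rate", ylab = "True Positive Rate", main = "ROC Curve")
abline(a = 0, b = 1, lty = 2)  # a random classifier lies along this diagonal
The same workflow can now be repeated on the Smarket data from the ISLR package: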
head(Smarket)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 2001 0.381 -0.192 -2.624 -1.055 5.010 1.1913 0.959 Up
## 2 2001 0.959 0.381 -0.192 -2.624 -1.055 1.2965 1.032 Up
## 3 2001 1.032 0.959 0.381 -0.192 -2.624 1.4112 -0.623 Down
## 4 2001 -0.623 1.032 0.959 0.381 -0.192 1.2760 0.614 Up
## 5 2001 0.614 -0.623 1.032 0.959 0.381 1.2057 0.213 Up
## 6 2001 0.213 0.614 -0.623 1.032 0.959 1.3491 1.392 Up
#Check that it's a factor
class(Smarket$Direction)
## [1] "factor"
# Make it a factor (it already is, so this is a harmless no-op)
Smarket$Direction <- as.factor(Smarket$Direction)
LogModSmarket <- glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial(link = "logit"))
Generate some new data:
n <- 4
maxval <- max(Smarket$Today)
minval <- min(Smarket$Today)
newdata <- data.frame(
Lag1 = sample(seq(from = minval, to = maxval, length.out = 1000), size = n),
Lag2 = sample(seq(from = minval, to = maxval, length.out = 1000), size = n),
Lag3 = sample(seq(from = minval, to = maxval, length.out = 1000), size = n),
Lag4 = sample(seq(from = minval, to = maxval, length.out = 1000), size = n),
Lag5 = sample(seq(from = minval, to = maxval, length.out = 1000), size = n),
Volume = sample(seq(
from = min(Smarket$Volume), to = max(Smarket$Volume), length.out = 1000)
, size = n)
)
Now feed this data into the model:
predicteddata <- predict(object = LogModSmarket, newdata, type = "response")
predicteddata <- data.frame(Probability = predicteddata, newdata )
In order to create the predictions, use contrasts to check how the dummy variables have been assigned.
contrasts(Smarket$Direction)
## Up
## Down 0
## Up 1
Hence Up is 1, so to create the prediction from the probability (assuming a 0.5 threshold):
predicteddata$Predictions <- "Down"
predicteddata$Predictions[predicteddata$Probability > 0.5] <- "Up"
# Don't forget that it needs to be a factor; do this conversion last,
# because ifelse() (used below) returns a character vector and would undo it
predicteddata$Predictions <- as.factor( predicteddata$Predictions)
predicteddata <- predicteddata[,c(8, 1:7)]
# ifelse() could also have been used, like this. This is neat because it
# applies if/else across the data without a loop; just remember that it
# returns values (a character vector here), so as.factor() would need to
# be applied again afterwards.
predicteddata$Predictions <- ifelse(predicteddata$Probability > 0.5, "Up", "Down")
predicteddata
## Predictions Probability Lag1 Lag2 Lag3 Lag4
## 1 Up 0.5013020 -0.8583814 3.3012282 3.4932102 -4.335388
## 2 Up 0.5845357 -0.8263844 -0.8477157 4.1331502 -3.492801
## 3 Down 0.4970459 -2.2235866 4.2398068 4.0478248 -3.716780
## 4 Down 0.4439511 0.8587908 3.3332252 0.6134805 -1.988942
## Lag5 Volume
## 1 -4.6446927 1.9040272
## 2 -0.9117097 2.7129957
## 3 -4.8580060 1.2630105
## 4 -1.6583063 0.9858898
First call upon the model (or predict) to get the modelled values of the training data. glm(Y ~ X)$fitted.values will always return the probabilities, because the fitted \(y\)-values of a logistic model are the probabilities.
# Taking fitted.values from the model always gives the probabilities
preds <- LogModSmarket$fitted.values
# If you want to use predict instead, don't specify newdata and it will by
# default score the training data; note the values will be log-odds
# unless you specify type = "response"
probs <- predict(LogModSmarket, type = 'response')
preds <- ifelse(probs > 0.5, "Up", "Down")
preds <- as.factor(preds)
The misclassification matrix may be made with either the table(predictions, observations) command or the confusionMatrix() command from the caret package:
conf.mat <- table(preds, Smarket$Direction, dnn = c("Predictions", "Observations"))
conf.mat
## Observations
## Predictions Down Up
## Down 145 141
## Up 457 507
# Using confusionMatrix; note that caret takes the first factor level
# ("Down") as the positive class by default, even though `contrasts`
# coded "Up" as 1
conf.obj <- confusionMatrix(data = preds, reference = Smarket$Direction )
conf.obj
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 145 141
## Up 457 507
##
## Accuracy : 0.5216
## 95% CI : (0.4935, 0.5496)
## No Information Rate : 0.5184
## P-Value [Acc > NIR] : 0.4216
##
## Kappa : 0.0237
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.2409
## Specificity : 0.7824
## Pos Pred Value : 0.5070
## Neg Pred Value : 0.5259
## Prevalence : 0.4816
## Detection Rate : 0.1160
## Detection Prevalence : 0.2288
## Balanced Accuracy : 0.5116
##
## 'Positive' Class : Down
##
The diagonals of the matrix represent the correct classifications and the off-diagonals represent the misclassifications, so the misclassification rate is:
(conf.mat[1,2] + conf.mat[2,1])/sum(conf.mat)
## [1] 0.4784
# This could also be obtained from the caret object:
1-conf.obj$overall["Accuracy"]
## Accuracy
## 0.4784
The false positive rate is:
\[\begin{align} \textsf{FPR}&= \frac{FP}{Neg}\\ &= \frac{FP}{FP + TN}\\ &= 1 - TNR \end{align}\]
If/when you forget that during the exam, remember you can just use the confusionMatrix() command from the caret package.
# Note: "Down" is the positive class here, when I would have preferred it
# the other way around; it makes the labels below a little confusing.
# From the confusion Matrix
conf.mat
## Observations
## Predictions Down Up
## Down 145 141
## Up 457 507
FPR <- (conf.mat[1,2])/(sum(conf.mat[,2]))
# Using `caret::confusionMatrix()` instead:
# Specificity is the TNR, and FPR is 1 - TNR, so use 1 - Specificity (below)
FPR
## [1] 0.2175926
1-conf.obj$byClass["Specificity"][1]
## Specificity
## 0.2175926
The false negative rate may be determined by swapping every occurrence of positive with negative and vice versa; hence the false negative rate is given by:
\[\begin{align} \textsf{FNR}&= \frac{FN}{Pos}\\ &= \frac{FN}{FN + TP}\\ &= 1 - TPR \end{align}\]
# Again, "Down" is the positive class here.
# It is the false *negatives* we want, so take the other off-diagonal cell
# and divide it by the other column total (the observed positives)
FNR <- (conf.mat[2,1])/(sum(conf.mat[,1]))
# Using `caret::confusionMatrix()` instead:
# FNR is 1 - TPR, so use Sensitivity (the TPR)
FNR
## [1] 0.7591362
1-conf.obj$byClass["Sensitivity"][1]
## Sensitivity
## 0.7591362
James, Gareth. 2017. An Introduction to Statistical Learning. Springer.
[1] DataMentor, StackOverflow and StackExchange.
[2] Refer to page 129 of ISL (James 2017).
[3] Refer to page 133, Equation 4.4 of ISL.