Material of 5 March 2019, week 1
Data is classified as either structured or unstructured.
To deal with discrete variables, R uses a data type called factors. To create a factor, the factor()
command is used, enclosing a vector that contains the factor levels, e.g.
factor(c("Male", "Female"))
## [1] Male Female
## Levels: Female Male
When performing a multiple linear regression with categorical data, the categorical data will be treated as boolean 1/0 variables; choosing between different categories is essentially choosing a different intercept for the regression (i.e. adding a constant value). Under the hood, 1 corresponds to TRUE and 0 corresponds to FALSE.
This is not as accurate as running a separate linear regression on the data within each category, so it should only be used when there is good reason to believe the categories share a trend with the same rate but a different intercept; e.g. the ambient temperature of a location would follow the same trend but have a different mean value for each month:
\[ \text{Response} = -74 + 3.109 \cdot X_{\text{wind}} - 1.875 \cdot X_{\text{temp}} + 14.76 \cdot \left( \text{Jan} \right) + 8.75 \cdot \left( \text{Feb} \right) + 4.197 \cdot \left( \text{Mar} \right) \]
In this case \(\text{Jan}\), \(\text{Feb}\), etc. would be True/False (\(\equiv\) 1/0) values indicating whether or not to include that constant in the equation (because the observation is or isn't in that month). So if, in this example, Sydney averages 23 degrees, January might be on average 14 degrees hotter (a coefficient of 14), and the coefficient for July might be -14 because in July it is 14 degrees cooler.
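A minimal sketch of how this looks in R; the wind and month data below are simulated to loosely mimic the equation above (the coefficient values are illustrative, not the actual dataset):
set.seed(1)
month <- factor(rep(c("Jan", "Feb", "Mar"), each = 20)) # categorical variable
wind  <- runif(60, 2, 12)
# simulated response: one shared slope for wind, a different intercept per month
resp  <- -74 + 3.109 * wind +
  c(Jan = 14.76, Feb = 8.75, Mar = 4.197)[as.character(month)] + rnorm(60)
fit   <- lm(resp ~ wind + month)
coef(fit)               # one slope for wind, plus an intercept shift per non-baseline month
head(model.matrix(fit)) # the 0/1 dummy columns R creates for the factor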
Machine Learning problems are often split into two categories, supervised and unsupervised.
Supervised Learning involves data where each observational unit has one special variable (e.g. survive/perish, amount spent).
Unsupervised Learning is about pattern discovery; there is no clear special variable (e.g. trying to detect unusual spending, or grouping spenders into separate groups).
Supervised Learning has a response variable (also known as the outcome, or in calculus terms the dependent variable, \(Y\)), and the idea is to predict the relationship between the output and several inputs:
In a classification problem \(Y\) takes the value of a discrete variable (e.g. survived/perished); in a regression problem \(Y\) takes a quantitative value in \(\mathbb{R}\) (e.g. \(\$72.45\) or \(55 \text{ kg}\)).
In mathematical terms, if \(y\) is the output and \(x_1, x_2, x_3, \dots\) are the inputs, the model is the expected value of \(y\) given the inputs, denoted \(E(y)\) and defined by some function \(f\):
\[ E(y) = f(x_1, x_2, x_3, \dots) \]
The observed output is expected to vary in value owing to errors in measurement (so, for example, even though we have very strong mathematical evidence for a relationship such as \(S = \frac{1}{2}at^2\), we would expect the observed values to contain error):
library(ggplot2) # needed for ggplot()

n <- 100
x <- 1:n
y <- 0.5*9.81*x^2 + rnorm(n, mean = 0, sd = 2000) # S = 1/2 a t^2 plus measurement error

# Using base plot:
# model <- lm(y ~ poly(x, degree = 2))
# plot(y ~ x)
# lines(predict(object = model))

# Using ggplot2
egdata <- data.frame(height = y, time = x) # create a data frame
ggplot(data = egdata, aes(x = time, y = height)) + # call ggplot2
  geom_point(col = "#7f7caf", size = 3, alpha = 0.5) + # plot the observed points
  stat_smooth(col = "#7d2e68", method = 'lm', formula = y ~ poly(x, 2, raw = TRUE), se = FALSE) + # draw the fitted quadratic
  theme_classic() # change the theme
# use https://coolors.co/app for colour themes
It's also necessary to choose the right type of model: for example, the function chosen might be a simple linear regression, but it may be more appropriate to assume there is a seasonal or cyclical trend (a cyclical trend being less predictable, like a recession, and a seasonal trend being more predictable, like the seasons).
The more complex a function, the more variance in fitting it (because there is more uncertainty around the fitted parameters). However, fitting a simpler function can introduce more bias, because there may be a systematic difference between the predicted and actual values.
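A rough illustration, reusing the simulated free-fall data from the chunk above (the degree-10 polynomial is an arbitrary stand-in for "too complex"):
fits <- list(
  linear    = lm(y ~ x),          # too simple: systematically biased
  quadratic = lm(y ~ poly(x, 2)), # matches the true quadratic law
  degree10  = lm(y ~ poly(x, 10)) # too flexible: starts fitting the noise
)
sapply(fits, function(m) summary(m)$sigma) # in-sample residual error shrinks as complexity grows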
Prediction accuracy is estimated from the same sample that was used to fit the function; two strategies are used to offset the bias that this would introduce:
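As a sketch of one common strategy of this kind (assuming a hold-out, i.e. train/test split, counts among them), reusing the simulated egdata from above:
set.seed(2)
idx   <- sample(n, n / 2)  # random half of the rows for training
train <- egdata[idx, ]
test  <- egdata[-idx, ]
fit   <- lm(height ~ poly(time, 2, raw = TRUE), data = train)
sqrt(mean((test$height - predict(fit, newdata = test))^2)) # RMSE on unseen data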
When the output is a numeric variable, supervised learning can be referred to as regression; one example is a model fitted with glm(). When the output is a categorical variable, supervised learning is known as classification.
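A minimal classification sketch using glm() with a logistic (binomial) family; the survive/perish data below is simulated purely for illustration:
set.seed(42)
age      <- runif(200, 18, 80)
survived <- rbinom(200, 1, plogis(3 - 0.06 * age)) # survival probability falls with age
fit <- glm(survived ~ age, family = binomial)      # logistic regression classifier
predict(fit, newdata = data.frame(age = c(25, 70)), type = "response") # predicted probabilities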
When there is no output variable the problem is usually one of Unsupervised Learning; a common example is clustering data:
In this example the goal would be to classify the data according to the colours.
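A minimal clustering sketch using kmeans(); the two point clouds below are simulated to stand in for the plotted groups:
set.seed(7)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),  # first cloud of points
             matrix(rnorm(100, mean = 4), ncol = 2))  # second, shifted cloud
km  <- kmeans(pts, centers = 2)       # discover 2 groups with no labels given
plot(pts, col = km$cluster, pch = 19) # colour points by discovered cluster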
In this unit we will cover: