(Wk 1) Introduction to Data Science

Material of 5 March 2019, week 1

Types of Data

Data is classified as either structured or unstructured:

  • Structured Data
    • Quantitative/Numeric Data
      • Height, Weight, Salary etc.
    • Qualitative Data (also known as Categorical Data, Factors, or Discrete Variables)
      • Be careful: “factors” usually refers to the variables in a predictive model
      • When dealing with categorical data in R it is necessary to use a data type called factors, discussed below
        • Examples of categorical data include alive/dead, male/female, ethnicity, product code, hair colour etc.

Categorical Variables in R

In order to deal with discrete variables, R uses a data type called factors. To create factors the factor() command is used; within this command a vector containing the factor levels must be enclosed, e.g.

factor(c("Male", "Female"))
## [1] Male   Female
## Levels: Female Male
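
By default the levels are sorted alphabetically (which is why Female is listed before Male above); they can be specified explicitly instead:

months <- factor(c("Mar", "Jan", "Feb"), levels = c("Jan", "Feb", "Mar"))
levels(months)
## [1] "Jan" "Feb" "Mar"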

Categorical Variables and Regression

When performing a multiple linear regression with categorical data, the categorical data will be treated as a boolean 1/0 variable; essentially, choosing between different categories means choosing a different intercept for the regression (i.e. adding a constant value).

Under the hood, 1 corresponds to TRUE and 0 corresponds to FALSE.
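
As a minimal sketch of this encoding (the example data here is invented for illustration), model.matrix() shows the 0/1 dummy column that R builds for a two-level factor:

sex <- factor(c("Male", "Female", "Female", "Male"))
model.matrix(~ sex) # an intercept column plus a 0/1 "sexMale" dummy column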

Not as accurate

This is not as accurate as performing a linear regression separately on the data within each category, so it should only be done when there is a good reason, i.e. when the different categories would have a trend with the same rate but a different intercept, e.g. the ambient temperature of a location would have a different mean value for each month:

\[ \text{Response} = -74 + 3.109 \cdot X_{\text{wind}} - 1.875 \cdot X_{\text{temp}} + 14.76 \cdot \left( \text{Jan} \right) + 8.75 \cdot \left( \text{Feb} \right) + 4.197 \cdot \left( \text{Mar} \right) \]

In this case \(\text{Jan}\), \(\text{Feb}\) etc. would be True/False (\(\equiv\) 1/0) values indicating whether or not to include that constant in the equation (i.e. whether or not the observation falls in that month). So if, in this example, Sydney averages 23 degrees, January might be on average 14 degrees hotter (a coefficient of roughly \(+14\)), while the coefficient for July might be \(-14\) because July is 14 degrees cooler.
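
A hedged sketch of fitting such a model in R, on simulated data (all variable names and coefficient values here are assumptions for illustration):

set.seed(1)
month <- factor(rep(c("Jan", "Feb", "Mar"), each = 20))
wind  <- runif(60, 0, 20)
temp  <- rnorm(60, mean = 23, sd = 3)
response <- -74 + 3.109 * wind - 1.875 * temp +
  c(Jan = 14.76, Feb = 8.75, Mar = 4.197)[as.character(month)] + rnorm(60)

# lm() expands the month factor into 1/0 dummy variables automatically;
# each non-reference month gets its own intercept shift
model <- lm(response ~ wind + temp + month)
coef(model)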

Supervised vs Unsupervised Problems

Machine Learning problems are often split into two categories, supervised and unsupervised.

Supervised Learning involves data where each observational unit has one special variable of interest (e.g. survive/perish, amount spent).

Unsupervised Learning is about pattern discovery; there is no clear special variable (e.g. trying to detect unusual spending or grouping spenders into separate groups).

Supervised Learning

Supervised Learning has a response variable (also known as the outcome or, in calculus terms, the dependent variable, \(Y\)), and the idea is to predict the relationship between the output and several inputs:

Input factors create a function/model \(f\) that predicts output

In a classification problem \(Y\) takes the value of a discrete variable (e.g. survived/perished); in a regression problem \(Y\) takes a quantitative value in \(\mathbb{R}\) (e.g. \(\$72.45\) or \(55 \text{ kg}\)).

In mathematical terms, if \(y\) is the output and \(x_1, x_2, x_3, \dots\) the inputs, the model is the expected value of \(y\) given the inputs, denoted \(E(y)\) and defined by some function \(f\):

\[ E(y) = f(x_1, x_2, x_3, \dots) \]

The observed output is expected to vary in value owing to errors in measurement, so even where we have very strong mathematical evidence for a relationship, e.g. \(S = \frac{1}{2}at^2\), we would expect the observed values to contain error:

library(ggplot2)

# Simulate noisy observations of S = (1/2)at^2 with a = 9.81
n <- 100
x <- 1:n
y <- 0.5 * 9.81 * x^2 + rnorm(n, mean = 0, sd = 2000)

# Using base plot:
#   model <- lm(y ~ poly(x, degree = 2))
#   plot(y ~ x)
#   Expy <- predict(object = model)
#   lines(Expy)

# Using ggplot2
egdata <- data.frame(height = y, time = x) # create a data frame
ggplot(data = egdata, aes(x = time, y = height)) +
  geom_point(col = "#7f7caf", size = 3, alpha = 0.5) + # plot the data points
  stat_smooth(col = "#7d2e68", method = 'lm',
              formula = y ~ poly(x, 2, raw = TRUE), se = FALSE) + # draw the fitted quadratic
  theme_classic() # change the theme

# see https://coolors.co/app for colour themes

Choosing the right Model

It’s also necessary to choose the right type of model. For example, below, the function chosen on the left is a simple linear regression, but it may be more appropriate to assume that there is a seasonal or cyclical trend (a cyclical trend being less predictable, like a recession; a seasonal trend being more predictable, like the seasons).

Comparison of more complex and simpler models

Bias and Variance

The more complex a function, the more variance there is in fitting that function (because there is more uncertainty around the fitted parameters). However, fitting a simpler function can introduce more bias, because a simple model may be systematically unable to capture the true relationship, leaving a fundamental difference between the predicted and actual values.
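
A small sketch of the trade-off, fitting a too-simple and a too-complex polynomial to the same noisy data (the true curve and the degrees chosen are assumptions for illustration):

set.seed(1)
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.3)

simple  <- lm(y ~ x)            # high bias: a straight line cannot follow the curve
complex <- lm(y ~ poly(x, 12))  # high variance: the fit chases the noise

plot(x, y)
lines(x, predict(simple), col = "blue")
lines(x, predict(complex), col = "red")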

Model Evaluation

Estimating prediction accuracy from the same sample that was used to fit the function introduces an optimistic bias; two strategies are used to offset it (a cross validation sketch follows the list):

  • Splitting the Data
    • Use training data to create the model
    • Use validation data to validate the model accuracy and choose between candidate models
    • Use testing data to measure the accuracy of the final model’s predictions
  • Cross Validation
    • Cross validation only tests the modelling process, while splitting the data evaluates the final model
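
A minimal sketch of k-fold cross validation using only base R (the data set and model are assumptions for illustration):

set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

cv_errors <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ] # fit on all folds but one
  test  <- mtcars[folds == i, ] # evaluate on the held-out fold
  model <- lm(mpg ~ wt + hp, data = train)
  mean((test$mpg - predict(model, newdata = test))^2)
})

mean(cv_errors) # estimated mean squared prediction error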

Classification and Regression

Regression

When the output is a numeric variable, supervised learning can be referred to as regression. Some examples of regression (sketched in code after the list) are:

  • Linear Regression (Simple or Multiple)
  • Generalised Linear Models (glm)
  • Neural Networks
    • Neural Networks are an example of a non-parametric method; the above two rely on statistical assumptions
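
As a hedged sketch (using R’s built-in mtcars data, chosen here purely for illustration), the first two can be fit like so:

# Simple linear regression: mpg as a function of weight
lm(mpg ~ wt, data = mtcars)

# Multiple linear regression: several inputs
lm(mpg ~ wt + hp, data = mtcars)

# Generalised linear model; the gaussian family with an identity link
# is equivalent to ordinary linear regression
glm(mpg ~ wt, data = mtcars, family = gaussian)
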
Classification

When the output is a categorical variable, supervised learning is known as classification. Some examples (sketched in code after the list) are:

  • K-Nearest Neighbours
  • Generalised Linear Models (Logistic Regression)
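
A hedged sketch of both, using R’s built-in iris data (the variable choices are assumptions for illustration; knn() is from the class package):

library(class) # provides knn()

set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]

# K-Nearest Neighbours: label each test flower by its 5 nearest training flowers
pred <- knn(train, test, cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx]) # classification accuracy

# Logistic regression: a GLM with a binomial family for a binary outcome
is_virginica <- as.integer(iris$Species == "virginica")
glm(is_virginica ~ Petal.Length + Petal.Width, data = iris, family = binomial)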

Unsupervised Problems

When there is no output variable, the problem is usually one of unsupervised learning. A common example is clustering data:

Example of a Clustering Problem: in this example the goal would be to group the data according to the colours.
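
A minimal clustering sketch with k-means on simulated data (two invented groups, for illustration):

set.seed(1)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 3), ncol = 2))

clusters <- kmeans(pts, centers = 2) # no output variable: the groups are discovered
plot(pts, col = clusters$cluster, pch = 19) # colour points by assigned cluster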

In this unit we will cover:

  • Supervised learning:
    • Linear Models
    • Classification (KNN and Discrimination)
    • Classification and Regression Trees
  • Unsupervised Learning:
    • Dimension Reduction: Principal Component Analysis
    • Clustering: K Means and Hierarchical
  • Unstructured Data:
    • Text Mining
  • Resampling
  • Visualisation