1_Exercises (Wk 1 - Introduction to Data Science)

Exercises (Part 1)

For each of the iris, heart and groceries data sets:

  1. Explore the variables
  2. List the variables and classify as quantitative or qualitative
  3. Provide a Research Question and identify a target variable
  4. Identify whether this would exemplify supervised or unsupervised learning.

iris Data Set

iris <- read.csv(file = "Datasets/iris.csv", header = TRUE, sep = ",")

(1) Explore the Variables

In order to explore the variables, use str(), summary() and head():

head(iris,4)
##   Type PW PL SW SL
## 1    0  2 14 33 50
## 2    1 24 56 31 67
## 3    1 23 51 31 69
## 4    0  2 10 36 46
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Type: int  0 1 1 0 1 1 2 2 1 2 ...
##  $ PW  : int  2 24 23 2 20 19 13 16 17 14 ...
##  $ PL  : int  14 56 51 10 52 51 45 47 45 47 ...
##  $ SW  : int  33 31 31 36 30 27 28 33 25 32 ...
##  $ SL  : int  50 67 69 46 65 58 57 63 49 70 ...
summary(iris)
##       Type         PW              PL              SW       
##  Min.   :0   Min.   : 1.00   Min.   :10.00   Min.   :20.00  
##  1st Qu.:0   1st Qu.: 3.00   1st Qu.:16.00   1st Qu.:28.00  
##  Median :1   Median :13.00   Median :44.00   Median :30.00  
##  Mean   :1   Mean   :11.93   Mean   :37.79   Mean   :30.55  
##  3rd Qu.:2   3rd Qu.:18.00   3rd Qu.:51.00   3rd Qu.:33.00  
##  Max.   :2   Max.   :25.00   Max.   :69.00   Max.   :44.00  
##        SL       
##  Min.   :43.00  
##  1st Qu.:51.00  
##  Median :58.00  
##  Mean   :58.45  
##  3rd Qu.:64.00  
##  Max.   :79.00

(2) List the Variables

In this case the variables are:

Variable Name               Type of Data   Measurement
Type (presumably species)   Categorical    The type of flower (i.e. species)
PW (petal width)            Quantitative   Linear distance
SW (sepal width)            Quantitative   Linear distance
PL (petal length)           Quantitative   Linear distance
SL (sepal length)           Quantitative   Linear distance
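
The classification above can be checked in R by storing the coded Type column as a factor, after which R itself distinguishes it from the numeric measurements. This is a minimal sketch that re-enters only the four rows printed by head(iris, 4); the names iris_head and var_kind are hypothetical and not part of the original exercise.

```r
# The four rows printed by head(iris, 4), re-entered by hand
iris_head <- data.frame(
  Type = c(0, 1, 1, 0),
  PW   = c(2, 24, 23, 2),
  PL   = c(14, 56, 51, 10),
  SW   = c(33, 31, 31, 36),
  SL   = c(50, 67, 69, 46)
)

# Type is a code for the kind of flower, not a measurement, so store it as a factor
iris_head$Type <- factor(iris_head$Type)

# Label each column according to how R now stores it
var_kind <- sapply(iris_head, function(col) {
  if (is.numeric(col)) "quantitative" else "qualitative"
})
print(var_kind)
```

The same conversion on the full data frame would make summary() report a count per type instead of a mean of the codes.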

(3) Provide a Research Question

A possible research question could be:

  • Can plant species be predicted from the Sepal Length and Petal Width?

(4) Is this Supervised or Unsupervised Learning

This research question would be an example of supervised learning because the plant species are known and the model can be trained using already-known output values.
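
To illustrate what training on already-known output values looks like, the following sketch fits a nearest-centroid classifier. This is a deliberately simple supervised method chosen here for illustration; the exercise does not prescribe a model. It uses only the ten values of Type, PW and SL visible in the str(iris) output, and the names train, centroids and predict_type are hypothetical.

```r
# First ten values of Type, PW and SL, copied from the str(iris) output
train <- data.frame(
  Type = factor(c(0, 1, 1, 0, 1, 1, 2, 2, 1, 2)),
  PW   = c(2, 24, 23, 2, 20, 19, 13, 16, 17, 14),
  SL   = c(50, 67, 69, 46, 65, 58, 57, 63, 49, 70)
)

# Training step: mean SL and PW for each known Type (one centroid per class)
centroids <- aggregate(cbind(SL, PW) ~ Type, data = train, FUN = mean)

# Prediction step: assign a new flower to the class with the closest centroid
predict_type <- function(sl, pw) {
  d2 <- (centroids$SL - sl)^2 + (centroids$PW - pw)^2
  as.character(centroids$Type[which.min(d2)])
}

print(centroids)
print(predict_type(sl = 48, pw = 3))
```

Ten rows are far too few to judge accuracy. With the full data set, some rows would be held back so that predictions could be compared against their known Type.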

heart Data Set

heart <- read.csv(file = "Datasets/heart.csv", header = TRUE, sep = ",")

(1) Explore the Variables

In order to explore the variables, use str(), summary() and head():

head(heart,4)
##   X Age Sex    ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope
## 1 1  63   1      typical    145  233   1       2   150     0     2.3     3
## 2 2  67   1 asymptomatic    160  286   0       2   108     1     1.5     2
## 3 3  67   1 asymptomatic    120  229   0       2   129     1     2.6     2
## 4 4  37   1   nonanginal    130  250   0       0   187     0     3.5     3
##   Ca       Thal AHD
## 1  0      fixed   0
## 2  3     normal   1
## 3  2 reversable   1
## 4  0     normal   0
str(heart)
## 'data.frame':    303 obs. of  15 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age      : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ Sex      : int  1 1 1 1 0 1 0 0 1 1 ...
##  $ ChestPain: Factor w/ 4 levels "asymptomatic",..: 4 1 1 2 3 3 1 1 1 1 ...
##  $ RestBP   : int  145 160 120 130 130 120 140 120 130 140 ...
##  $ Chol     : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ Fbs      : int  1 0 0 0 0 0 0 0 0 1 ...
##  $ RestECG  : int  2 2 2 0 2 0 2 0 2 2 ...
##  $ MaxHR    : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ ExAng    : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ Oldpeak  : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ Slope    : int  3 2 2 3 1 1 3 1 2 3 ...
##  $ Ca       : int  0 3 2 0 0 0 2 0 1 0 ...
##  $ Thal     : Factor w/ 3 levels "fixed","normal",..: 1 2 3 2 2 2 2 2 3 3 ...
##  $ AHD      : int  0 1 1 0 0 0 1 0 1 1 ...
summary(heart)
##        X              Age             Sex                ChestPain  
##  Min.   :  1.0   Min.   :29.00   Min.   :0.0000   asymptomatic:144  
##  1st Qu.: 76.5   1st Qu.:48.00   1st Qu.:0.0000   nonanginal  : 86  
##  Median :152.0   Median :56.00   Median :1.0000   nontypical  : 50  
##  Mean   :152.0   Mean   :54.44   Mean   :0.6799   typical     : 23  
##  3rd Qu.:227.5   3rd Qu.:61.00   3rd Qu.:1.0000                     
##  Max.   :303.0   Max.   :77.00   Max.   :1.0000                     
##                                                                     
##      RestBP           Chol            Fbs            RestECG      
##  Min.   : 94.0   Min.   :126.0   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:120.0   1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :130.0   Median :241.0   Median :0.0000   Median :1.0000  
##  Mean   :131.7   Mean   :246.7   Mean   :0.1485   Mean   :0.9901  
##  3rd Qu.:140.0   3rd Qu.:275.0   3rd Qu.:0.0000   3rd Qu.:2.0000  
##  Max.   :200.0   Max.   :564.0   Max.   :1.0000   Max.   :2.0000  
##                                                                   
##      MaxHR           ExAng           Oldpeak         Slope      
##  Min.   : 71.0   Min.   :0.0000   Min.   :0.00   Min.   :1.000  
##  1st Qu.:133.5   1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000  
##  Median :153.0   Median :0.0000   Median :0.80   Median :2.000  
##  Mean   :149.6   Mean   :0.3267   Mean   :1.04   Mean   :1.601  
##  3rd Qu.:166.0   3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000  
##  Max.   :202.0   Max.   :1.0000   Max.   :6.20   Max.   :3.000  
##                                                                 
##        Ca                 Thal          AHD        
##  Min.   :0.0000   fixed     : 18   Min.   :0.0000  
##  1st Qu.:0.0000   normal    :166   1st Qu.:0.0000  
##  Median :0.0000   reversable:117   Median :0.0000  
##  Mean   :0.6722   NA's      :  2   Mean   :0.4587  
##  3rd Qu.:1.0000                    3rd Qu.:1.0000  
##  Max.   :3.0000                    Max.   :1.0000  
##  NA's   :4

(2) List the Variables

In this case the variables are:

Variable Name   Type of Data                                   Measurement
X               Categorical (\(\mathbb{N}\))                   Presumably the observation number
Age             Quantitative                                   The age of the individual
Sex             Categorical                                    The individual's gender
ChestPain       Categorical                                    A classification of the type of chest pain
RestBP          Quantitative                                   A measurement of systolic blood pressure at rest
Chol            Quantitative                                   Cholesterol level
Fbs             Categorical                                    An indicator of whether or not fasting blood sugar is above a threshold
RestECG         Categorical                                    An indicator of the type of ECG result
MaxHR           Quantitative                                   A measurement of the maximum heart rate
ExAng           Categorical                                    An indicator of whether or not the individual suffered exercise-induced angina
Oldpeak         Quantitative                                   A measurement of ECG change induced by exercise
Slope           Categorical                                    An indicator of the slope of the ST segment of an ECG graph
Ca              Categorical (takes values in \(\mathbb{N}\))   An indicator of how many of the three major blood vessels are revealed by fluoroscopy
Thal            Categorical                                    Presumably the result of a thallium stress test (fixed, normal or reversable)
AHD             Categorical                                    An indicator of whether or not the individual suffered Atherosclerotic Heart Disease
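
Because read.csv() loads the 0/1 indicators as integers, summary(heart) reports values such as a mean of 0.6799 for Sex, which is only the proportion of rows coded 1. A minimal sketch of recoding such columns as factors follows. It uses the first ten values shown by str(heart); heart_head, indicator_cols and sex_by_ahd are hypothetical names, and which sex each code stands for is not stated in the source.

```r
# First ten values of four columns, copied from the str(heart) output
heart_head <- data.frame(
  Age = c(63, 67, 67, 37, 41, 56, 62, 57, 63, 53),
  Sex = c(1, 1, 1, 1, 0, 1, 0, 0, 1, 1),
  Fbs = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 1),
  AHD = c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1)
)

# Convert the 0/1 indicator columns to factors; Age stays numeric
indicator_cols <- c("Sex", "Fbs", "AHD")
heart_head[indicator_cols] <- lapply(heart_head[indicator_cols], factor)

# summary() now reports a count per level for the factor columns
print(summary(heart_head))

# Cross-tabulate two categorical variables
sex_by_ahd <- table(Sex = heart_head$Sex, AHD = heart_head$AHD)
print(sex_by_ahd)
```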

(3) Provide a Research Question

A possible research question could be:

  • Does MaxHR predict Atherosclerotic Heart Disease independently of age?

(4) Is this Supervised or Unsupervised Learning

This research question would be an example of supervised learning because the incidence of AHD is known and the model can be trained using already-known output values.
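
One way to approach this question, assuming a logistic regression is acceptable (the exercise does not prescribe a model), is to regress AHD on MaxHR with Age as a second predictor. The MaxHR coefficient is then adjusted for age. The sketch below uses only the ten rows visible in the str(heart) output; heart_rows and fit are hypothetical names.

```r
# First ten values of Age, MaxHR and AHD, copied from the str(heart) output
heart_rows <- data.frame(
  Age   = c(63, 67, 67, 37, 41, 56, 62, 57, 63, 53),
  MaxHR = c(150, 108, 129, 187, 172, 178, 160, 163, 147, 155),
  AHD   = c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1)
)

# Logistic regression: AHD is the target, MaxHR the predictor of interest,
# and Age is included so that the MaxHR coefficient is adjusted for age
fit <- glm(AHD ~ MaxHR + Age, data = heart_rows, family = binomial)

# The MaxHR row holds the age-adjusted estimate and its standard error
print(coef(summary(fit)))
```

With only ten observations the estimates are far too uncertain to interpret; the same call on the full heart data frame would be the starting point for the actual analysis.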

groceries Data Set

groceries <- read.csv(file = "Datasets/groceries.csv", header = TRUE, sep = ",")

(1) Explore the Variables

In order to explore the variables, use str(), summary() and head():

head(groceries[,1:5], 4)
##   frankfurter sausage liver.loaf ham meat
## 1           0       0          0   0    0
## 2           0       0          0   0    0
## 3           0       0          0   0    0
## 4           0       0          0   0    0
  #Restrict the columns of groceries to fit on the page
str(groceries[,1:5])
## 'data.frame':    9835 obs. of  5 variables:
##  $ frankfurter: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sausage    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ liver.loaf : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ham        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meat       : int  0 0 0 0 0 0 0 0 0 0 ...
summary(groceries[,1:5])
##   frankfurter         sausage          liver.loaf            ham         
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.000000   Median :0.00000  
##  Mean   :0.05897   Mean   :0.09395   Mean   :0.005084   Mean   :0.02603  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00000  
##       meat        
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.02583  
##  3rd Qu.:0.00000  
##  Max.   :1.00000

(2) List the Variables

It is necessary to check whether every value is 0 or 1, or whether some columns hold other numbers. This can be checked with:

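# Every cell must be exactly 0 or 1; an NA or any other value fails the check
if (all(as.matrix(groceries) %in% c(0, 1))) {
  print("The values are Boolean")
} else {
  print("The values could be categorical or continuous")
}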
## [1] "The values are Boolean"

In this case the variables are:

Variable Name   Type of Data   Measurement
food item       Categorical    Whether or not the item needs to be purchased

Each observation (row) could represent the groceries needed in a given week.

(3) Provide a Research Question

A possible research question could be:

  • Are certain food items more common during holiday periods?
    • For example, is consumption of processed meats more common? This could be a public health enquiry.

(4) Is this Supervised or Unsupervised Learning

This research question would be an example of unsupervised learning because the pattern of food consumption is not known and the algorithm must ‘learn’ what the patterns are.
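
As a rough illustration of learning patterns without a target column, item frequencies and pairwise co-occurrence counts can be computed directly from a 0/1 table. The data frame toy below is hypothetical and its values are invented for illustration; only the column names are borrowed from the groceries output. The names item_freq and co_occurrence are likewise assumptions.

```r
# Hypothetical 0/1 table: 6 rows, 3 items (values invented for illustration)
toy <- data.frame(
  sausage     = c(1, 0, 1, 1, 0, 0),
  frankfurter = c(1, 0, 0, 1, 0, 0),
  ham         = c(0, 1, 0, 0, 0, 1)
)

# Share of rows containing each item; for 0/1 columns this equals
# the Mean that summary() reports
item_freq <- colMeans(toy)

# Co-occurrence counts: entry [i, j] is the number of rows containing
# both item i and item j (the diagonal counts each item on its own)
co_occurrence <- crossprod(as.matrix(toy))

print(item_freq)
print(co_occurrence)
```

No column is treated as an output here; large off-diagonal counts simply flag items that tend to appear together, which is the kind of pattern an unsupervised method would look for.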