For the iris, heart and groceries data sets:
iris Data Set
iris <- read.csv(file = "Datasets/iris.csv", header = TRUE, sep = ",")
In order to explore the variables use str(), summary() and head():
head(iris,4)
## Type PW PL SW SL
## 1 0 2 14 33 50
## 2 1 24 56 31 67
## 3 1 23 51 31 69
## 4 0 2 10 36 46
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Type: int 0 1 1 0 1 1 2 2 1 2 ...
## $ PW : int 2 24 23 2 20 19 13 16 17 14 ...
## $ PL : int 14 56 51 10 52 51 45 47 45 47 ...
## $ SW : int 33 31 31 36 30 27 28 33 25 32 ...
## $ SL : int 50 67 69 46 65 58 57 63 49 70 ...
summary(iris)
## Type PW PL SW
## Min. :0 Min. : 1.00 Min. :10.00 Min. :20.00
## 1st Qu.:0 1st Qu.: 3.00 1st Qu.:16.00 1st Qu.:28.00
## Median :1 Median :13.00 Median :44.00 Median :30.00
## Mean :1 Mean :11.93 Mean :37.79 Mean :30.55
## 3rd Qu.:2 3rd Qu.:18.00 3rd Qu.:51.00 3rd Qu.:33.00
## Max. :2 Max. :25.00 Max. :69.00 Max. :44.00
## SL
## Min. :43.00
## 1st Qu.:51.00
## Median :58.00
## Mean :58.45
## 3rd Qu.:64.00
## Max. :79.00
In this case the variables are:
Variable Name | Type of Data | Measurement |
---|---|---|
Type (presumably species) | Categorical | The type of flower (i.e. species) |
PW (petal width) | Quantitative | Linear Distance |
SW (sepal width) | Quantitative | Linear Distance |
PL (petal length) | Quantitative | Linear Distance |
SL (sepal length) | Quantitative | Linear Distance |
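Because Type is categorical but is read in as an integer, summary() reports a mean for it, which has no useful interpretation. A minimal sketch of recoding it as a factor follows. It assumes a small stand-in data frame, `iris_sample` (a hypothetical name), built from the first ten rows shown in the str() output rather than from the CSV file.

```r
# Stand-in data: the first ten rows printed by str(iris) above.
# `iris_sample` is a hypothetical name used only for this illustration.
iris_sample <- data.frame(
  Type = c(0L, 1L, 1L, 0L, 1L, 1L, 2L, 2L, 1L, 2L),
  PW   = c(2, 24, 23, 2, 20, 19, 13, 16, 17, 14),
  PL   = c(14, 56, 51, 10, 52, 51, 45, 47, 45, 47),
  SW   = c(33, 31, 31, 36, 30, 27, 28, 33, 25, 32),
  SL   = c(50, 67, 69, 46, 65, 58, 57, 63, 49, 70)
)

# As an integer, Type gets a numeric summary (min, mean, max, ...)
summary(iris_sample$Type)

# Recode as a factor so R treats the codes as category labels
iris_sample$Type <- factor(iris_sample$Type)

# A factor is summarised by counts per level instead
table(iris_sample$Type)
```

The same recoding could be applied to the full data frame with `iris$Type <- factor(iris$Type)`.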
A possible research question could be:
This research question would be an example of supervised learning because the plant species are known and the model can be trained using already-known output values.
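To illustrate what training on known output values can look like, here is a minimal sketch of a nearest-centroid classifier in base R. The choice of classifier is an assumption made for illustration, not a method taken from this document. The data are the first ten rows shown in the str() output, and `new_flower` is a hypothetical, made-up observation.

```r
# Stand-in training data: the first ten rows printed by str(iris) above.
sample_iris <- data.frame(
  Type = factor(c(0, 1, 1, 0, 1, 1, 2, 2, 1, 2)),
  PW   = c(2, 24, 23, 2, 20, 19, 13, 16, 17, 14),
  PL   = c(14, 56, 51, 10, 52, 51, 45, 47, 45, 47),
  SW   = c(33, 31, 31, 36, 30, 27, 28, 33, 25, 32),
  SL   = c(50, 67, 69, 46, 65, 58, 57, 63, 49, 70)
)
features <- c("PW", "PL", "SW", "SL")

# "Training": the mean of each measurement within each known Type
centroids <- aggregate(sample_iris[features],
                       by = list(Type = sample_iris$Type),
                       FUN = mean)

# Prediction: assign the Type whose centroid is closest (Euclidean distance)
predict_type <- function(obs) {
  d <- apply(centroids[features], 1,
             function(ctr) sqrt(sum((ctr - obs[features])^2)))
  as.character(centroids$Type[which.min(d)])
}

# Hypothetical new flower (values invented for this example)
new_flower <- c(PW = 3, PL = 15, SW = 34, SL = 51)
predict_type(new_flower)
```

The known Type labels are what make this supervised: they are needed to compute the centroids before any prediction can be made.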
heart dataset
heart <- read.csv(file = "Datasets/heart.csv", header = TRUE, sep = ",")
In order to explore the variables use str(), summary() and head():
head(heart,4)
## X Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope
## 1 1 63 1 typical 145 233 1 2 150 0 2.3 3
## 2 2 67 1 asymptomatic 160 286 0 2 108 1 1.5 2
## 3 3 67 1 asymptomatic 120 229 0 2 129 1 2.6 2
## 4 4 37 1 nonanginal 130 250 0 0 187 0 3.5 3
## Ca Thal AHD
## 1 0 fixed 0
## 2 3 normal 1
## 3 2 reversable 1
## 4 0 normal 0
str(heart)
## 'data.frame': 303 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 63 67 67 37 41 56 62 57 63 53 ...
## $ Sex : int 1 1 1 1 0 1 0 0 1 1 ...
## $ ChestPain: Factor w/ 4 levels "asymptomatic",..: 4 1 1 2 3 3 1 1 1 1 ...
## $ RestBP : int 145 160 120 130 130 120 140 120 130 140 ...
## $ Chol : int 233 286 229 250 204 236 268 354 254 203 ...
## $ Fbs : int 1 0 0 0 0 0 0 0 0 1 ...
## $ RestECG : int 2 2 2 0 2 0 2 0 2 2 ...
## $ MaxHR : int 150 108 129 187 172 178 160 163 147 155 ...
## $ ExAng : int 0 1 1 0 0 0 0 1 0 1 ...
## $ Oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ Slope : int 3 2 2 3 1 1 3 1 2 3 ...
## $ Ca : int 0 3 2 0 0 0 2 0 1 0 ...
## $ Thal : Factor w/ 3 levels "fixed","normal",..: 1 2 3 2 2 2 2 2 3 3 ...
## $ AHD : int 0 1 1 0 0 0 1 0 1 1 ...
summary(heart)
## X Age Sex ChestPain
## Min. : 1.0 Min. :29.00 Min. :0.0000 asymptomatic:144
## 1st Qu.: 76.5 1st Qu.:48.00 1st Qu.:0.0000 nonanginal : 86
## Median :152.0 Median :56.00 Median :1.0000 nontypical : 50
## Mean :152.0 Mean :54.44 Mean :0.6799 typical : 23
## 3rd Qu.:227.5 3rd Qu.:61.00 3rd Qu.:1.0000
## Max. :303.0 Max. :77.00 Max. :1.0000
##
## RestBP Chol Fbs RestECG
## Min. : 94.0 Min. :126.0 Min. :0.0000 Min. :0.0000
## 1st Qu.:120.0 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000
## Median :130.0 Median :241.0 Median :0.0000 Median :1.0000
## Mean :131.7 Mean :246.7 Mean :0.1485 Mean :0.9901
## 3rd Qu.:140.0 3rd Qu.:275.0 3rd Qu.:0.0000 3rd Qu.:2.0000
## Max. :200.0 Max. :564.0 Max. :1.0000 Max. :2.0000
##
## MaxHR ExAng Oldpeak Slope
## Min. : 71.0 Min. :0.0000 Min. :0.00 Min. :1.000
## 1st Qu.:133.5 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000
## Median :153.0 Median :0.0000 Median :0.80 Median :2.000
## Mean :149.6 Mean :0.3267 Mean :1.04 Mean :1.601
## 3rd Qu.:166.0 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000
## Max. :202.0 Max. :1.0000 Max. :6.20 Max. :3.000
##
## Ca Thal AHD
## Min. :0.0000 fixed : 18 Min. :0.0000
## 1st Qu.:0.0000 normal :166 1st Qu.:0.0000
## Median :0.0000 reversable:117 Median :0.0000
## Mean :0.6722 NA's : 2 Mean :0.4587
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :3.0000 Max. :1.0000
## NA's :4
In this case the variables are:
Variable Name | Type of Data | Measurement |
---|---|---|
X | Categorical (\(\mathbb{N}\)) | Presumably the observation number |
Age | Quantitative | The age of the individual |
Sex | Categorical | The individual's sex |
ChestPain | Categorical | A classification of the type of chest pain |
RestBP | Quantitative | A measurement of systolic blood pressure at rest |
Chol | Quantitative | Cholesterol levels |
Fbs | Categorical | An indicator of whether or not Fasting Blood Sugar is above a threshold |
RestECG | Categorical | An indicator of the type of ECG result |
MaxHR | Quantitative | A measurement of the maximum Heart Rate |
ExAng | Categorical | An indicator of whether or not this individual suffered exercise-induced angina |
Oldpeak | Quantitative | A measurement of ECG change induced by exercise |
Slope | Categorical | An indicator of the slope of the ST segment of an ECG graph |
Ca | Categorical (because it takes values in \(\mathbb{N}\)) | An indicator of how many of the three major blood vessels are revealed by fluoroscopy |
Thal | Categorical | A classification with three levels (fixed, normal, reversable), presumably the thallium stress test result |
AHD | Categorical | An indicator of whether or not the individual suffered Atherosclerotic Heart Disease |
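Several of the categorical variables above are stored as 0/1 integers, so summary() reports quartiles and means for them. A minimal sketch of recoding them as factors and counting missing values per column follows. It assumes a stand-in data frame, `heart_sample` (a hypothetical name), holding a few columns of the first ten rows shown in the str() output.

```r
# Stand-in data: selected columns of the first ten rows printed by str(heart).
# `heart_sample` is a hypothetical name used only for this illustration.
heart_sample <- data.frame(
  Sex   = c(1, 1, 1, 1, 0, 1, 0, 0, 1, 1),
  Fbs   = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 1),
  ExAng = c(0, 1, 1, 0, 0, 0, 0, 1, 0, 1),
  Ca    = c(0, 3, 2, 0, 0, 0, 2, 0, 1, 0),
  AHD   = c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1)
)

# Recode the 0/1 indicator columns as factors
indicator_cols <- c("Sex", "Fbs", "ExAng", "AHD")
heart_sample[indicator_cols] <- lapply(heart_sample[indicator_cols], factor)

# Factors are now summarised by counts per level
summary(heart_sample)

# Number of missing values in each column
colSums(is.na(heart_sample))
```

These ten rows contain no missing values. On the full data frame, `colSums(is.na(heart))` should report the NA's that summary() shows above for Ca (4) and Thal (2).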
A possible research question could be:
This research question would be an example of supervised learning because the incidence of AHD is known and the model can be trained using already-known output values.
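As a rough illustration of training on known AHD values, the sketch below fits a logistic regression with glm(). The choice of model and of MaxHR as the only predictor are assumptions made for this example, not choices taken from this document. The data are the first ten rows shown in the str() output, and `new_patient` is a hypothetical observation.

```r
# Stand-in data: MaxHR and AHD for the first ten rows printed by str(heart).
heart_sample <- data.frame(
  MaxHR = c(150, 108, 129, 187, 172, 178, 160, 163, 147, 155),
  AHD   = c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1)
)

# Fit a logistic regression of the known AHD outcome on MaxHR
fit <- glm(AHD ~ MaxHR, data = heart_sample, family = binomial)
coef(fit)

# Estimated probability of AHD for a hypothetical new patient
new_patient <- data.frame(MaxHR = 120)
p_new <- predict(fit, newdata = new_patient, type = "response")
p_new
```

With only ten rows the estimates are very unstable; the point is only that the known AHD column serves as the training target.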
groceries dataset
groceries <- read.csv(file = "Datasets/groceries.csv", header = TRUE, sep = ",")
In order to explore the variables use str(), summary() and head():
head(groceries[,1:5], 4)
## frankfurter sausage liver.loaf ham meat
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
#Restrict the columns of groceries to fit on the page
str(groceries[,1:5])
## 'data.frame': 9835 obs. of 5 variables:
## $ frankfurter: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sausage : int 0 0 0 0 0 0 0 0 0 0 ...
## $ liver.loaf : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ham : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meat : int 0 0 0 0 0 0 0 0 0 0 ...
summary(groceries[,1:5])
## frankfurter sausage liver.loaf ham
## Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.000000 Median :0.00000
## Mean :0.05897 Mean :0.09395 Mean :0.005084 Mean :0.02603
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.000000 Max. :1.00000
## meat
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.02583
## 3rd Qu.:0.00000
## Max. :1.00000
It is necessary to check whether each value is a 1/0 indicator or some other number; this can be checked with:
# Every cell must be exactly 0 or 1 for the columns to be indicators
if (all(as.matrix(groceries) %in% c(0, 1))) {
  print("The values are Boolean")
} else {
  print("The values could be categorical or continuous")
}
## [1] "The values are Boolean"
In this case the variables are:
Variable Name | Type of Data | Measurement |
---|---|---|
food item | categorical | whether or not the item needs to be purchased |
Each observation (row) could represent the groceries needed in a given week.
A possible research question could be:
This research question would be an example of unsupervised learning because the pattern of food consumption is not known and the algorithm must ‘learn’ what the patterns are.
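One simple way to look for such patterns without any known output is to count how often items appear together. The sketch below is an assumption about how that might be done in base R. `baskets` is a hypothetical toy data frame invented for this example (the rows of groceries printed above are all zero, so they would show nothing).

```r
# Hypothetical toy data: six rows, three 0/1 item columns (values invented)
baskets <- data.frame(
  sausage = c(1, 0, 1, 1, 0, 0),
  ham     = c(1, 0, 1, 0, 0, 0),
  meat    = c(0, 1, 0, 0, 1, 0)
)

# Proportion of rows containing each item; for 0/1 columns this is the
# same quantity as the Mean reported by summary()
support <- colMeans(baskets)
support

# Item-by-item co-occurrence counts: entry [i, j] is the number of rows
# in which items i and j both equal 1
co_occurrence <- crossprod(as.matrix(baskets))
co_occurrence
```

No column plays the role of a known answer here; the structure comes only from which items tend to occur in the same rows.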