Material of Tue 13 May 2019, week 11
The kmeans() function performs \(k\)-means clustering in R, so first create some clustered data:
set.seed(2)
x <- matrix(rnorm(50*2), ncol = 2)  # 50 observations in 2 dimensions
x[1:25, 1] <- x[1:25, 1] + 3        # shift X1 up for the first half
x[1:25, 2] <- x[1:25, 2] - 4        # shift X2 down for the first half
This shifts the first 25 observations, so that half of the data has \(X_1\) values 3 higher and \(X_2\) values 4 lower, creating two separated clusters.
Now, in order to perform \(k\)-means clustering with \(K=2\):
km.out <- kmeans(x, 2, nstart = 20)
The assignments of the 50 observations are contained in $cluster:
km.out$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
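Because the data were constructed with the first 25 rows shifted, the true grouping is known here (unlike in a genuine unsupervised problem), so as a sanity check the fit can be cross-tabulated against it:

table(Predicted = km.out$cluster, True = rep(c(1, 2), each = 25))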
This can be plotted:
library(tibble)   # provides as_tibble()
library(ggplot2)

colnames(x) <- c("X1", "X2")
x2ClustPred <- as_tibble(x)
x2ClustPred$K2Pred <- factor(km.out$cluster, levels = c(1, 2), ordered = FALSE)
ggplot(x2ClustPred, aes(x = X1, y = X2, col = K2Pred)) +
    geom_point(size = 4) +
    theme_bw() +
    labs(col = "Predicted\nClass", title = "Classification Prediction")
If the data had more than two dimensions, it would then be appropriate to use PCA to reduce them to two dimensions and plot the clusters on the first two principal components.
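A minimal sketch of that approach, reusing the same x as a stand-in for a hypothetical wider matrix: run prcomp() and plot the scores on the first two principal components, coloured by the fitted cluster.

pr <- prcomp(x, scale. = TRUE)       # PCA; x stands in for a wider matrix
pcScores <- as_tibble(pr$x[, 1:2])   # scores on the first two components
pcScores$Cluster <- factor(km.out$cluster)
ggplot(pcScores, aes(x = PC1, y = PC2, col = Cluster)) +
    geom_point(size = 4) +
    theme_bw() +
    labs(col = "Predicted\nClass", title = "Clusters on the First Two PCs")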
In this case we knew there would be two groups because we constructed the data that way; however, when ordinarily performing unsupervised learning, the number of groups wouldn't be known in advance.
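One common heuristic for choosing it is an elbow plot: fit k-means over a range of \(K\) and look for where the total within-cluster sum of squares stops dropping sharply (a sketch, assuming the same x as above):

wss <- sapply(1:8, function(k) kmeans(x, k, nstart = 20)$tot.withinss)
ggplot(data.frame(K = 1:8, WSS = wss), aes(x = K, y = WSS)) +
    geom_line() +
    geom_point(size = 2) +
    theme_bw() +
    labs(title = "Elbow Plot", y = "Total within-cluster SS")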
Trying three groups:
set.seed(4)
km3mod <- kmeans(x, 3, nstart = 20)
#km3mod
This can be plotted in the same way:
x3ClustPred <- as_tibble(x)
x3ClustPred$K3Pred <- factor(km3mod$cluster, levels = c(1, 2, 3), ordered = FALSE)
ggplot(x3ClustPred, aes(x = X1, y = X2, col = K3Pred)) +
    geom_point(size = 4) +
    theme_bw() +
    labs(col = "Predicted\nClass", title = "Classification Prediction")
In this case the algorithm assigns the points lying between the two true clusters to the third class.
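To quantify how the extra cluster changes the fit, the total within-cluster sums of squares of the two models can be compared; bear in mind this value always decreases as \(K\) grows, so a smaller value alone does not justify \(K = 3\):

km.out$tot.withinss  # the K = 2 fit from above
km3mod$tot.withinss  # the K = 3 fit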
To run kmeans() multiple times with different initial cluster assignments, the nstart parameter is used. If a value of nstart greater than one is used, then the clustering will be performed using multiple random assignments in step 1 of the K-means algorithm (p. 388) and the function will report only the best result. These results are more likely to correspond to the minimum total within-cluster sum of squares, whose global minimum is difficult to find.
So, for example, we can repeat the k-means fit and keep the clustering that minimises the total within-cluster sum of squares by choosing a sufficiently large nstart value:
set.seed(3)
km.out <- kmeans(x, 5, nstart = 1)
km.out$tot.withinss
## [1] 64.16453
km.out <- kmeans(x, 5, nstart = 50)
km.out$tot.withinss
## [1] 50.89555
In this case, observe that by restarting multiple times, a model with a considerably lower total within-cluster sum of squares was found. Note that even with nstart = 50 (here equal to the number of observations) there is no guarantee the global minimum has been reached: the starts are random, so a large nstart only makes a good local minimum more likely.
Remember to set the seed, because otherwise the initial cluster assignments cannot be replicated and the K-means results may not be fully reproducible.