(Wk 11) Introduction to Data Science

Material of Tue 13 May 2019, week 11

K-Means Clustering

Two Groups

The kmeans() function performs \(k\)-means clustering in R, so first create some clustered data:

set.seed(2)
x <- matrix(rnorm(50*2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

This shifts the first 25 observations, so for half the data the mean of \(X_1\) is 3 higher and the mean of \(X_2\) is 4 lower, creating two separated groups.
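The shift can be verified by comparing the column means of the two halves of the matrix:

```r
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

# Means of the shifted first half vs the unshifted second half
colMeans(x[1:25, ])   # roughly (3, -4)
colMeans(x[26:50, ])  # roughly (0, 0)
```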

Now, to perform \(k\)-means clustering with \(K = 2\):

km.out <- kmeans(x, 2, nstart = 20)

The assignments of the 50 observations are contained in $cluster:

km.out$cluster
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
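A quick way to summarise the assignments is table(), which counts observations per cluster; the fitted centroids are in $centers:

```r
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4
km.out <- kmeans(x, 2, nstart = 20)

# Cluster sizes (here a clean 25/25 split) and centroids
table(km.out$cluster)
km.out$centers
```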

This can be plotted:

library(tibble)   # for as_tibble()
library(ggplot2)

colnames(x) <- c("X1", "X2")
x2ClustPred <- as_tibble(x)
x2ClustPred$K2Pred <- factor(km.out$cluster, levels = c(1, 2), ordered = FALSE)

ggplot(x2ClustPred, aes(x = X1, y = X2, col = K2Pred)) +
  geom_point(size = 4) +
  theme_bw() +
  labs(col = "Predicted\nClass", title = "Classification Prediction")

If there were more dimensions, it would then be appropriate to use PCA to reduce it to 2 dimensions and then plot the clusters like that.
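As a sketch of that workflow (using a made-up 5-dimensional matrix xHigh, not data from this section): cluster in the full space, then project the observations onto the first two principal components for plotting.

```r
# Hypothetical higher-dimensional data
set.seed(2)
xHigh <- matrix(rnorm(50 * 5), ncol = 5)

# Cluster in the full 5-dimensional space
kmHigh <- kmeans(xHigh, 2, nstart = 20)

# Reduce to 2 dimensions with PCA for plotting
pc <- prcomp(xHigh, scale. = TRUE)
scores <- as.data.frame(pc$x[, 1:2])   # first two principal components
scores$cluster <- factor(kmHigh$cluster)

# plot(scores$PC1, scores$PC2, col = scores$cluster)
head(scores)
```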

Three Groups

In this case we knew there would be two groups because the data was created that way; ordinarily, when performing unsupervised learning, the true number of groups wouldn't be known.

Trying three groups:

set.seed(4)
km3mod <- kmeans(x, 3, nstart = 20)
#km3mod

This can be plotted by performing:

colnames(x) <- c("X1", "X2")
x3ClustPred <- as_tibble(x)
x3ClustPred$K3Pred <- factor(km3mod$cluster, levels = c(1, 2, 3), ordered = FALSE)

ggplot(x3ClustPred, aes(x = X1, y = X2, col = K3Pred)) +
  geom_point(size = 4) +
  theme_bw() +
  labs(col = "Predicted\nClass", title = "Classification Prediction")

In this case the algorithm places a third centroid between the two true groups, carving some of the observations into an extra class.
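The fitted centroids and cluster sizes make this easy to inspect:

```r
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

set.seed(4)
km3mod <- kmeans(x, 3, nstart = 20)

# Three centroids and how many observations each one claims
km3mod$centers
km3mod$size
```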

Best Cluster Size

To run kmeans() multiple times with different initial cluster assignments, the nstart parameter is used.

If a value of nstart greater than one is used, the clustering will be performed using multiple random assignments in step 1 of the K-means algorithm (p. 388), and the function will report only the best run, i.e. the one with the lowest total within-cluster sum of squares (the objective that K-means minimises). This result is more likely to be close to the global minimum of the objective, which is otherwise difficult to find.

So, for example, we can repeat the K-means fit many times and keep the run that best minimises the objective by choosing a sufficiently large nstart value:

set.seed(3)
km.out <- kmeans(x, 5, nstart = 1)
km.out$tot.withinss
## [1] 64.16453
km.out <- kmeans(x, 5, nstart = 50)
km.out$tot.withinss
## [1] 50.89555

In this case, observe that by restarting multiple times a model with a lower total within-cluster sum of squares was found. Note that even a large nstart does not guarantee the global minimum is reached; it only makes a good solution more likely, since each restart is still a random initial assignment rather than an exhaustive search.
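The same tot.withinss value can also be used to compare different choices of \(K\): computing it for a range of cluster counts and plotting the result gives the common "elbow" plot. This is a sketch, not from the lecture:

```r
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

# Total within-cluster sum of squares for K = 1..8
set.seed(3)
wss <- sapply(1:8, function(k) kmeans(x, k, nstart = 20)$tot.withinss)

# plot(1:8, wss, type = "b", xlab = "K",
#      ylab = "Total within-cluster sum of squares")
wss
```

The objective always decreases as \(K\) grows, so the "best" size is read off as the point where the curve bends rather than the minimum.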

Remember to set the seed, because otherwise the initial cluster assignments cannot be replicated and the K-means results may not be fully reproducible.