K-means Clustering with R

Question

I'm trying to cluster some data using k-means clustering in R. The data to be clustered is a specific set of features from a sample of tweets. The tweets are labelled as either x or y. An example of the data is shown below; the usernames and IDs are removed, as these fields are not used for clustering.

There are 24.6k data items in total, with roughly 17k labelled y and the rest labelled x. I would expect clustering to produce two clusters, each containing roughly the corresponding number of items. However, clustering allocates the vast majority of the data to one cluster and only a few thousand items to the second. The clustering results are below:

As you can see, almost all of the data is allocated to cluster 2.

I'm not sure what my problem is; it could be an issue either with the structure of my data or with my R implementation.

I have tried various methods of both clustering and plotting, including ggplot2. This question was of some use, but my results remained the same.

My R implementation is below. Note that the method of normalisation is taken from this answer. Can anyone point me in the right direction as to why my data is being allocated to the same cluster, even though I have two distinct labels?

Clustering.R

#Imports
library(jsonlite)
library(tm)
library(fpc)

#Includes
source("./Clustering_Functions.R")

#Program 
rawData <- getInput()
clusterData <- filterData(rawData)
clusterData <- scaleData(clusterData)
aCluster <- performClustering(clusterData)
table(rawData$stance, aCluster$cluster)
plotOutput(clusterData, aCluster)

Functions.R

getInput <- function() {
  json_file <- "path/file.json"

  #Set data to dataframe
  frame <- fromJSON(json_file)
  return(frame)
}

#Filter the raw data, remove columns not for clustering
filterData <- function(frame) {
  kcFrame <- frame[c( -3, -4, -9)]
  return (kcFrame)
}

#Scale the columns to uniform data, values 0-100
scaleData <- function(kcFrame) {
  doScale <- function(x) x* 100/max(x, na.rm = TRUE)
  kcFrame <- data.frame(lapply(kcFrame, doScale))
  return (kcFrame)
}

#Apply K-means clustering
performClustering <- function(kcFrame) {
  kc <- kmeans(kcFrame, centers = 2)
  return (kc)
}

#Graph the clusters
plotOutput <- function(kcFrame, kc) {
  plotcluster(kcFrame, kc$cluster)
}

EDIT: I suspect that the problem lies with my data; that there isn't enough of a distinction between label x and y in terms of the features.

score 0 · Answer 1 · answered Jul 05 '17 at 11:37

Your implementation looks fine to me. It may very well be the structure of your data; this kind of behavior is not uncommon. You often have a majority and a minority class/cluster: think of the majority cluster as originating from a "Healthy" distribution while the minority originates from an "Unhealthy" distribution (thinking in terms of diseases, for example).

Also consider that k-means is an unsupervised method, so it just aims at uncovering the biggest difference in the underlying data structure. This does not mean that it is the relevant difference for your aims. Consider again patients with and without a disease: if you cluster them with k-means, you may well get clusters that correspond to male and female rather than healthy and diseased.
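To illustrate the point about unsupervised methods, here is a minimal sketch on simulated data (not your tweets). The features f1 and f2, the hidden grouping "other", and the 300/700 label split are all made up for the example. The label only shifts f1 slightly, while the unrelated grouping shifts f2 a lot, so k-means is expected to follow the unrelated grouping:

```r
# Simulated illustration (hypothetical data, not the asker's tweets):
# the label has a weak effect on f1, while an unrelated hidden grouping
# has a strong effect on f2.
set.seed(42)
n <- 1000
label <- rep(c("x", "y"), times = c(300, 700))       # hypothetical labels
other <- sample(c("a", "b"), n, replace = TRUE)      # hidden, unrelated grouping
f1 <- rnorm(n, mean = ifelse(label == "x", 0, 0.5))  # weak label signal
f2 <- rnorm(n, mean = ifelse(other == "a", 0, 10))   # strong unrelated signal
sim <- data.frame(f1 = f1, f2 = f2)

# nstart = 25 reruns k-means from several random starts and keeps the best fit
kc <- kmeans(sim, centers = 2, nstart = 25)

# Cross-tabulate the clusters against the label and against the hidden grouping
label_tab <- table(label, kc$cluster)
other_tab <- table(other, kc$cluster)
print(label_tab)
print(other_tab)
```

Comparing the two tables should show which grouping the clusters track. The same kind of cross-tabulation against any other column you have (not only the x/y label) can hint at what the clusters in your own data correspond to.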

You could, for example, try increasing the number of clusters k, or opt for a supervised/semi-supervised clustering approach (there are quite a few options in R; Google is your friend there).
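As a rough sketch of the "increase k" suggestion only, the following fits k-means for several values of k and compares the fits. It uses the built-in iris data as a stand-in: iris[, 1:4] plays the role of your clusterData and iris$Species plays the role of rawData$stance. The range 2:6 and nstart = 25 are arbitrary choices for the illustration:

```r
# Stand-in data: the built-in iris measurements replace clusterData,
# and iris$Species replaces rawData$stance.
features <- scale(iris[, 1:4])  # centre each column and scale to unit variance
labels <- iris$Species

set.seed(1)
ks <- 2:6
fits <- lapply(ks, function(k) kmeans(features, centers = k, nstart = 25))

# Total within-cluster sum of squares for each k (look for an "elbow")
wss <- sapply(fits, function(fit) fit$tot.withinss)
names(wss) <- ks
print(wss)

# Cross-tabulate the known labels against the clusters for one chosen k
print(table(labels, fits[[which(ks == 3)]]$cluster))
```

With a larger k, a label that is spread over several sub-groups may show up as a combination of clusters in the cross-tabulation, even when k = 2 does not separate it.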
