I'm trying to cluster some data using k-means clustering in R. The data to be clustered is a specific set of features from a sample of tweets, each labelled as either x or y. An example of the data is shown below; the usernames and IDs have been removed, as these fields are not used for clustering.
There are 24.6k data items in total, with roughly 17k labelled y and the rest labelled x. After clustering, I would expect two clusters, each containing roughly the corresponding number of items. However, the clustering allocates the vast majority of the data to one cluster and only a few thousand items to the other. The clustering results are below:
As you can see, almost all of the data is allocated to cluster 2.
I'm not sure what my problem is; it could be an issue either with the structure of my data or with my R implementation.
I have tried various methods of both clustering and plotting, including ggplot2. This question was of some use, but my results remained the same.
My R implementation is below. Note that the method of normalisation is taken from this answer. Can anyone point me in the right direction as to why my data is being allocated to the same cluster, even though I have two distinct labels?
Clustering.R
#Imports
library(jsonlite)
library(tm)
library(fpc)
#Includes
source("./Clustering_Functions.R")
#Program
rawData <- getInput()
clusterData <- filterData(rawData)
clusterData <- scaleData(clusterData)
aCluster <- performClustering(clusterData)
table(rawData$stance, aCluster$cluster)
plotOutput(clusterData, aCluster)
Clustering_Functions.R
# Read the raw data from a JSON file into a data frame
getInput <- function() {
  json_file <- "path/file.json"
  frame <- fromJSON(json_file)
  return(frame)
}

# Filter the raw data, remove columns not used for clustering
filterData <- function(frame) {
  kcFrame <- frame[c(-3, -4, -9)]
  return(kcFrame)
}

# Scale each column to a uniform range by dividing by its maximum (values 0-100)
scaleData <- function(kcFrame) {
  doScale <- function(x) x * 100 / max(x, na.rm = TRUE)
  kcFrame <- data.frame(lapply(kcFrame, doScale))
  return(kcFrame)
}

# Apply k-means clustering with two centres
performClustering <- function(kcFrame) {
  kc <- kmeans(kcFrame, centers = 2)
  return(kc)
}

# Graph the clusters
plotOutput <- function(kcFrame, kc) {
  plotcluster(kcFrame, kc$cluster)
}
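For comparison, here is a minimal sketch of a variant that could be tried alongside the functions above. It swaps in z-score standardisation via `scale()` in place of dividing by the column maximum, and runs `kmeans` with several random starts via `nstart`, since a single start can settle on a poor local optimum. The names `toyFrame`, `scaleDataZ` and `performClusteringMulti`, and the two feature columns, are hypothetical stand-ins and not part of the real data or code.

```r
# Hypothetical stand-in for the filtered feature frame (not the real tweet data)
set.seed(42)
toyFrame <- data.frame(
  followers = c(rnorm(50, 100, 10), rnorm(50, 500, 10)),
  retweets  = c(rnorm(50, 5, 1),    rnorm(50, 20, 1))
)

# Z-score standardisation: each column ends up with mean 0 and sd 1
scaleDataZ <- function(kcFrame) {
  as.data.frame(scale(kcFrame))
}

# K-means with several random starts; kmeans keeps the best solution found
performClusteringMulti <- function(kcFrame, k = 2, starts = 25) {
  kmeans(kcFrame, centers = k, nstart = starts)
}

scaled <- scaleDataZ(toyFrame)
kc <- performClusteringMulti(scaled)

# Cluster sizes
table(kc$cluster)
```

If the cluster sizes stay heavily skewed after this change, the scaling method and the random initialisation are unlikely to be the cause.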
EDIT: I suspect that the problem lies with my data: there may not be enough of a distinction between labels x and y in terms of the features.
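One way to test this suspicion, as a minimal sketch on made-up data: k-means never sees the labels, so its clusters only line up with x and y if the features themselves separate them. The sketch compares per-label feature means and then computes cluster purity against the majority-label baseline. The frame `toy` and the columns `f1` and `f2` are hypothetical, and here the labels are deliberately generated independently of the features.

```r
set.seed(1)
n <- 200

# Hypothetical data: labels are assigned independently of the features
toy <- data.frame(
  stance = sample(c("x", "y"), n, replace = TRUE, prob = c(0.3, 0.7)),
  f1 = rnorm(n),
  f2 = rnorm(n)
)

# Per-label feature means: near-identical rows suggest little separation
labelMeans <- aggregate(cbind(f1, f2) ~ stance, data = toy, FUN = mean)
print(labelMeans)

# Cluster on the features only, then cross-tabulate against the labels
kc <- kmeans(toy[c("f1", "f2")], centers = 2, nstart = 25)
confusion <- table(toy$stance, kc$cluster)

# Purity: fraction of items that carry their cluster's majority label
purity <- sum(apply(confusion, 2, max)) / sum(confusion)

# Baseline: share of the most common label overall
baseline <- max(table(toy$stance)) / n

print(c(purity = purity, baseline = baseline))
```

Purity can never fall below the baseline, so a purity only slightly above it would indicate that the clusters carry little information about the labels.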