
Let me explain what I want to do:

Input

A CSV file with millions of rows, each containing a user id and a string holding the space-separated list of keywords used by that user. The format of the second field is not so important; I can change it to suit my needs, for example by adding the count of each keyword. The data comes from Twitter: the users are Twitter users and the keywords are "meaningful" words taken from their tweets (how they are extracted is not important).
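A minimal sketch of reading this input, assuming every row has exactly two quoted fields and that a blank may follow the comma (as in the sample row in this question). The function name `parse_rows` is hypothetical; keywords are counted per user so that repeated keywords are not lost.

```python
import csv
import io
from collections import Counter

def parse_rows(lines):
    """Yield (user_id, Counter of keywords) for each well-formed CSV row."""
    # skipinitialspace=True ignores the blank that follows the comma,
    # so the second quoted field is still recognised as quoted.
    reader = csv.reader(lines, skipinitialspace=True)
    for row in reader:
        if len(row) != 2:
            continue  # assumption: silently skip malformed rows
        user_id, keywords = row
        # split() with no argument also drops the leading space in the field
        yield user_id, Counter(keywords.split())

# Stand-in for the real file; mirrors the sample row from the question.
sample = io.StringIO('"1627498372", " play house business card"\n')
rows = list(parse_rows(sample))
```

Because `parse_rows` is a generator, a file with millions of rows can be streamed (for example from `open(path, newline="")`) without loading it into memory at once.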

SAMPLE ROW

This is currently what a single row of the csv looks like:
(user id, keywords)

"1627498372", " play house business card"  

Goal

Given the input, I want to cluster users in Java based on the keywords they use, so that the clusters represent users with similar interests (and therefore similar keyword usage), without using machine learning techniques, natural language processing, or parallelization techniques like MapReduce. I have looked at many clustering algorithms and libraries, such as BIRCH, BFR, CURE, ROCK and CLARANS, but none of them seems to suit my needs: either they are designed for spatial points, or they use machine learning models, or they struggle with large datasets.

So I am asking whether you know of clustering algorithms, libraries (preferably JARs), or reasonably implementable pseudocode that work on text, or that can easily be modified to work with strings.

Hope everything is clear.

UPDATE

While waiting for responses I came upon the scikit-learn library for Python, in particular MiniBatchKMeans, and I am experimenting with it for now. So, as an update: if you find something in Python, feel free to share.

  • Are you allowed to extract features from these keywords? i.e. letter count – Oscar Martinez Aug 31 '18 at 14:16
  • If you mean counting the occurrences of each keyword, yes I can. –  Aug 31 '18 at 14:48
  • I mean you can extract the number of chars per keyword and make clusters grouping users with similar char count. – Oscar Martinez Aug 31 '18 at 14:53
  • Actually, I forgot to mention it, but the clusters need to represent users with similar interests, and therefore similar keyword usage. So I don't think a simple letter count is the right thing to do. –  Aug 31 '18 at 14:55
  • That's important, and could you post some sample rows? – Oscar Martinez Aug 31 '18 at 14:58
  • Done, even though the csv structure is not so important, I can change it based on the needs of the algorithm. –  Aug 31 '18 at 15:11
  • As an idea, sort each user's keywords alphabetically (and maybe join them all). Then you can apply some string comparison algorithm [(read this)](https://stackoverflow.com/questions/6690739/fuzzy-string-comparison-in-python-confused-with-which-library-to-use). For your example the input could be `businesscardhouseplay`. – Oscar Martinez Aug 31 '18 at 15:20
  • Mmhh that could be interesting to exploit as similarity, but the problem still remains, huge datasets and no idea what model to use. Thank you anyway. –  Aug 31 '18 at 15:34
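
The idea from the comments (sort each user's keywords, join them, then compare the resulting strings) can be sketched roughly as follows. This is only an illustration: the helper names `canonical` and `similarity` are hypothetical, and `difflib.SequenceMatcher` from the standard library is swapped in as one possible fuzzy comparison, not necessarily the one the commenter had in mind.

```python
from difflib import SequenceMatcher

def canonical(keywords):
    """Sort the space-separated keywords alphabetically and join them."""
    return "".join(sorted(keywords.split()))

def similarity(a, b):
    """Fuzzy similarity in [0, 1] between two users' keyword strings."""
    return SequenceMatcher(None, canonical(a), canonical(b)).ratio()

key = canonical(" play house business card")
same = similarity("play house business card", "card business house play")
```

Sorting makes the comparison independent of keyword order. Comparing every pair of users this way is quadratic in the number of users, so on its own it does not address the dataset-size concern.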

1 Answer


Instead of clustering (how many clusters? What about users that do not fit any cluster?), you should rather consider frequent itemset mining to find popular combinations of keywords.
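
As a rough illustration of what frequent itemset mining looks like here, the sketch below finds frequent keyword pairs only, using the Apriori pruning idea (a pair can be frequent only if both of its keywords are). The function name `frequent_pairs`, the `min_support` threshold and the four example users are all made up for the example; larger itemsets or a dedicated algorithm such as FP-growth would need more work.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return {pair: user_count} for keyword pairs used by >= min_support users."""
    # Pass 1: count in how many baskets each single keyword appears.
    item_counts = Counter(kw for basket in baskets for kw in set(basket))
    frequent = {kw for kw, n in item_counts.items() if n >= min_support}
    # Pass 2: count pairs made only of frequent keywords (Apriori pruning).
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent)  # sorted so each pair has one form
        pair_counts.update(combinations(kept, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

# Hypothetical data: one keyword list per user.
baskets = [
    ["play", "house", "business", "card"],
    ["business", "card", "meeting"],
    ["play", "house", "garden"],
    ["business", "card", "play"],
]
result = frequent_pairs(baskets, min_support=2)
```

With this toy data, `("business", "card")` is the most common pair, used by three of the four users. Note that `baskets` is iterated twice, so for a large file it should be something that can be re-read, not a one-shot generator.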

Has QUIT--Anony-Mousse