
I have the following section of code that maps the TF-IDF for a collection of tweets back onto the original words, which are then used to find the top words in each cluster:

#document = sc.textFile("<text file path>").map(lambda line: line.split(" "))
#"tfidf" is an rdd of tweets contained in "document"
#map tfidf to original tweets and cluster similar tweets
clusterIds = clusters.predict(tfidf)
mapped_value = clusterIds.zip(document)
cluster_value = mapped_value.reduceByKey(lambda a,b: a+b).take(cluster_num)


#Fetch the top 5 words from each cluster
topics = []
for i in cluster_value:
    word_count = sc.parallelize(i[1])
    topics.append(
        word_count.map(lambda x: (x,1))
            .reduceByKey(lambda x,y: x+y)
            .takeOrdered(5, key=lambda x: -x[1]))

Is there a better way to do this? The Spark UI shows that the reduceByKey() step takes around 70 minutes on a cluster of 4 VMs with 20.5 GB of executor memory and 2 GB of driver memory. There are 500K tweets, and the text file is about 31 MB after removing stop words and junk characters.

1 Answer


Since you didn't provide a Minimal, Complete, and Verifiable example, I can only assume that the document RDD contains tokenized text. So let's create a dummy example:

mapped_value = sc.parallelize(
    [(1, "aabbc"), (1, "bab"), (2, "aacc"), (2, "acdd")]).mapValues(list)
mapped_value.first()
## (1, ['a', 'a', 'b', 'b', 'c'])

One thing you can do is to aggregate all clusters at the same time:

from collections import Counter

# the first document seen for a key becomes a Counter
create_combiner = Counter

# fold another document's tokens into an existing Counter
def merge_value(cnt, doc):
    cnt.update(Counter(doc))
    return cnt

# merge the per-partition Counters
def merge_combiners(cnt1, cnt2):
    cnt1.update(cnt2)
    return cnt1

topics = (mapped_value
    .combineByKey(create_combiner, merge_value, merge_combiners)
    .mapValues(lambda cnt: cnt.most_common(2)))

topics
## [(1, [('b', 4), ('a', 3)]), (2, [('a', 3), ('c', 3)])]
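
In your case the same aggregation can be applied directly to the clusterIds.zip(document) pairs from the question, without any collect / parallelize round trips. Roughly (an untested sketch, reusing the combiners defined above and your top-5 requirement):

# reuse create_combiner / merge_value / merge_combiners from above
topics = (clusterIds.zip(document)
    .combineByKey(create_combiner, merge_value, merge_combiners)
    .mapValues(lambda cnt: cnt.most_common(5))   # top 5 words per cluster
    .collectAsMap())                             # {clusterId: [(word, count), ...]}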

You can further improve on that by replacing Counter with a plain dict and counting / updating manually, but I don't think it is worth all the fuss.
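
If you do want to try that route, a dict-based version of the same combiners might look roughly like this (untested sketch, same combineByKey contract as above):

# plain-dict variant of the combiners; avoids Counter overhead
# at the cost of a little manual bookkeeping
def create_combiner_dict(doc):
    counts = {}
    for w in doc:
        counts[w] = counts.get(w, 0) + 1
    return counts

def merge_value_dict(counts, doc):
    for w in doc:
        counts[w] = counts.get(w, 0) + 1
    return counts

def merge_combiners_dict(c1, c2):
    for w, n in c2.items():
        c1[w] = c1.get(w, 0) + n
    return c1

topics = (mapped_value
    .combineByKey(create_combiner_dict, merge_value_dict, merge_combiners_dict)
    .mapValues(lambda counts: sorted(counts.items(), key=lambda kv: -kv[1])[:2]))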

What are the gains?

  • first of all, you reduce the amount of data that has to be moved (serialized - transferred - deserialized). In particular, you don't collect just to send the data back to the workers.

    Collecting and sending is expensive, so you should avoid it unless it is the only option. If aggregating the whole dataset at once is too expensive, a preferable approach could be a repeated filter, equivalent to something like this:

    [rdd.filter(lambda kv: kv[0] == i).map(...).reduce(...)
        for i in range(number_of_clusters)]
    
  • you start only one job, not a job per cluster, and starting a job is not cheap (see my answer to Spark MLLib's LassoWithSGD doesn't scale? for an example). How much you gain here depends on the number of clusters.

  • since the data is not flattened, there is simply less to do. Concatenating lists gives you nothing and requires a lot of copying, while using dictionaries reduces the amount of stored data and updating in place requires no copies. You can try to improve even more by adjusting merge_value:

    def merge_value(cnt, doc):
        # count tokens directly instead of building an intermediate Counter
        for v in doc:
            cnt[v] += 1
        return cnt
    

Side notes:

  • with 30 MB of data and 20.5 GB of memory I wouldn't bother with Spark at all. Since k-means requires very little additional memory, you can fit multiple models in parallel locally at a much lower cost (see the sketch below).
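
    A purely local sketch (assuming scikit-learn and joblib, neither of which appears in the question, so treat this as illustrative only):

    from joblib import Parallel, delayed
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # your preprocessed tweets as plain strings (dummy data here)
    texts = ["spark cluster tweet", "another spark tweet",
             "python counter words", "tweet about python"]
    X = TfidfVectorizer().fit_transform(texts)

    def fit_kmeans(k):
        # one model per candidate number of clusters
        return KMeans(n_clusters=k).fit(X)

    # fit several models at once on local cores
    models = Parallel(n_jobs=-1)(delayed(fit_kmeans)(k) for k in (2, 3))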