Since you didn't provide a Minimal, Complete, and Verifiable example, I can only assume that your document
RDD contains tokenized text. So let's create a dummy example:
mapped_value = sc.parallelize(
    [(1, "aabbc"), (1, "bab"), (2, "aacc"), (2, "acdd")]).mapValues(list)
mapped_value.first()
## (1, ['a', 'a', 'b', 'b', 'c'])
One thing you can do is to aggregate all clusters at the same time:
from collections import Counter

# createCombiner: turn the first document seen for a key into a Counter
create_combiner = Counter

# mergeValue: fold another document into the per-partition Counter
def merge_value(cnt, doc):
    cnt.update(Counter(doc))
    return cnt

# mergeCombiners: combine Counters coming from different partitions
def merge_combiners(cnt1, cnt2):
    cnt1.update(cnt2)
    return cnt1
topics = (mapped_value
    .combineByKey(create_combiner, merge_value, merge_combiners)
    .mapValues(lambda cnt: cnt.most_common(2)))
topics.collect()
## [(1, [('b', 4), ('a', 3)]), (2, [('a', 3), ('c', 3)])]
You can further improve on that by replacing Counter with a plain dict and counting / updating manually, but I don't think it is worth all the fuss; a sketch of that variant follows the list of gains below.
What are the gains?
- first of all, you reduce the amount of data that has to be moved (serialized - transferred - deserialized). In particular, you don't collect just to send data back to the workers. Collecting and sending is expensive, so you should avoid it unless it is the only option. If aggregation on the whole dataset is too expensive, a preferable approach could be a repeated filter,
equivalent to something like this:
[rdd.filter(lambda kv: kv[0] == i).map(...).reduce(...)
 for i in range(number_of_clusters)]
- you start only one job, not a job per cluster, and starting a job is not cheap (see my answer to Spark MLLib's LassoWithSGD doesn't scale? for an example). How much you can gain here depends on the number of clusters.
- since the data is not flattened, there is simply less to do. Concatenating lists gives you nothing and requires a lot of copying, using dictionaries can reduce the amount of stored data, and updating in place requires no copies. You can try to improve even more by adjusting merge_value:
def merge_value(cnt, doc):
    # count tokens directly instead of building an intermediate Counter
    for v in doc:
        cnt[v] += 1
    return cnt
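If you take that one step further and replace Counter with a plain dict as mentioned earlier, a minimal sketch could look like this (I'm assuming defaultdict and heapq.nlargest here as stand-ins for Counter and most_common; they are not part of the code above):

from collections import defaultdict
from heapq import nlargest

def create_combiner(doc):
    # build an initial count dict from the first document seen for a key
    cnt = defaultdict(int)
    for token in doc:
        cnt[token] += 1
    return cnt

def merge_value(cnt, doc):
    # fold another document into an existing count dict, in place
    for token in doc:
        cnt[token] += 1
    return cnt

def merge_combiners(cnt1, cnt2):
    # merge per-partition count dicts, in place
    for token, n in cnt2.items():
        cnt1[token] += n
    return cnt1

topics = (mapped_value
    .combineByKey(create_combiner, merge_value, merge_combiners)
    .mapValues(lambda cnt: nlargest(2, cnt.items(), key=lambda kv: kv[1])))

Whether the extra bookkeeping actually beats Counter depends on your data, so measure before committing.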
Side notes:
- with 30 MB of data and 20.5 GB of memory I wouldn't bother with Spark at all. Since k-means requires very little additional memory you can create multiple models in parallel locally at much lower cost.
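Purely as a sketch, and assuming scikit-learn, joblib, and NumPy are available (none of which are mentioned above), training one local model per group in parallel could look roughly like this; the random per-group matrices are hypothetical placeholders for your real features:

import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import KMeans

def fit_one(X, k=10):
    # fit a single, purely local k-means model on one group's data
    return KMeans(n_clusters=k).fit(X)

# hypothetical per-group feature matrices, one per subset of documents
groups = {1: np.random.rand(1000, 20), 2: np.random.rand(1000, 20)}

# train all models side by side with local processes instead of Spark jobs
models = dict(zip(
    groups,
    Parallel(n_jobs=-1)(delayed(fit_one)(X) for X in groups.values())))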