This script creates an accumulator, accumulate_dict. It appears to work well on the driver; for example, inc("foo") followed by inc("foo") updates the accumulator to Accumulator<id=1, value={'foo': 2}>. But when I run the last line, which applies it to an RDD, it fails with:

File "<stdin>", line 6, in addInPlace
TypeError: unhashable type: 'dict'

Does PySpark try to hash the accumulator in some way? How can I update this dictionary using an accumulator?

from pyspark import AccumulatorParam, SparkContext

sc = SparkContext("local[2]", "dict-accumulator")  # not needed in the pyspark shell, where sc already exists

rdd = sc.parallelize(["foo", "bar", "foo", "foo", "bar"])

class SAP(AccumulatorParam):
    def zero(self, value):
        # initial per-task value
        return value.copy()
    def addInPlace(self, v1, v2):
        # intended: treat v2 as a key and bump its count
        v3 = dict(v1)
        v3[v2] = v3.get(v2, 0) + 1
        return v3

accumulate_dict = sc.accumulator({}, SAP())

def inc(x):
    global accumulate_dict
    accumulate_dict += x

rdd.foreach(inc)
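
For context on where the traceback points: when a task finishes, PySpark merges that task's local accumulator value into the driver's value by calling addInPlace with two accumulated values, so the second argument is a whole dict rather than a single key (the first comment below makes the same point: addInPlace is a reduce-like operation). A minimal sketch of that failing merge, outside Spark; the variable names here are illustrative, not from the script above:

param = SAP()
driver_value = {}                 # value held on the driver
task_value = {'foo': 1}           # local value built inside one task
param.addInPlace(driver_value, task_value)
# v3[v2] uses the task's dict as a key -> TypeError: unhashable type: 'dict'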

References:
  • Source code for pyspark.accumulators
  • Stack Overflow question

– The_Coder
  • This is not how the accumulator works. `AccumulatorParam.addInPlace` [is](https://github.com/zero323/pyspark-stubs/blob/9af2eda12c1982aa52b2bc910a38b08d93396d2b/third_party/3/pyspark/accumulators.pyi#L37) `def addInPlace(self, value1: T, value2: T) -> T: ...` (both arguments and the return type have to be the same - it is a `reduce`-like operation). See [this](https://stackoverflow.com/q/41398242/). – 10465355 Feb 06 '20 at 14:06
  • The solution with Counters that you have linked to is perfect for my requirement. Thank you very much. – The_Coder Feb 06 '20 at 14:17
  • I did try that with a dictionary, where both arguments were dictionaries, but I'll try it again and update my question with that snippet as well.
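
Building on the comment above: addInPlace has to behave like a reduce, taking and returning accumulated values of the same type. A minimal sketch of a dict-merging accumulator along those lines, reusing the sc, rdd, and AccumulatorParam import from the script above; DictParam, counts, and count_word are illustrative names, not taken from the linked answer (which uses collections.Counter instead):

class DictParam(AccumulatorParam):
    def zero(self, value):
        # per-task starting value
        return {}
    def addInPlace(self, v1, v2):
        # both arguments are dicts; merge the counts key by key
        for k, n in v2.items():
            v1[k] = v1.get(k, 0) + n
        return v1

counts = sc.accumulator({}, DictParam())

def count_word(word):
    counts.add({word: 1})   # add a one-entry dict, not a bare key

rdd.foreach(count_word)
# counts.value on the driver -> {'foo': 3, 'bar': 2} for the sample RDD above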

0 Answers