This script creates an accumulator, accumulate_dict, backed by a custom AccumulatorParam. It appears to work well on the driver: for example, calling inc("foo") twice updates the accumulator to Accumulator<id=1, value={'foo': 2}>.
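Roughly, the driver-side session looks like this (the exact accumulator id in the repr may differ):

>>> inc("foo")
>>> inc("foo")
>>> accumulate_dict
Accumulator<id=1, value={'foo': 2}>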
But when I run the last line, rdd.foreach(inc), which applies inc to the RDD, it fails with:

  File "<stdin>", line 6, in addInPlace
TypeError: unhashable type: 'dict'

Does PySpark try to hash the accumulator in some way? How can I update this dictionary using an accumulator?
from pyspark import AccumulatorParam, SparkContext

# sc is the SparkContext provided by the pyspark shell
rdd = sc.parallelize(["foo", "bar", "foo", "foo", "bar"])

class SAP(AccumulatorParam):
    def zero(self, value):
        return value.copy()

    def addInPlace(self, v1, v2):
        # count one occurrence of the key v2 in a copy of v1
        v3 = dict(v1)
        v3[v2] = v3.get(v2, 0) + 1
        return v3

accumulate_dict = sc.accumulator({}, SAP())

def inc(x):
    global accumulate_dict
    accumulate_dict += x

rdd.foreach(inc)
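For what it's worth, the error itself is easy to reproduce outside Spark. My assumption (I have not confirmed this against the pyspark.accumulators source) is that when Spark merges the per-task accumulator values back on the driver, addInPlace is called with two dicts, so v2 is a dict rather than a single key:

# Suspected merge-time call (my assumption): both arguments are dicts.
v1 = {}                      # the driver-side accumulator value
v2 = {'foo': 2, 'bar': 1}    # a hypothetical per-task partial result
v3 = dict(v1)
v3[v2] = v3.get(v2, 0) + 1   # TypeError: unhashable type: 'dict'

If that is what happens, addInPlace would need to merge two dicts rather than count a single key, but I am not sure that is the intended contract.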
References:
Source code for pyspark.accumulators
Stackoverflow question