
I am looking at the documentation example for mapValues:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
x = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])
def f(x): return len(x)
x.mapValues(f).collect()
# [('a', 3), ('b', 1)]

My question is: where does this mapValues function actually execute? Does it run in a Python worker process, inside the off-heap memory bounded by spark.executor.memoryOverhead (or by spark.executor.pyspark.memory, if that property is set)? Or is PySpark able to translate the function into equivalent Java that would run on-heap inside the executor JVM?
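
For reference, here is a small diagnostic sketch along the lines of what I am asking about. It reuses the same sc and x as above; tagging each value with the worker's PID and interpreter path is my own illustration, not part of the documentation example. If the function were translated to JVM code, there would be no separate Python interpreter to report.

import os
import sys

def where_am_i(v):
    # Report the value's length together with the PID and Python
    # executable of the process that evaluated this function. In
    # practice this comes from a forked PySpark worker process,
    # separate from the executor JVM.
    return (len(v), os.getpid(), sys.executable)

x.mapValues(where_am_i).collect()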

