
I have a data output source that can only be written to via a specific Python API. For that I am (ab)using foreachPartition(writing_func) from PySpark, which works pretty well. I wonder if it's possible to somehow update the task metrics - specifically setBytesWritten - at the end of every partition. On the surface it seems impossible to me, for 2 reasons:

  1. I don't think there is an open py4j gateway in a task context
  2. TaskMetrics is accessed via ThreadLocal, so even with an open gateway it looks pretty tricky to get the right thread

Does anyone know of an existing solution or a workaround?
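
For context, here is a minimal sketch of the pattern I mean. The `encode` call is just a placeholder for the real Python-only writer API, and `writing_func` is the per-partition callback:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def writing_func(rows):
    # Runs on an executor inside a task: there is no py4j gateway here,
    # so the JVM-side TaskMetrics object cannot be reached from this code.
    bytes_written = 0
    for row in rows:
        payload = str(row).encode("utf-8")  # placeholder for the real Python-only sink API
        bytes_written += len(payload)
    # There is no supported call at this point to push `bytes_written`
    # into the task's output metrics (setBytesWritten) from Python.

spark.range(100).foreachPartition(writing_func)
```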


1 Answer


You can use accumulators; they are meant for reporting custom metrics back into the Spark UI/REST API. There is one caveat: if a job fails and is retried, the accumulator will over-report, but on the happy path they should be a good solution for you. For example:

```python
>>> a = sc.accumulator(7)          # accumulator holding an initial value of 7
>>> a.value
7
>>> sc.accumulator(1.0).value
1.0
>>> sc.accumulator(1j).value
1j
>>> rdd = sc.parallelize([1, 2, 3])
>>> def f(x):
...     global a
...     a += x
...
>>> rdd.foreach(f)
>>> a.value
13
>>> b = sc.accumulator(0)
>>> def g(x):
...     b.add(x)
...
>>> rdd.foreach(g)
>>> b.value
6
```
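
Applied to your foreachPartition case, a rough sketch could look like the following. The `encode` call is a hypothetical stand-in for your real write API, and note that only the driver can read the accumulator's value:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Driver-side accumulator; tasks can only add to it, the driver reads it.
bytes_written_acc = sc.accumulator(0)

def writing_func(rows):
    written = 0
    for row in rows:
        written += len(str(row).encode("utf-8"))  # stand-in for the real sink's write call
    bytes_written_acc.add(written)  # merged back on the driver when the task completes

spark.range(1000).foreachPartition(writing_func)
print("approximate bytes written:", bytes_written_acc.value)
```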
  • Thanks, but I'm looking for updating the TaskMetrics specifically :) I need the metrics to be attached to the task – shay__ Jun 20 '22 at 13:01