I have a data sink that can only be written to through a specific Python API. For that I am (ab)using foreachPartition(writing_func) from PySpark, which works pretty well. I wonder if it's possible to somehow update the task metrics - specifically setBytesWritten - at the end of every partition. On the surface it seems impossible to me, for two reasons:
- I don't think there is an open Py4J gateway in a task context
- TaskMetrics is accessed via a ThreadLocal, so even with an open gateway it looks pretty tricky to get hold of the right thread
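
For context, this is a minimal sketch of the pattern I mean. ExternalWriter here is a hypothetical stand-in for the real Python-only API; the byte count it tallies per partition is exactly the number I would like to feed into setBytesWritten, but as far as I can tell it can only be logged locally on the worker:

```python
class ExternalWriter:
    """Hypothetical stand-in for the Python-only sink API."""
    def __init__(self):
        self.bytes_written = 0

    def write(self, record: bytes) -> None:
        self.bytes_written += len(record)

    def close(self) -> None:
        pass

def writing_func(partition):
    """Runs on the executor; `partition` is an iterator of rows."""
    writer = ExternalWriter()
    try:
        for row in partition:
            writer.write(str(row).encode("utf-8"))
    finally:
        writer.close()
    # writer.bytes_written now holds the value I would like to push
    # into the task's TaskMetrics (setBytesWritten) - no obvious hook.

# Driver side, assuming an existing SparkSession `spark`:
# spark.sparkContext.parallelize(range(100)).foreachPartition(writing_func)
```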
Does anyone know of an existing solution or a workaround?