
In Spark Streaming I have set these parameters as below:

    spark.worker.cleanup.enabled true
    spark.worker.cleanup.interval 60
    spark.worker.cleanup.appDataTtl 90
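For context, these are worker-side properties; a minimal sketch of how they are commonly passed to a standalone worker via SPARK_WORKER_OPTS in conf/spark-env.sh (the values are copied from above, and the assumption that both interval and appDataTtl are in seconds should be checked against your Spark version's docs):

    # conf/spark-env.sh on each standalone worker (sketch)
    # interval and appDataTtl are interpreted in seconds
    SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
      -Dspark.worker.cleanup.interval=60 \
      -Dspark.worker.cleanup.appDataTtl=90"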

This clears out the data of already-killed Spark batch/streaming jobs in the work/app-2016*/(1,2,3,4,5,6,...) folders. But while a Spark Streaming job is running, the history data in its current app-* directory is not deleted. Since we are using the Kafka-Spark connector jar, for every micro batch it copies this jar, along with the app jar and the stderr/stdout results, into each of those folders (work/app-2016*/(1,2,3,4,5,6,...)). This by itself eats up a lot of disk space, since the Kafka-Spark connector is an uber jar of around 15 MB, and in a day it adds up to about 100 GB.

Is there a way to delete data from the currently running Spark Streaming job, or should we do some scripting for that?
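In case scripting turns out to be the way to go, here is a rough sketch of an external cron-style cleanup in Python. The work-directory path, the TTL, the size cap, and the work/app-*/<executor-id>/ layout are all assumptions for illustration, not something the question or Spark guarantees; test it on a non-critical worker first.

    #!/usr/bin/env python
    # Rough cleanup sketch for a standalone worker's work/ directory.
    # Everything here (paths, thresholds, layout work/app-*/<executor-id>/)
    # is an assumption for illustration -- adapt and test before using.
    import glob
    import os
    import time

    WORK_DIR = "/opt/spark/work"     # assumed worker work directory
    JAR_TTL_SECONDS = 6 * 3600       # assumed: drop jar copies older than 6h
    LOG_MAX_BYTES = 256 * 1024 ** 2  # assumed: truncate logs beyond 256 MB

    now = time.time()
    for executor_dir in glob.glob(os.path.join(WORK_DIR, "app-*", "*")):
        if not os.path.isdir(executor_dir):
            continue
        # Reclaim duplicated jar copies (app jar + Kafka-Spark uber jar).
        for jar in glob.glob(os.path.join(executor_dir, "*.jar")):
            try:
                if now - os.path.getmtime(jar) > JAR_TTL_SECONDS:
                    os.remove(jar)
            except OSError:
                pass  # gone already or not accessible; skip
        # Truncate oversized stderr/stdout in place; truncation (unlike
        # deletion) frees space even while an executor keeps the file open.
        for name in ("stderr", "stdout"):
            log = os.path.join(executor_dir, name)
            try:
                if os.path.getsize(log) > LOG_MAX_BYTES:
                    open(log, "w").close()
            except OSError:
                pass  # log missing for this executor; skip

For the log part specifically, Spark's built-in executor log rolling (the spark.executor.logs.rolling.* properties) may already cap stderr/stdout growth without any external script; removing jar copies from under a live executor is riskier, so that piece deserves extra caution.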

Santosh B
  • *Since we are using Kafka-Spark connector jar, for every micro batch it copies this jar with app jar and stderr,stdout results on each folders* That makes no sense. The JAR should only be copied once at job submission, not for every micro-batch. Perhaps you're seeing log files blow up? – Yuval Itzchakov Mar 23 '16 at 19:43
  • Nope, it's adding the Kafka-Spark jar in each directory for each micro-batch, and we are using PySpark. – Santosh B Mar 24 '16 at 10:09

0 Answers