
I am training a U-Net model with TensorFlowOnSpark on a dataset of images that fits in memory on my Spark cluster. The cluster has 3 worker nodes (each running Ubuntu 20.04 with 11 GB of RAM), and each node runs 1 executor with 4 CPU cores and 9 GB of memory.
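
For context, my launcher follows the standard TFCluster.run pattern; the sketch below is simplified and not my exact script (I'm assuming InputMode.TENSORFLOW here, and main_fun is a placeholder for the actual U-Net training code):

    from pyspark import SparkConf, SparkContext
    from tensorflowonspark import TFCluster
    import argparse

    def main_fun(args, ctx):
        # Runs on every executor: build the U-Net model and its tf.data
        # pipeline here and call model.fit() on this worker's data.
        pass

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--cluster_size", type=int, default=3)
        args = parser.parse_args()

        sc = SparkContext(conf=SparkConf().setAppName("u_net_train"))
        cluster = TFCluster.run(sc, main_fun, args, args.cluster_size,
                                num_ps=0, tensorboard=False,
                                input_mode=TFCluster.InputMode.TENSORFLOW,
                                master_node='chief')
        cluster.shutdown()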

When training starts, each executor still has at least 2 GB of free memory, but memory usage keeps growing as more batches are processed, until the whole job eventually fails with an out-of-memory error.
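
To make the per-batch growth measurable inside the executor's Python process, training can be instrumented with a callback along these lines (an illustrative sketch, not part of my original script; it assumes psutil is installed on the workers):

    import os
    import psutil
    import tensorflow as tf

    class MemoryLogger(tf.keras.callbacks.Callback):
        """Print the resident memory (RSS) of this process every N batches."""
        def __init__(self, every_n_batches=50):
            super().__init__()
            self.every_n_batches = every_n_batches
            self.process = psutil.Process(os.getpid())

        def on_train_batch_end(self, batch, logs=None):
            if batch % self.every_n_batches == 0:
                rss_mb = self.process.memory_info().rss / (1024 ** 2)
                print(f"batch {batch}: RSS = {rss_mb:.0f} MB")

    # Usage inside the per-executor training function:
    # model.fit(dataset, epochs=..., callbacks=[MemoryLogger()])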

I tried the same code in a single-node configuration (1 Spark worker) and got the same result, but the code works fine when run with distributed TensorFlow on 1 CPU without Spark.

Command used:

spark-submit --master spark://master:7077 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.cores.max=12 \
  --conf spark.executor.memory=8g \
  --conf spark.executor.memoryOverhead=8192 \
  --conf spark.memory.storageFraction=0.1 \
  u_net_train.py --cluster_size 3

Why does memory usage keep increasing like this, and how can I fix it?

