I am using Spark on EMR to process data. Basically, I read data from AWS S3, run transformations, and after the transformations I load/write the data into Oracle tables.
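For context, the job is shaped roughly like this (a minimal sketch only; the S3 path, filter, table name, and JDBC connection details are placeholders, not the real ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-to-oracle").getOrCreate()

# Read the source data from S3 (placeholder path)
df = spark.read.parquet("s3://my-bucket/input/")

# Transformations happen here (placeholder example)
transformed = df.filter(df["status"] == "ACTIVE")

# Write the result to an Oracle table over JDBC (placeholder connection
# details; requires the Oracle JDBC driver on the classpath)
(transformed.write
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
    .option("dbtable", "TARGET_TABLE")
    .option("user", "username")
    .option("password", "password")
    .option("driver", "oracle.jdbc.OracleDriver")
    .mode("append")
    .save())
```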
Recently we have found that HDFS (/mnt/hdfs) utilization is getting too high.
I am not writing any data to HDFS (/mnt/hdfs) myself, yet it seems that Spark is creating blocks and writing data into it. We are doing all the operations in memory.
Why is Spark still writing data to the data node?
Is there a specific operation that writes data to the datanode (HDFS)?
Here are the HDFS directories that were created:
```
15.4G /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized/subdir1
129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized
129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current
129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812
129G  /mnt/hdfs/current
129G  /mnt/hdfs
```
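The listing above is local disk usage of the datanode's block directory (presumably gathered with `du`). To see which HDFS paths actually own this space, the namespace can be inspected with the standard HDFS tools; a sketch of the commands, nothing project-specific assumed:

```sh
# Per-path usage as seen by the HDFS namespace
hdfs dfs -du -h /

# Map files to their blocks, to trace block files back to HDFS paths
hdfs fsck / -files -blocks
```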