
I am using Spark on EMR to process data. Basically, I read data from AWS S3, apply transformations, and after the transformations I load/write the data to Oracle tables.
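For context, the job is roughly shaped like this (a simplified sketch; the bucket path, column names, and Oracle connection details below are placeholders, not my real values):

```python
# Hypothetical sketch of the job: read from S3, transform, write to Oracle via JDBC.
# All names (bucket, columns, table, credentials) are placeholders.
def run_job():
    # pyspark is imported inside the function; the script is submitted via spark-submit
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3-to-oracle")
             .getOrCreate())

    # Read the source data from S3 (placeholder path)
    df = spark.read.parquet("s3://my-bucket/input/")

    # Stand-in transformation: filter then aggregate
    out = df.filter(df["status"] == "ACTIVE").groupBy("id").count()

    # Write the result to an Oracle table over JDBC (placeholder connection)
    (out.write
        .format("jdbc")
        .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
        .option("dbtable", "TARGET_TABLE")
        .option("user", "app_user")
        .option("password", "app_password")
        .mode("append")
        .save())

    spark.stop()
```

Note that nothing in the job calls `write` against HDFS; the only sink is the Oracle JDBC table.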

Recently we found that HDFS (/mnt/hdfs) utilization is getting too high.

I am not writing any data to HDFS (/mnt/hdfs) myself, yet Spark is creating blocks and writing data into it. We are doing all the operations in memory.

Why is Spark still writing data to the data node?

Is there any specific operation that writes data to the datanode (HDFS)?

Here are the HDFS dirs created:

    15.4G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized/subdir1
    129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized
    129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current
    129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812
    129G   /mnt/hdfs/current
    129G   /mnt/hdfs
