I am using Spark on EMR to process data. Basically, I read data from AWS S3, run transformations, and after the transformations I load/write the data into Oracle tables.
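For context, the job is shaped roughly like this (a minimal sketch only; the S3 path, filter, table name, and JDBC connection details are placeholders, not the real ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-to-oracle").getOrCreate()

# Read the source data from S3 (placeholder path)
df = spark.read.parquet("s3://my-bucket/input/")

# Transformations happen here (placeholder example)
transformed = df.filter(df["status"] == "ACTIVE")

# Write the result to an Oracle table over JDBC (placeholder connection
# details; requires the Oracle JDBC driver on the classpath)
(transformed.write
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
    .option("dbtable", "TARGET_TABLE")
    .option("user", "username")
    .option("password", "password")
    .option("driver", "oracle.jdbc.OracleDriver")
    .mode("append")
    .save())
```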
Recently we have found that HDFS (/mnt/hdfs) utilization is getting too high.
I am not writing any data to HDFS (/mnt/hdfs) myself, yet it seems that Spark is creating blocks and writing data into it. We are doing all the operations in memory.
Why is Spark still writing data to the data node?
Is there a specific operation that writes data to the datanode (HDFS)?
Here are the HDFS directories that were created:
```
15.4G /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized/subdir1
129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized
129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current
129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812
129G  /mnt/hdfs/current
129G  /mnt/hdfs
```
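The listing above is local disk usage of the datanode's block directory (presumably gathered with `du`). To see which HDFS paths actually own this space, the namespace can be inspected with the standard HDFS tools; a sketch of the commands, nothing project-specific assumed:

```sh
# Per-path usage as seen by the HDFS namespace
hdfs dfs -du -h /

# Map files to their blocks, to trace block files back to HDFS paths
hdfs fsck / -files -blocks
```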