
This question has already been addressed, but I am not satisfied with the given answers.

Reading From Local Folder (3 GB)

[screenshots]

Reading From HDFS Folder (3 GB)

[screenshots]


As you can see, the longest-running job (1) takes roughly the same processing time in both setups. This is quite counter-intuitive: you would of course expect the distributed (HDFS) setup to run significantly faster than reading from a local folder.

I am using an AWS EMR cluster (latest version as of the posting date).

I would like concrete, specific insights on what exactly to look at in my benchmark, not only very general pointers to references on Spark theory, architecture, and settings.
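
For reference, here is a minimal sketch of such a local-vs-HDFS read benchmark (the paths, the Parquet format, and the `count()` action are illustrative assumptions, not my exact code):

```scala
import org.apache.spark.sql.SparkSession

object ReadBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadBenchmark").getOrCreate()

    // Illustrative paths: the same ~3 GB of monthly files in both locations.
    // A file:// path must exist on every executor node, not only on the driver.
    val sources = Seq("file:///data/months", "hdfs:///data/months")

    for (path <- sources) {
      val start = System.nanoTime()
      // count() forces a full scan of the input, so the elapsed time
      // is dominated by read I/O rather than by computation.
      val rows = spark.read.parquet(path).count()
      val secs = (System.nanoTime() - start) / 1e9
      println(f"$path%-25s rows=$rows elapsed=$secs%.1f s")
    }

    spark.stop()
  }
}
```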

EDIT

Here is the file information on HDFS. I have 12 of these files (one per month), each split into two blocks. From my understanding, since I have 2 datanodes, the blocks are distributed over these two datanodes (which also correspond to the Spark workers).

[screenshot]
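
To verify how this block layout translates into Spark tasks, here is a small spark-shell-style sketch (the path and format are illustrative assumptions); `hdfs fsck /path/to/data -files -blocks -locations` additionally shows which datanode holds each block:

```scala
// Check how many input partitions Spark creates for the HDFS folder.
// With 12 files of 2 blocks each, roughly 24 partitions would be expected.
val df = spark.read.parquet("hdfs:///data/months") // illustrative path/format
println(s"Input partitions: ${df.rdd.getNumPartitions}")
```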

Thanks in advance.

Mehdi LAMRANI
    How many partitions does your file have within HDFS? – Michael Heil Oct 07 '20 at 10:23
  • That is a good question by @mike. If your file is copied locally to all worker nodes, then all worker nodes have local access. If your file stored in HDFS is a single partition, then it is not available on all your nodes, depending on the replication factor you have set for HDFS. You can see where the file is located in your cluster in the Hadoop UI. – Simon Schiff Oct 07 '20 at 10:27
  • I think it's an interesting question. I guess most of the time is spent reading the file; do all workers read the entire file or only part of it (not sure)? I think it depends on the file format, compression (some compression formats are not splittable), etc. I guess 3 GB is also relatively small to get the real benefit from HDFS? – Raphael Roth Oct 07 '20 at 18:37
  • @SimonSchiff I edited with HDFS information. I have 12 Files split over 2 Data Nodes – Mehdi LAMRANI Oct 08 '20 at 10:10
  • @mike Edited with HDFS File info – Mehdi LAMRANI Oct 08 '20 at 10:11
  • @SimonSchiff To my understanding, every Spark worker should work locally on the part of the file that is available to it locally and avoid shuffling as much as possible. I was expecting to see a more tangible effect with 12 files split over 2 datanodes, compared to files stored only on the Spark master node. – Mehdi LAMRANI Oct 08 '20 at 10:14

0 Answers