I am trying to query HDFS data that has a lot of part files (Avro). Recently we made a change to reduce parallelism, and as a result the size of the part files has increased: each part file is now in the range of 750 MB to 2 GB (we use Spark Streaming to write data to HDFS in 10-minute intervals, so the size of these files depends on how much data we are processing from the upstream). The number of part files is around 500. I was wondering whether the size and/or number of these part files would play any role in Spark SQL performance?
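For context, the read/query side looks roughly like this; the HDFS path, view name, and column are just placeholders rather than our actual job, and it assumes the spark-avro package is on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object QueryAvroParts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("query-avro-parts")
      .getOrCreate()

    // Load the ~500 Avro part files written by the streaming job
    // (requires spark-avro, e.g. --packages org.apache.spark:spark-avro_2.12:<version>).
    val events = spark.read
      .format("avro")
      .load("hdfs:///data/events/dt=2024-01-01/*")  // placeholder path

    events.createOrReplaceTempView("events")

    // Typical ad-hoc Spark SQL query over the part files.
    spark.sql("SELECT event_type, count(*) FROM events GROUP BY event_type").show()

    spark.stop()
  }
}
```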
I can provide more information if required.