I am trying to query HDFS data that has a lot of part files (Avro). Recently we made a change to reduce parallelism, and as a result the size of the part files has increased: each part file is now in the range of 750 MB to 2 GB (we use Spark Streaming to write data to HDFS in 10-minute intervals, so the size of these files depends on how much data we are processing from the upstream). The number of part files is around 500. I was wondering whether the size and/or number of these part files would play any role in Spark SQL performance?
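For context, the read/query side looks roughly like this; the HDFS path, view name, and column are just placeholders rather than our actual job, and it assumes the spark-avro package is on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object QueryAvroParts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("query-avro-parts")
      .getOrCreate()

    // Load the ~500 Avro part files written by the streaming job
    // (requires spark-avro, e.g. --packages org.apache.spark:spark-avro_2.12:<version>).
    val events = spark.read
      .format("avro")
      .load("hdfs:///data/events/dt=2024-01-01/*")  // placeholder path

    events.createOrReplaceTempView("events")

    // Typical ad-hoc Spark SQL query over the part files.
    spark.sql("SELECT event_type, count(*) FROM events GROUP BY event_type").show()

    spark.stop()
  }
}
```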
I can provide more information if required.