Context: In our data pipeline, we use Spark SQL to run lots of queries supplied by our end users as text files, which we then parameterise.
Situation:
Our queries look like:
INSERT OVERWRITE TABLE ... PARTITION (...)
SELECT
stuff
FROM
sometable
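For illustration, a rough sketch of how a query like this gets run (the file name and the parameter are made up for this example, and it assumes an existing sqlContext as in the snippet further down - our actual pipeline code differs):
# read the user-supplied query template and substitute our parameters into it
with open('user_query.sql') as f:
    query = f.read().format(run_date='2017-01-01')
# execute the parameterised INSERT OVERWRITE statement
sqlContext.sql(query)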
The issue is that, when you look at the results of this, rather than creating a handful of files around the maximum block size, it creates 200 small files (because by default Spark creates 200 shuffle partitions). (For some queries, depending on the input data and the SELECT query, for 200 read a bazillion.) Lots of small files make us unpopular with our sysadmins.
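You can confirm the default that drives this (a one-liner, assuming a pyspark shell with an sqlContext available):
# the shuffle partition count that determines the number of output files
sqlContext.getConf("spark.sql.shuffle.partitions")  # '200' unless overridden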
Attempted fix (that doesn't work)
Plenty of documentation suggests that in this situation you should use DISTRIBUTE BY
to ensure that all the data for a given output partition goes to the same Spark partition, so let's try something like:
INSERT OVERWRITE TABLE ... PARTITION (...)
SELECT
stuff
FROM
sometable
DISTRIBUTE BY
1
So why doesn't this work (tested on Spark 2.0 and Spark 2.2)? It does successfully send all of the data to one reducer - all the actual data is in one big file. But it still creates 200 files, 199 of which are empty! (I'm aware that we should probably DISTRIBUTE BY
our partition columns; this is just the simplest possible example.)
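A quick way to see what is happening (a sketch against the same sqlContext): the shuffled DataFrame still has 200 partitions, and each partition becomes one output file whether or not it holds any rows:
df = sqlContext.sql('SELECT stuff FROM sometable DISTRIBUTE BY 1')
# should report 200 (the shuffle-partition default); 199 of them hold no rows but still get written as files
print(df.rdd.getNumPartitions())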
Fix that does work, but doesn't fit our use case
It is possible to get this to do the right thing by using coalesce
or repartition
, like this (in pyspark
syntax):
select = sqlContext.sql('''SELECT stuff FROM sometable''').coalesce(1)
select.write.insertInto(target_table, overwrite=True)
But I don't want to do things this way, as it would mean totally changing the way that users provide us with queries.
I've also seen that we could set:
conf.set("spark.sql.shuffle.partitions","1");
but I haven't tried this, as I don't want to force every calculation in our (fairly complex) queries onto one reducer, only the step that does the final write to disk. (If I shouldn't worry about this, then let me know!)
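For reference, the pyspark equivalent (again assuming the sqlContext from above), and the reason I'm wary of it - the setting applies to every shuffle in the query (joins, aggregations, DISTRIBUTE BY), not only the final write:
# every shuffle in subsequently-run queries now uses a single partition
sqlContext.setConf("spark.sql.shuffle.partitions", "1")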
The question:
- Using only Spark SQL syntax, how do I write a query that writes as few files as possible, and doesn't create lots of empty / small files?
Possibly related:
- merge-multiple-small-files-into-few-larger-files-in-spark (doesn't have the restriction that the solution has to be Spark SQL, and as per the above, DISTRIBUTE BY doesn't actually work)
- spark coalesce doesn't work (only for us it does, so that's not the problem)