Context: In our data pipeline, we use Spark SQL to run lots of queries supplied by our end users as text files, which we then parameterise.
Situation:
Our queries look like:
INSERT OVERWRITE TABLE ... PARTITION (...)
SELECT
stuff
FROM
sometable
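For illustration, a rough sketch of how a query like this gets run (the file name and the parameter are made up for this example, and it assumes an existing sqlContext as in the snippet further down - our actual pipeline code differs):
# read the user-supplied query template and substitute our parameters into it
with open('user_query.sql') as f:
    query = f.read().format(run_date='2017-01-01')
# execute the parameterised INSERT OVERWRITE statement
sqlContext.sql(query)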
The issue is that, when you look at the results of this, rather than creating a handful of files around the maximum block size, it creates 200 small files (because by default Spark creates 200 shuffle partitions). (For some queries, depending on the input data and the SELECT query, for 200 read a bazillion.) Lots of small files make us unpopular with our sysadmins.
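You can confirm the default that drives this (a one-liner, assuming a pyspark shell with an sqlContext available):
# the shuffle partition count that determines the number of output files
sqlContext.getConf("spark.sql.shuffle.partitions")  # '200' unless overridden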
Attempted fix (that doesn't work)
Plenty of documentation suggests that in this situation you should use DISTRIBUTE BY
to ensure that all the data for a given output partition goes to the same Spark partition, so let's try something like:
INSERT OVERWRITE TABLE ... PARTITION (...)
SELECT
stuff
FROM
sometable
DISTRIBUTE BY
1
So why doesn't this work (tested on Spark 2.0 and Spark 2.2)? It does successfully send all of the data to one reducer - all the actual data is in one big file. But it still creates 200 files, 199 of which are empty! (I'm aware that we should probably DISTRIBUTE BY
our partition columns; this is just the simplest possible example.)
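A quick way to see what is happening (a sketch against the same sqlContext): the shuffled DataFrame still has 200 partitions, and each partition becomes one output file whether or not it holds any rows:
df = sqlContext.sql('SELECT stuff FROM sometable DISTRIBUTE BY 1')
# should report 200 (the shuffle-partition default); 199 of them hold no rows but still get written as files
print(df.rdd.getNumPartitions())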
Fix that does work, but doesn't fit our use case
It is possible to get this to do the right thing by using coalesce
or repartition
, like this (in pyspark
syntax):
select = sqlContext.sql('''SELECT stuff FROM sometable''').coalesce(1)
select.write.insertInto(target_table, overwrite=True)
But I don't want to do things this way, as it would mean totally changing the way that users provide us with queries.
I've also seen that we could set:
conf.set("spark.sql.shuffle.partitions","1");
but I haven't tried this, as I don't want to force every calculation in our (fairly complex) queries onto one reducer, only the step that does the final write to disk. (If I shouldn't worry about this, then let me know!)
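For reference, the pyspark equivalent (again assuming the sqlContext from above), and the reason I'm wary of it - the setting applies to every shuffle in the query (joins, aggregations, DISTRIBUTE BY), not only the final write:
# every shuffle in subsequently-run queries now uses a single partition
sqlContext.setConf("spark.sql.shuffle.partitions", "1")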
The question:
- Using only Spark SQL syntax, how do I write a query that writes as few files as possible, and doesn't create lots of empty / small files?
Possibly related:
- merge-multiple-small-files-into-few-larger-files-in-spark (doesn't have the restriction that the solution has to be Spark SQL, and as per the above, DISTRIBUTE BY doesn't actually work)
- spark coalesce doesn't work (only for us it does, so that's not the problem)