
Context: In our data pipeline, we use Spark SQL to run lots of queries that are supplied by our end users as text files, which we then parameterise.

Situation:

Our queries look like:

INSERT OVERWRITE TABLE ... PARTITION (...)

SELECT 
  stuff
FROM
  sometable

The issue is that, when you look at the results of this, rather than creating a bunch of files around the maximum block size, it creates 200 small files (because by default Spark creates 200 shuffle partitions). (For some queries, depending on the input data and the SELECT query, for 200 read a bazillion.) Lots of small files make us unpopular with our sysadmins.

Attempted fix (that doesn't work)

Plenty of documentation suggests that in this situation you should use DISTRIBUTE BY to ensure that all the data for a given output partition goes to the same reducer, so let's try something like:

INSERT OVERWRITE TABLE ... PARTITION (...)

SELECT 
  stuff
FROM
  sometable
DISTRIBUTE BY
  1

So why doesn't this work (tested on Spark 2.0 and Spark 2.2)? It does successfully send all of the data to one reducer, and all the actual data ends up in one big file. But it still creates 200 files, of which 199 are empty! (I'm aware that we should probably DISTRIBUTE BY our partition columns, but this is to provide the simplest possible example.)

Fix that does work, but doesn't fit our use case

It is possible to get this to do the right thing by using coalesce or repartition, thus (in PySpark syntax):

select = sqlContext.sql('''SELECT stuff FROM sometable''').coalesce(1)
select.write.insertInto(target_table, overwrite=True)

But I don't want to do things this way, as it would mean totally changing the way that users provide us with their queries.

I've also seen that we could set:

conf.set("spark.sql.shuffle.partitions","1");

but I haven't tried it, as I don't want to force all the calculations in our (fairly complex) queries to happen on one reducer, only the stage that does the final write to disk. (If I shouldn't worry about this, then let me know!)

The question:

  • Using only Spark SQL syntax, how do I write a query that writes as few files as possible and doesn't create lots of empty/small files?

Possibly related:

  • Doesn't answer your question, but we had a similar problem. We solved it with two jobs. One computed the result in Spark and wrote it out to a "staging" location. The next job, a triggered bash script, picked up the files, concatenated them, delivered them to the necessary location, and then cleaned up the staging area. The bash script was extremely fast. Not very elegant, but it definitely worked. – Jeremy Oct 26 '17 at 18:08

4 Answers


Starting from Spark 2.4 you can add a hint to the query to coalesce or repartition the final SELECT. For example:

INSERT OVERWRITE TABLE ... PARTITION (...) 
SELECT /*+ REPARTITION(5) */ client_id, country FROM mytable;

This would generate 5 files.

Prior to Spark 2.4, and with a probable impact on query performance, you could have set spark.sql.shuffle.partitions to the desired number of files.
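
In case it helps, a minimal PySpark sketch of both options, assuming an existing SparkSession named spark; target_table, mytable and the partition column country are placeholders, and dynamic partition overwrite is assumed to be configured:

# Spark 2.4+: the hint lives inside the SQL text itself, so a pipeline that
# runs user-supplied query files does not need to change.
# (target_table, mytable and country are placeholder names.)
spark.sql("""
    INSERT OVERWRITE TABLE target_table PARTITION (country)
    SELECT /*+ REPARTITION(5) */ client_id, country FROM mytable
""")

# Before Spark 2.4: cap the number of shuffle partitions (and hence output
# files) for the session; note this also affects every other shuffle.
spark.conf.set("spark.sql.shuffle.partitions", "5")
spark.sql("""
    INSERT OVERWRITE TABLE target_table PARTITION (country)
    SELECT client_id, country FROM mytable
""")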

– Madhava Carrillo

(I'm aware that we should probably DISTRIBUTE BY our partition columns, but this is to provide the simplest possible example)

So it would seem that my attempt to simplify things is where I went wrong. If I DISTRIBUTE BY actual columns rather than the artificial 1 (e.g. DISTRIBUTE BY load_date or whatever), then it doesn't create the empty files. Why? Who knows...

(This also matches this answer on the merge-multiple-small-files-in-to-few-larger-files-in-spark thread)
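
For concreteness, the query shape that behaves looks like this, with load_date standing in for whichever column(s) the target table is actually partitioned by (the elided parts are as in the question):

INSERT OVERWRITE TABLE ... PARTITION (...)

SELECT 
  stuff
FROM
  sometable
DISTRIBUTE BY
  load_date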

– user3805082

That's been a really annoying problem for me for a while, and it took me a while to get around it.

The following 2 approaches worked for me:

  1. Running it externally as a beeline script:
     set hive.exec.dynamic.partition.mode=nonstrict;
     set hive.merge.mapfiles=true;
     set hive.merge.mapredfiles=true;
     set hive.merge.smallfiles.avgsize=64512000;
     set hive.merge.size.per.task=12992400;
     set hive.exec.max.dynamic.partitions=2048;
     set hive.exec.max.dynamic.partitions.pernode=1024;

     <insert overwrite command>

The problem with this approach is that, for some reason, it does not work from inside PySpark; it can, however, be run as an external beeline script from within a Python script.
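
For reference, a sketch of how the beeline script can be driven from a Python script; the JDBC URL and script path below are placeholders for your environment:

import subprocess

# Run the settings + INSERT OVERWRITE shown above as an external beeline script.
subprocess.run(
    [
        "beeline",
        "-u", "jdbc:hive2://hiveserver2-host:10000/default",  # placeholder JDBC URL
        "-f", "insert_overwrite_with_merge_settings.hql",     # placeholder script path
    ],
    check=True,  # raise if beeline exits with a non-zero status
)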

  2. Using repartition

I found this option to be quite a good one. repartition(x) allows you to compress the records of a PySpark DataFrame into x files.

Now, as it's not possible to come up with a single static number x that suits every table (table sizes vary; e.g. I would not want to repartition a table with 10 million records into 1 file), I do the following.

  • Set an upper threshold for the maximum number of records a partition should hold. I use 100,000.

  • Compute x as:
    x = df.count() / max_num_records_per_partition
    If the table is partitioned, I use df_partition instead of df, i.e. for every set of partition values I filter df_partition out of df and then compute x from df_partition.

  • Repartition as:
    df = df.repartition(x)
    If the table is partitioned, I use df_partition = df_partition.repartition(x).

  • Insert overwrite the DataFrame.

This approach works quite conveniently for me.
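
Putting those steps together, a rough end-to-end sketch; the threshold value and table names are the same placeholders as above, and I round up and floor at 1 so repartition always gets a valid positive integer:

import math

max_num_records_per_partition = 100000   # the upper threshold chosen above

df = sqlContext.sql('''SELECT stuff FROM sometable''')   # or df_partition for a partitioned table
x = max(1, math.ceil(df.count() / max_num_records_per_partition))

# compress the result into x files before the insert overwrite
df.repartition(x).write.insertInto(target_table, overwrite=True)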

To go a step further, the number of columns in the table and the datatype of each column can be used to create weights, which give a much better estimate of the repartition number for a given DataFrame (e.g. a DataFrame with 20 columns would get a higher weight than a DataFrame with 5 columns of the same types; a DataFrame with one column of type Map would get a higher weight than a DataFrame with one column of type Boolean).
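
A very rough illustration of that weighting idea, reusing max_num_records_per_partition and math from the sketch above; the weight values below are arbitrary assumptions, not measured numbers:

from pyspark.sql.types import BooleanType, MapType, StringType

def row_weight(schema):
    # heavier weight per column for wider / more complex types; arbitrary guesses
    def col_weight(dt):
        if isinstance(dt, MapType):
            return 4.0
        if isinstance(dt, StringType):
            return 2.0
        if isinstance(dt, BooleanType):
            return 0.5
        return 1.0
    return sum(col_weight(field.dataType) for field in schema.fields)

# wider rows get fewer records per file, i.e. proportionally more output files
records_per_file = max_num_records_per_partition / row_weight(df.schema)
x = max(1, math.ceil(df.count() / records_per_file))
df = df.repartition(x)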

– Dharman

There are two ways; coalesce is better than repartition here, since the COALESCE hint avoids a full shuffle.
Read this topic: https://kontext.tech/article/1155/use-spark-sql-partitioning-hints

INSERT OVERWRITE TABLE ... PARTITION (...) 
SELECT /*+ REPARTITION(5) */ client_id, country FROM mytable;

INSERT OVERWRITE TABLE ... PARTITION (...) 
SELECT /*+ COALESCE(5) */ client_id, country FROM mytable;
– seunggabi