I am running a Spark application using SparkSQL. How do I merge small files? I know about `.repartition` and `.coalesce`, but these can't be called from SparkSQL itself.

OneCricketeer
justcode
  • `spark.sql` returns a dataframe, which can indeed be coalesced and repartitioned before being written to a different location – OneCricketeer Oct 11 '18 at 04:44
  • How would I do this if the sql inside is a CTAS? `spark.sql("create table as select....")` – justcode Oct 11 '18 at 04:53
  • What size files is that making currently? How many files? – OneCricketeer Oct 11 '18 at 05:26
  • 1-5MB files and it is generating 20000 files – justcode Oct 11 '18 at 05:34
  • I think the general recommendation would be to use `spark.sql`, get a dataframe, then `df.write` after a coalesce, to output some Parquet (or ORC) data, then run `create external table` with the location you wrote to – OneCricketeer Oct 11 '18 at 05:46
  • Yes that's exactly what I'm doing. I was just curious how to do it directly using spark.sql – justcode Oct 11 '18 at 06:10
  • I don't really know. There is a whole bunch of `hive.merge.*` properties. Have you tried them as per what the other question shows? https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties – OneCricketeer Oct 11 '18 at 06:51
  • df.write is part of Spark SQL uiatm. – thebluephantom Oct 11 '18 at 07:46
  • If you need to do this incrementally, then you need to consider that as well. You get to the point that, imo, only the newer data needs to be merged, not all the old data as well. I think cricket_007 alludes to that as well. – thebluephantom Oct 11 '18 at 09:16
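
The workflow suggested in the comments (run the query with `spark.sql`, coalesce the resulting dataframe, write Parquet, then register a table on that location) could look roughly like the sketch below. The table name `source_table`, the output path, the 3 MB average file size and the 128 MB target file size are all assumptions for illustration, not values from the question. The table is registered with the data source syntax (`USING parquet LOCATION ...`) so the schema is read from the files; a Hive-style `CREATE EXTERNAL TABLE` would need an explicit column list instead.

```scala
import org.apache.spark.sql.SparkSession

object CompactSmallFiles {
  // Rough number of output files for a desired file size (ceiling division, at least 1).
  def targetPartitions(totalBytes: Long, targetFileBytes: Long): Int =
    math.max(1L, (totalBytes + targetFileBytes - 1) / targetFileBytes).toInt

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-small-files")
      .enableHiveSupport() // assumption: a Hive metastore is available to persist the table
      .getOrCreate()

    // Hypothetical names, replace with your own query and location.
    val outputPath = "/warehouse/merged_table"
    val df = spark.sql("SELECT * FROM source_table")

    // Assumption: ~20000 files averaging ~3 MB each, aiming for ~128 MB per output file.
    val n = targetPartitions(20000L * 3L * 1024 * 1024, 128L * 1024 * 1024)

    // coalesce(n) merges partitions without a full shuffle, so at most n files are written.
    df.coalesce(n).write.mode("overwrite").parquet(outputPath)

    // Register an unmanaged table on top of the files that were just written.
    spark.sql(s"CREATE TABLE IF NOT EXISTS merged_table USING parquet LOCATION '$outputPath'")

    spark.stop()
  }
}
```

If the data is skewed, `repartition(n)` instead of `coalesce(n)` gives more evenly sized files at the cost of a full shuffle.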

1 Answer

Excerpts from DeepSense engineering blog (2016)

Distribute by and cluster by clauses are really cool features in SparkSQL. Unfortunately, this subject remains relatively unknown to most users.
...

SET spark.sql.shuffle.partitions = 2
SELECT * FROM df DISTRIBUTE BY key

Equivalent in DataFrame API:
df.repartition(2, $"key")

...


Caveat: I cannot testify that it works as advertised; it looked very promising when I found that blog, but it has stayed on my to-do list ever since :-/
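
For the CTAS case raised in the comments, a minimal, untested sketch of how `DISTRIBUTE BY` might be applied inside the statement itself is shown below. The table names, the column `key` and the partition count of 200 are hypothetical. The second helper uses the `COALESCE` SQL hint, which as far as I know is only available from Spark 2.4 onwards, so treat it as an alternative to try rather than a guaranteed fix.

```scala
import org.apache.spark.sql.SparkSession

object CtasWithFewerFiles {
  // CTAS whose final stage is shuffled by `key`; the number of output files is
  // then bounded by spark.sql.shuffle.partitions.
  def ctasDistributeBy(target: String, source: String, key: String): String =
    s"CREATE TABLE $target USING parquet AS SELECT * FROM $source DISTRIBUTE BY $key"

  // CTAS using the COALESCE hint (assumption: Spark 2.4 or later).
  def ctasCoalesceHint(target: String, source: String, n: Int): String =
    s"CREATE TABLE $target USING parquet AS SELECT /*+ COALESCE($n) */ * FROM $source"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ctas-fewer-files").getOrCreate()

    // Same effect as the SET statement in the excerpt, with a hypothetical value.
    spark.conf.set("spark.sql.shuffle.partitions", "200")

    // Hypothetical table and column names.
    spark.sql(ctasDistributeBy("merged_by_key", "source_table", "key"))
    spark.sql(ctasCoalesceHint("merged_by_hint", "source_table", 200))

    spark.stop()
  }
}
```

Note that with `DISTRIBUTE BY` a column with few distinct values produces few, possibly very uneven, files, and on Spark 3.x adaptive query execution may merge shuffle partitions further, so the resulting file count is worth checking on a sample first.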

Samson Scharfrichter