
I am using Hive through Spark. I have an INSERT INTO partitioned table query in my Spark code. The input data is 200+ GB. When Spark writes to a partitioned table, it is spitting out very small files (files in KBs), so the output partitioned table folder now has 5000+ small KB files. I want to merge these into a few larger files of about 200 MB each. I tried using the Hive merge settings, but they don't seem to work.

// enable dynamic partitioning for the INSERT below
val result7A = hiveContext.sql("SET hive.exec.dynamic.partition=true")
val result7B = hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// Hive merge settings for combining small output files
val result7C = hiveContext.sql("SET hive.merge.size.per.task=256000000")
val result7D = hiveContext.sql("SET hive.merge.mapfiles=true")
val result7E = hiveContext.sql("SET hive.merge.mapredfiles=true")
val result7F = hiveContext.sql("SET hive.merge.sparkfiles=true")

// register the JSON SerDe jar
val result7G = hiveContext.sql("SET hive.aux.jars.path=c:\\Applications\\json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar")

// dynamic-partition insert into the target table
val result8 = hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) SELECT a, b, c FROM partition_json_table")

The above Hive settings work in a MapReduce Hive execution and produce files of the specified size. Is there any option to do this in Spark or Scala?

dheee
  • So you are currently getting many files per day partition, and you just want a few? Or do you want to merge multiple days into a single file? – maxymoo Jun 24 '15 at 04:16
  • @maxymoo I am getting many small files (KB files) per day partition, and I just want a few large files (in MBs). – dheee Jun 24 '15 at 17:14
  • Hi @user3267086, could you solve this problem of small files? I tried the hc.sql("bla bla").coalesce(10) method but it does nothing; I still see 200 small files of around 20 MB. – Umesh K Jul 07 '15 at 17:19

3 Answers


I had the same issue. The solution was to add a DISTRIBUTE BY clause with the partition columns. This ensures that the data for one partition goes to a single reducer. Example in your case:

INSERT INTO TABLE partition_table PARTITION (date) SELECT a, b, c FROM partition_json_table DISTRIBUTE BY date
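
In Spark code, a minimal sketch of the same idea (assuming the hiveContext and table names from the question):

// dynamic partitioning must be enabled for the PARTITION (date) insert
hiveContext.sql("SET hive.exec.dynamic.partition=true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// DISTRIBUTE BY date sends all rows for one date to the same reducer,
// so each partition directory is written by a single task
hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) SELECT a, b, c FROM partition_json_table DISTRIBUTE BY date")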
Jussi Kujala
  • Hey @Jussi Kujala, thanks a lot, this seems to work for me. But I have one question: what if my table is partitioned by more than one column... will this work in that case also? – Hemakshi Sachdev Jun 13 '18 at 07:49
  • Where does the `DISTRIBUTE BY` syntax come from? I cannot find it in Spark's documentation. – botchniaque May 09 '23 at 10:14

You may want to try using the DataFrame.coalesce method; it returns a DataFrame with the specified number of partitions (each of which becomes a file on insertion). So using the number of records you are inserting and the typical size of each record, you can estimate how many partitions to coalesce to if you want files of ~200MB.
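
For example, a rough sketch of that estimate (the 200 GB input size comes from the question; the temp-table name is made up for illustration):

// ~200 GB of input at ~200 MB per output file gives roughly 1024 files
val totalBytes = 200L * 1024 * 1024 * 1024
val targetFileBytes = 200L * 1024 * 1024
val numParts = math.max(1, (totalBytes / targetFileBytes).toInt)

// coalesce() narrows to numParts partitions without a full shuffle
val df = hiveContext.sql("SELECT a, b, c FROM partition_json_table")
df.coalesce(numParts).registerTempTable("coalesced_input")
hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) SELECT a, b, c FROM coalesced_input")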

zweiterlinde
  • Hi @zweiterlinde, I tried the hc.sql("bla bla").coalesce(10) method but it does nothing; I still see 200 small files of around 20 MB. – Umesh K Jul 07 '15 at 17:18
  • I'd need a longer code sample to really be able to comment, but in my toy experiments calling `df.write.parquetFile(...)` resulted in many part files, but `df.coalesce(1).write.parquetFile(...)` resulted in one. – zweiterlinde Jul 07 '15 at 17:42

The DataFrame `repartition(1)` method works in this case.
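
A minimal sketch, again assuming the hiveContext and table names from the question (the temp-table name is made up for illustration):

val df = hiveContext.sql("SELECT a, b, c FROM partition_json_table")
// repartition(1) forces a full shuffle into a single partition, so the
// insert is written by one task: at most one file per partition directory
df.repartition(1).registerTempTable("repartitioned_input")
hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) SELECT a, b, c FROM repartitioned_input")

Note that unlike coalesce, repartition always shuffles the full dataset, which costs more but balances the data evenly across tasks.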

gszecsenyi