
I have a Hive insert-into query that creates new Hive partitions. The table has two partition columns, server and date. I execute the insert query with the following code and then try to save the result:

DataFrame dframe = hiveContext.sql("insert into summary1 partition(server='a1',date='2015-05-22') select * from sourcetbl bla bla");
// The query above creates ORC files at /user/db/a1/20-05-22.
// I want only one part-00000 file at the end of this query, so I tried the following, and none of them worked:
dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1");
// or
dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1");
// or
dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22");
// or
dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22");

Whether I use coalesce or repartition, the query above still creates around 200 small files of about 20 MB each at /user/db/a1/20-05-22. I want only one part-00000 file, for performance reasons when the data is read through Hive. I thought that calling coalesce(1) would produce a single final part file, but that does not seem to happen. Am I wrong?
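
In other words, what I expected to work is something along these lines (just a sketch; the select and the path mirror my example above):

// Run the SELECT itself, so the DataFrame actually holds the rows to write,
// then collapse to a single partition so only one part file is produced.
DataFrame data = hiveContext.sql("select * from sourcetbl bla bla");
data.coalesce(1)
    .write()
    .format("orc")
    .mode(SaveMode.Overwrite)
    .save("/user/db/a1/20-05-22");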

Umesh K

2 Answers


Repartitioning controls how many pieces the data is split into during the Spark job; the actual saving of the files, however, is handled by the Hadoop cluster.

Or at least that's how I understand it. You can also see the same question answered here: http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3CCA+2Pv=hF5SGC-SWTwTMh6zK2JeoHF1OHPb=WG94vp2GW-vL5SQ@mail.gmail.com%3E

This shouldn't matter, though; why are you set on a single file? If it's just for your own system, getmerge will stitch the parts together for you.
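
If you want to do the merge programmatically instead of from the shell, here is a minimal sketch using Hadoop 2.x's FileUtil.copyMerge (the paths are just examples). Note that, like getmerge, this is a raw byte concatenation: it suits line-oriented formats, and will not produce a valid single ORC file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Concatenate every part-* file in the output directory into one file;
        // 'false' keeps the original part files in place.
        FileUtil.copyMerge(
            fs, new Path("/user/db/a1/20-05-22"),
            fs, new Path("/user/db/a1/20-05-22-merged"),
            false, conf, null);
    }
}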

ApolloFortyNine
  • Hi, my problem is explained here: http://stackoverflow.com/questions/25967961/spark-cut-down-no-of-output-files?lq=1 where it is mentioned that multiple small part files can overload the namenode. I have tried the same thing, but it looks like coalesce is not reducing the part files for a DataFrame. – Umesh K Jul 10 '15 at 20:50
  • 200 is not going to slow it down at all. Not even 2000. You're fine. – ApolloFortyNine Jul 11 '15 at 00:52
  • Let's say my Spark job runs every day and creates ten thousand small files of about 20 MB each; this puts unnecessary load on the Hadoop namenode, and within a few weeks the namenode will run out of metadata space if my Spark job keeps creating so many small files. – Umesh K Jul 11 '15 at 05:54
  • Please help me understand how the namenode won't run out of memory if my Spark job creates ten thousand small files every day. – Umesh K Jul 12 '15 at 11:17
  • Can you please share a few links? I am new to Hadoop, and I have been told that a Spark job creating 10k small files every day would overwhelm the HDFS namenode. – Umesh K Jul 13 '15 at 15:47

df.coalesce(1) worked for me in Spark 2.1.1, so anyone seeing this page doesn't have to worry like I did.

df.coalesce(1).write.format("parquet").save("a.parquet") 
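
For the Java API used in the question, the equivalent would be roughly this (a sketch; spark is assumed to be an existing SparkSession, and the paths are just examples):

// Coalesce to a single partition so the write produces one part file.
Dataset<Row> df = spark.read().format("orc").load("/user/db/a1/20-05-22");
df.coalesce(1).write().mode(SaveMode.Overwrite).format("parquet").save("a.parquet");

Keep in mind that coalesce(1) funnels the entire write through a single task, so it can be slow for large outputs.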
ruseel