
I was doing a UNION ALL of two tables in Databricks Spark Scala. But when I write the resultant dataframe into a Delta table, it takes a very long time to complete. Below is the code snippet I was using.

val df1 = spark.sql(
  s"""
  select
    nb.*,
    facl.id as facl_decn_id,
    facl.seq_nbr as facl_decn_seq_nbr,
    case when facl.id is not null
         then concat(substr(nb.open_dttm, 1, 10), ' 00:00:00.000')
         else cast(null as string)
    end as eng_dt
  from Interim.Interim_nobed nb
  left outer join Interim.facl_decn_interim facl_bed on
    (nb.id = facl.id and nb.facl_decn_id = facl.id)
  where nb.facl_decn_id is not null""")

val df2 = spark.sql(
  s"""
  select
    nb.*,
    cast(null as int) as facl_decn_id,
    cast(null as int) as facl_decn_seq_nbr,
    cast(null as string) as eng_dt
  from Interim.Interim_nobed nb
  where nb.facl_decn_id is null
  """)

val df3 = df1.coalesce(150)
val df4 = df2.coalesce(150)
val df5 = df3.unionAll(df4)
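
As a quick sanity check, counting rows per partition right before the write shows whether the union's partitions are balanced (a diagnostic sketch, assuming the same Spark session; not part of the original code):

import org.apache.spark.sql.functions.{spark_partition_id, desc}

// A few partitions far larger than the rest would explain
// a handful of slow tasks at the end of the stage.
println(df5.rdd.getNumPartitions)
df5.groupBy(spark_partition_id()).count().orderBy(desc("count")).show(10)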

Here df5 holds the final UNION ALL result. I then tried to write the dataframe into a Delta table, and it spends most of its time on a few tasks at the end of one stage.

df5.write.mode("overwrite").format("delta").saveAsTable(s"Interim.set1_interim")
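
A few straggler tasks at the end of a stage usually point to data skew rather than to the write itself. Here is a minimal sketch of two common mitigations, assuming skew is in fact the cause. First, a column-less `repartition(n)` does a round-robin shuffle that evens out partition sizes, whereas `coalesce` only merges existing partitions and cannot split a skewed one:

// Round-robin shuffle before the write; 150 reused from the coalesce above.
df5.repartition(150)
  .write
  .mode("overwrite")
  .format("delta")
  .saveAsTable("Interim.set1_interim")

If repartitioning alone does not help (as the comment thread below suggests), the skew may sit in the join itself. On Spark 3.x / recent Databricks runtimes, adaptive query execution can split skewed join partitions automatically:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")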

UPDATE: The Interim.Interim_nobed table has 30 columns, and the key columns are id and facl_decn_id. When I checked the counts of these two columns using a group by, I found that roughly 90% of the facl_decn_id values are NULL relative to the total row count.

The table Interim.facl_decn_interim has only two columns, id and bed_day_decn_seq_nbr. A group by on both columns showed that a few seq_nbr values have far higher counts than the others.
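
The checks described in this update might look like the following (a sketch reconstructing the group-by counts from the description; table and column names are taken from the post):

// Share of NULLs in the join key: ~90% of rows take the df2 branch.
spark.sql("""
  select
    count(*) as total_cnt,
    sum(case when facl_decn_id is null then 1 else 0 end) as null_cnt
  from Interim.Interim_nobed""").show()

// Distribution of the right-side key: a few hot id values here would
// concentrate the join output in a handful of tasks.
spark.sql("""
  select id, count(*) as cnt
  from Interim.facl_decn_interim
  group by id
  order by cnt desc""").show(10)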

venkat
  • how sure are you about the `.coalesce(150)`? it will reshuffle the data and take a long time depending on the size. – samkart May 26 '22 at 09:42
  • The tables are very large, so I tried with 150. – venkat May 26 '22 at 10:23
  • have you tried the code without the `.coalesce()`? – samkart May 26 '22 at 10:24
  • Yes, same issue even then. I also tried with repartition. – venkat May 26 '22 at 11:15
  • 1. `facl.id`: what is `facl`? In the FROM clause we see `left outer join Interim.facl_decn_interim facl_bed on ...`. 2. Please show the columns in `Interim.Interim_nobed nb` (the `nb.*`). 3. Given `where nb.facl_decn_id is not null` together with `facl.id as facl_decn_id`, you have two columns named "facl_decn_id" in the dataframe; that is not good. – mvasyliv May 26 '22 at 18:25
  • Please run 1. `val df1 = spark.table("Interim.Interim_nobed")` 2. `val df2 = spark.table("Interim.facl_decn_interim")` and show the results of `df1.printSchema()` and `df2.printSchema()`. – mvasyliv May 26 '22 at 18:30
  • Hi mvasyliv, I have updated the question, please check. – venkat May 26 '22 at 18:56

0 Answers