I am doing a UNION ALL of two tables in Databricks (Spark Scala). When I write the resulting DataFrame into a Delta table, it takes a very long time to complete. Below is the code snippet I am using.
val df1 = spark.sql(
  s"""
  select
    nb.*,
    facl.id as facl_decn_id,
    facl.seq_nbr as facl_decn_seq_nbr,
    case when facl.id is not null
         then concat(substr(nb.open_dttm, 1, 10), ' 00:00:00.000')
         else cast(null as string)
    end as eng_dt
  from Interim.Interim_nobed nb
  left outer join Interim.facl_decn_interim facl
    on (nb.id = facl.id and nb.facl_decn_id = facl.id)
  where nb.facl_decn_id is not null""")
val df2 = spark.sql(
  s"""
  select
    nb.*,
    cast(null as int) as facl_decn_id,
    cast(null as int) as facl_decn_seq_nbr,
    cast(null as string) as eng_dt
  from Interim.Interim_nobed nb
  where nb.facl_decn_id is null
  """)
val df3 = df1.coalesce(150)
val df4 = df2.coalesce(150)
val df5 = df3.unionAll(df4)
Here df5 holds the final UNION ALL result. I then tried to write the DataFrame into a Delta table, and it spends most of its time on a few tasks at the end of one stage.
df5.write.mode("overwrite").format("delta").saveAsTable(s"Interim.set1_interim")
UPDATE: The Interim.Interim_nobed table has 30 columns, and the key columns are id and facl_decn_id. When I checked the counts of these two columns with a group by, I found that about 90% of the facl_decn_id values are NULL compared with the total count.
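For illustration, the NULL-ratio check above can be sketched in plain Scala (no Spark session; the sample data below is hypothetical). On the cluster the equivalent would be a `groupBy("facl_decn_id").count()` over the table:

```scala
// Plain-Scala analogue of the group-by NULL check: compute what
// fraction of rows have a null join key. A high fraction means most
// rows share the same key (NULL), which skews the shuffle.
def nullFraction(keys: Seq[Option[Int]]): Double =
  if (keys.isEmpty) 0.0
  else keys.count(_.isEmpty).toDouble / keys.size

// Hypothetical sample mirroring the observed ~90% NULL skew:
// 1 non-null key and 9 nulls out of 10 rows.
val sample = Seq(Some(1)) ++ Seq.fill(9)(None)
println(nullFraction(sample)) // 0.9
```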
The table Interim.facl_decn_interim has only two columns, id and bed_day_decn_seq_nbr. A group by on both columns showed that a few seq_nbr values have much higher counts than the others.
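The frequency check described above can likewise be sketched in plain Scala (hypothetical sample values, not the real table) to show what "a few values have much higher counts" looks like; on the cluster this is a `groupBy("seq_nbr").count()` sorted descending:

```scala
// Plain-Scala analogue of the seq_nbr frequency check: count rows per
// key and surface the heaviest keys. A few dominant keys are what
// produce the straggler tasks at the end of the stage.
def keyCounts[K](keys: Seq[K]): Map[K, Int] =
  keys.groupBy(identity).map { case (k, v) => k -> v.size }

// Hypothetical sample: one seq_nbr value dominates the others.
val seqNbrs = Seq(1, 1, 1, 1, 2, 3)
val heaviest = keyCounts(seqNbrs).toSeq.sortBy(-_._2)
println(heaviest.head) // (1,4): one value holds most of the rows
```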