
I was doing a UNION ALL of two tables in Databricks Spark Scala. But when I write the resultant dataframe into a Delta table, it takes a very long time to complete. Below is the code snippet I was using.

val df1 = spark.sql(
  s"""
  select
    nb.*,
    facl.id as facl_decn_id,
    facl.seq_nbr as facl_decn_seq_nbr,
    case when facl.id is not null
         then concat(substr(nb.open_dttm, 1, 10), ' 00:00:00.000')
         else cast(null as string)
    end as eng_dt
  from Interim.Interim_nobed nb
  left outer join Interim.facl_decn_interim facl_bed on
    (nb.id = facl.id and nb.facl_decn_id = facl.id)
  where nb.facl_decn_id is not null""")

val df2 = spark.sql(
  s"""
  select
    nb.*,
    cast(null as int) as facl_decn_id,
    cast(null as int) as facl_decn_seq_nbr,
    cast(null as string) as eng_dt
  from Interim.Interim_nobed nb
  where nb.facl_decn_id is null
  """)

val df3 = df1.coalesce(150)
val df4 = df2.coalesce(150)
val df5 = df3.unionAll(df4)
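
As a quick sanity check, counting rows per partition right before the write shows whether the union's partitions are balanced (a diagnostic sketch, assuming the same Spark session; not part of the original code):

import org.apache.spark.sql.functions.{spark_partition_id, desc}

// A few partitions far larger than the rest would explain
// a handful of slow tasks at the end of the stage.
println(df5.rdd.getNumPartitions)
df5.groupBy(spark_partition_id()).count().orderBy(desc("count")).show(10)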

Here df5 holds the final UNION ALL result. I then tried to write the dataframe into a Delta table, and it spends most of its time on a few tasks at the end of one stage.

df5.write.mode("overwrite").format("delta").saveAsTable(s"Interim.set1_interim")
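
A few straggler tasks at the end of a stage usually point to data skew rather than to the write itself. Here is a minimal sketch of two common mitigations, assuming skew is in fact the cause. First, a column-less `repartition(n)` does a round-robin shuffle that evens out partition sizes, whereas `coalesce` only merges existing partitions and cannot split a skewed one:

// Round-robin shuffle before the write; 150 reused from the coalesce above.
df5.repartition(150)
  .write
  .mode("overwrite")
  .format("delta")
  .saveAsTable("Interim.set1_interim")

If repartitioning alone does not help (as the comment thread below suggests), the skew may sit in the join itself. On Spark 3.x / recent Databricks runtimes, adaptive query execution can split skewed join partitions automatically:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")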

UPDATE: The Interim.Interim_nobed table has 30 columns, and the key columns are id and facl_decn_id. When I checked the counts of these two columns using a group by, I found that roughly 90% of the facl_decn_id values are NULL relative to the total row count.

The table Interim.facl_decn_interim has only two columns, id and bed_day_decn_seq_nbr. A group by on both columns showed that a few seq_nbr values have far higher counts than the others.
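
The checks described in this update might look like the following (a sketch reconstructing the group-by counts from the description; table and column names are taken from the post):

// Share of NULLs in the join key: ~90% of rows take the df2 branch.
spark.sql("""
  select
    count(*) as total_cnt,
    sum(case when facl_decn_id is null then 1 else 0 end) as null_cnt
  from Interim.Interim_nobed""").show()

// Distribution of the right-side key: a few hot id values here would
// concentrate the join output in a handful of tasks.
spark.sql("""
  select id, count(*) as cnt
  from Interim.facl_decn_interim
  group by id
  order by cnt desc""").show(10)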

venkat
  • how sure are you about the `.coalesce(150)`? it will reshuffle the data and take a long time depending on the size. – samkart May 26 '22 at 09:42
  • The tables are very large, so I tried with 150. – venkat May 26 '22 at 10:23
  • have you tried the code without the `.coalesce()`? – samkart May 26 '22 at 10:24
  • Yes, same issue even then. I also tried with repartition. – venkat May 26 '22 at 11:15
  • 1. `facl.id`: what is `facl`? In the FROM clause we see `left outer join Interim.facl_decn_interim facl_bed on ...`. 2. Please show the columns in `Interim.Interim_nobed nb` (the `nb.*`). 3. Given `where nb.facl_decn_id is not null` together with `facl.id as facl_decn_id`, you have two columns named "facl_decn_id" in the dataframe; that is not good. – mvasyliv May 26 '22 at 18:25
  • Please run 1. `val df1 = spark.table("Interim.Interim_nobed")` 2. `val df2 = spark.table("Interim.facl_decn_interim")` and show the results of `df1.printSchema()` and `df2.printSchema()`. – mvasyliv May 26 '22 at 18:30
  • Hi mvasyliv, I have updated the question, please check. – venkat May 26 '22 at 18:56

0 Answers