ScalaSpark - Difference between 2 dataframes - Identify inserts, updates and deletes

Question

I am trying to translate the code below from PySpark to Scala. I can successfully create the DataFrames from the input data.

    from pyspark.sql.functions import col, array, when, array_remove, lit, size, coalesce
    from pyspark.sql.types import *
    
    data1 = [("James","rob","Smith","36636","M",3000),
        ("Michael","Rose","jim","40288","M",4000),
        ("Robert","dunkin","Williams","42114","M",4000),
        ("Maria","Anne","Jones","39192","F",4000),
        ("Jen","Mary","Brown","60563","F",-1)
      ]
    
    data2 = [("James","rob","Smith","36636","M",3000),
        ("Robert","dunkin","Williams","42114","M",2000),
        ("Maria","Anne","Jones","72712","F",3000),
        ("Yesh","Reddy","Brown","75234","M",3000),
        ("Jen","Mary","Brown","60563","F",-1)
      ]
    schema = StructType([
        StructField("firstname",StringType(),True),
        StructField("middlename",StringType(),True),
        StructField("lastname",StringType(),True),
        StructField("id", StringType(), True),
        StructField("gender", StringType(), True),
        StructField("salary", IntegerType(), True)
      ])
     
    df1 = spark.createDataFrame(data=data1,schema=schema)
    df2 = spark.createDataFrame(data=data2,schema=schema)

However, when I convert the line below to Scala, the compiler reports `illegal start of simple expression`:

conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c not in ['firstname','middlename','lastname']]

I suspect that this kind of expression (a list comprehension) is not supported in Scala, but I am new to Scala, so I am not sure.
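For reference, a minimal sketch of how that line might be written in Scala. Scala has no Python-style `[... for c in ... if ...]` syntax, so the filter and the mapping become `filterNot` and `map` on the column-name array. Column inequality is `=!=` (plain `!=` compares the `Column` objects themselves and yields a `Boolean`), and a column is accessed as `df1(c)` rather than `df1[c]`. The object name `DiffConditions` and the helpers `compareCols` and `conditions` are hypothetical names introduced for this illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{lit, when}

object DiffConditions {
  // Key columns excluded from the comparison, as in the question.
  val keyCols: Seq[String] = Seq("firstname", "middlename", "lastname")

  // Hypothetical helper: the Python "if c not in [...]" filter.
  def compareCols(allCols: Array[String]): Array[String] =
    allCols.filterNot(c => keyCols.contains(c))

  // Hypothetical helper: one Column per compared column, holding the column
  // name when the two sides differ and "" otherwise.
  def conditions(df1: DataFrame, df2: DataFrame): Array[Column] =
    compareCols(df1.columns).map { c =>
      when(df1(c) =!= df2(c), lit(c)).otherwise("")
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("diff-conditions")
      .getOrCreate()
    import spark.implicits._

    val cols = Seq("firstname", "middlename", "lastname", "id", "gender", "salary")
    val df1 = Seq(("Robert", "dunkin", "Williams", "42114", "M", 4000)).toDF(cols: _*)
    val df2 = Seq(("Robert", "dunkin", "Williams", "42114", "M", 2000)).toDF(cols: _*)

    // Print the generated column expressions (one per compared column).
    conditions(df1, df2).foreach(println)
    spark.stop()
  }
}
```

A `for (c <- df1.columns if !keyCols.contains(c)) yield ...` comprehension would be an equivalent spelling; the `filterNot`/`map` form is used here only because it mirrors the Python line most directly.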

Here is the rest of the code, where `conditions_` is used to build a select expression:

status = when(df1["id"].isNull(), lit("added")).when(df2["id"].isNull(), lit("deleted")).when(size(array_remove(array(*conditions_), "")) > 0, lit("updated")).otherwise("unchanged")

select_expr = [
    col("firstname"), col("middlename"), col("lastname"),
    *[coalesce(df2[c], df1[c]).alias(c) for c in df2.columns if c not in ['firstname','middlename','lastname']],
    array_remove(array(*conditions_), "").alias("updated_columns"),
    status.alias("status"),
]

The status line also fails to compile, with `identifier expected but string literal found` at `df1["id"]`. Perhaps Scala doesn't support referencing a column like this.

Please let me know what I am doing wrong here.
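For reference, a hedged sketch of how the `status` and `select_expr` parts might translate to Scala. In Scala a column is referenced as `df1("id")`, `isNull` is called without parentheses, and Python's `*list` unpacking becomes `: _*`. The question does not show how the two DataFrames are combined, so the full outer join on the three name columns is an assumption here, and `DataFrameDiff` and `diff` are hypothetical names.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, array_remove, coalesce, col, lit, size, when}

object DataFrameDiff {
  // Key columns used to match rows (the columns the question excludes from comparison).
  val keyCols: Seq[String] = Seq("firstname", "middlename", "lastname")

  // Hypothetical helper, not from the question.
  def diff(df1: DataFrame, df2: DataFrame): DataFrame = {
    val valueCols: Seq[String] = df1.columns.toSeq.filterNot(c => keyCols.contains(c))

    // Column name where the two sides differ, "" otherwise.
    val conditions: Seq[Column] =
      valueCols.map(c => when(df1(c) =!= df2(c), lit(c)).otherwise(""))

    // array(conditions: _*) mirrors Python's array(*conditions_).
    val updatedColumns: Column = array_remove(array(conditions: _*), "")

    val status: Column =
      when(df1("id").isNull, lit("added"))
        .when(df2("id").isNull, lit("deleted"))
        .when(size(updatedColumns) > 0, lit("updated"))
        .otherwise("unchanged")

    val selectExpr: Seq[Column] =
      keyCols.map(k => col(k)) ++
        valueCols.map(c => coalesce(df2(c), df1(c)).alias(c)) ++
        Seq(updatedColumns.alias("updated_columns"), status.alias("status"))

    // Assumption: rows are matched with a full outer join on the key columns.
    df1.join(df2, keyCols, "outer").select(selectExpr: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataframe-diff")
      .getOrCreate()
    import spark.implicits._

    val cols = Seq("firstname", "middlename", "lastname", "id", "gender", "salary")
    // Sample rows copied from the question.
    val df1 = Seq(
      ("James", "rob", "Smith", "36636", "M", 3000),
      ("Michael", "Rose", "jim", "40288", "M", 4000),
      ("Robert", "dunkin", "Williams", "42114", "M", 4000),
      ("Maria", "Anne", "Jones", "39192", "F", 4000),
      ("Jen", "Mary", "Brown", "60563", "F", -1)
    ).toDF(cols: _*)
    val df2 = Seq(
      ("James", "rob", "Smith", "36636", "M", 3000),
      ("Robert", "dunkin", "Williams", "42114", "M", 2000),
      ("Maria", "Anne", "Jones", "72712", "F", 3000),
      ("Yesh", "Reddy", "Brown", "75234", "M", 3000),
      ("Jen", "Mary", "Brown", "60563", "F", -1)
    ).toDF(cols: _*)

    diff(df1, df2).show(truncate = false)
    spark.stop()
  }
}
```

Under these assumptions, a row present only in `df1` (Michael) should come out as `deleted`, a row present only in `df2` (Yesh) as `added`, and rows whose non-key values differ (Robert, Maria) as `updated`. Note that `=!=` yields null when either side is null, so a change from a value to null would not be reported as an update in this sketch.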

You should be more specific. What exactly didn't work for you? - Gabio, Oct 17 '22 at 14:07
@Gabio I have updated the question with specifics, thanks for looking into it. - sinsom, Oct 19 '22 at 18:31
In Scala, in order to access a specific column you should use `df("col_name")`. Please add your Scala code and attach the error you get, otherwise it would be difficult to help here. - Gabio, Oct 19 '22 at 19:45
Thanks, I have broken the question into sub-questions here: https://stackoverflow.com/questions/74134351/coalesce-dynamic-column-list-from-two-datasets - sinsom, Oct 20 '22 at 04:12

0 Answers