I am trying to translate the code below from PySpark to Scala. I am able to create the DataFrames from the input data successfully.
from pyspark.sql.functions import col, array, when, array_remove, lit, size, coalesce
from pyspark.sql.types import *
data1 = [("James","rob","Smith","36636","M",3000),
("Michael","Rose","jim","40288","M",4000),
("Robert","dunkin","Williams","42114","M",4000),
("Maria","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown","60563","F",-1)
]
data2 = [("James","rob","Smith","36636","M",3000),
("Robert","dunkin","Williams","42114","M",2000),
("Maria","Anne","Jones","72712","F",3000),
("Yesh","Reddy","Brown","75234","M",3000),
("Jen","Mary","Brown","60563","F",-1)
]
schema = StructType([
StructField("firstname",StringType(),True),
StructField("middlename",StringType(),True),
StructField("lastname",StringType(),True),
StructField("id", StringType(), True),
StructField("gender", StringType(), True),
StructField("salary", IntegerType(), True)
])
df1 = spark.createDataFrame(data=data1,schema=schema)
df2 = spark.createDataFrame(data=data2,schema=schema)
However, when I convert the line below to Scala, it throws the compile error "illegal start of simple expression":
conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c not in ['firstname','middlename','lastname']]
I suspect that this kind of expression (a list comprehension) is not supported in Scala, but I am new to Scala, so I am not sure.
Here is the rest of the code, where conditions_ is used to build a select expression.
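For reference, Scala has no Python-style list comprehension; the usual counterpart is a filter followed by a map (or a for/yield expression). Below is a minimal sketch of that line, assuming Spark's Scala API (spark-sql 2.4 or later on the classpath) and that df1 and df2 are combined later in the same way as in the PySpark version. The names ComparisonSketch, keyColumns and conditions are made up for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{lit, when}

object ComparisonSketch {
  // Columns that are excluded from the comparison (taken from the question).
  val keyColumns: Seq[String] = Seq("firstname", "middlename", "lastname")

  // Counterpart of the Python list comprehension: filter the column names,
  // then map each remaining name to a Column expression.
  def conditions(df1: DataFrame, df2: DataFrame): Seq[Column] =
    df1.columns.toSeq
      .filterNot(c => keyColumns.contains(c))
      // Column inequality is =!= in Scala; a plain != would yield a Boolean.
      .map(c => when(df1(c) =!= df2(c), lit(c)).otherwise(""))
}
```

The same result could probably be written as for (c <- df1.columns.toSeq if !keyColumns.contains(c)) yield ..., which reads closer to the Python original.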
status = when(df1["id"].isNull(), lit("added")).when(df2["id"].isNull(), lit("deleted")).when(size(array_remove(array(*conditions_), "")) > 0, lit("updated")).otherwise("unchanged")
select_expr =[
col("firstname"),col("middlename"),col("lastname"),
*[coalesce(df2[c], df1[c]).alias(c) for c in df2.columns if c not in ['firstname','middlename','lastname']],
array_remove(array(*conditions_), "").alias("updated_columns"),
status.alias("status"),
]
Even the status line throws a compile error, "identifier expected but string literal found", at df1["id"]. Perhaps Scala doesn't support column references like this.
Please let me know what I am doing wrong here.
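In Scala, square brackets are used for type parameters, so a column is looked up with parentheses, df1("id") or df1.col("id"), and Python's *conditions_ unpacking is written conditions: _*. The following is a sketch of the status and select expressions under those rules, assuming spark-sql 2.4 or later (for array_remove). SelectSketch, keyColumns, and the conditions parameter (standing in for a Scala version of conditions_) are hypothetical names.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{array, array_remove, coalesce, col, lit, size, when}

object SelectSketch {
  // Columns that are passed through unchanged (taken from the question).
  private val keyColumns: Seq[String] = Seq("firstname", "middlename", "lastname")

  // Counterpart of the Python `status` expression.
  def status(df1: DataFrame, df2: DataFrame, conditions: Seq[Column]): Column =
    when(df1("id").isNull, lit("added"))
      .when(df2("id").isNull, lit("deleted"))
      // `conditions: _*` expands the Seq into varargs, like *conditions_ in Python.
      .when(size(array_remove(array(conditions: _*), "")) > 0, lit("updated"))
      .otherwise("unchanged")

  // Counterpart of the Python `select_expr` list.
  def selectExpr(df1: DataFrame, df2: DataFrame, conditions: Seq[Column]): Seq[Column] = {
    val keys = keyColumns.map(c => col(c))
    // Prefer the value from df2 and fall back to df1 for every non-key column.
    val merged = df2.columns.toSeq
      .filterNot(c => keyColumns.contains(c))
      .map(c => coalesce(df2(c), df1(c)).alias(c))
    val updated = array_remove(array(conditions: _*), "").alias("updated_columns")
    (keys ++ merged) :+ updated :+ status(df1, df2, conditions).alias("status")
  }
}
```

The resulting Seq[Column] would then be passed to select with the same varargs syntax, for example joined.select(SelectSketch.selectExpr(df1, df2, conds): _*), where joined and conds are placeholders for the combined DataFrame and the list of conditions.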