I'm still new to PySpark, but I have a PySpark DataFrame I'm trying to manipulate. The data consists of users logging into a device, and I'm creating sessions for each user. Within each array of structs, I want to reset the id of the first element to 0 and increment it by 1 for each element after that. Here's what the schema looks like, followed by what I have so far:
|-- user: string (nullable = true)
|-- logins: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: integer (nullable = true) # ids are not unique
| | |-- start_time: timestamp (nullable = true)
|-- end_time: timestamp (nullable = true)
from pyspark.sql.functions import col, lit, struct, transform

def id_fix():
    return lambda x: transform(x, lambda item: struct(lit(0)).alias("id"))

df.withColumn("corrected", transform(col("logins"), id_fix()))
I can't even set the ids to 0 without getting a data type mismatch: "parameter 1 requires the 'ARRAY' type, however 'namedlambdavariable()' is of 'STRUCT' type".
Thank you!