
I have the following schema:

root 
|-- event_params: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- key: string (nullable = true) 
| | |-- value: struct (nullable = true) 
| | | |-- string_value: string (nullable = true) 
| | | |-- int_value: long (nullable = true) 
| | | |-- float_value: double (nullable = true)

My event_params is an array of structs. Sample Data:

{
  "event_params": [
    {
      "element": {
        "value": {
          "string_value": "LoginVC",
          "float_value": null,
          "double_value": null,
          "int_value": null
        },
        "key": "firebase_screen_class"
      }
    },
    {
      "element": {
        "value": {
          "string_value": null,
          "float_value": null,
          "double_value": null,
          "int_value": 3600000
        },
        "key": "engagement_time_msec"
      }
    },
    {
      "element": {
        "value": {
          "string_value": "app_entered_background",
          "float_value": null,
          "double_value": null,
          "int_value": null
        },
        "key": "item_name"
      }
    }
  ]
}
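
For reference, a DataFrame with this schema can be built with something like the following. This is a minimal sketch assuming an active SparkSession; note that the element level in the printed schema is only how Spark displays array members, and double_value is omitted because it does not appear in the schema:

from pyspark.sql import SparkSession
from pyspark.sql.types import (ArrayType, DoubleType, LongType,
                               StringType, StructField, StructType)

spark = SparkSession.builder.getOrCreate()

value_type = StructType([
    StructField("string_value", StringType()),
    StructField("int_value", LongType()),
    StructField("float_value", DoubleType()),
])
schema = StructType([
    StructField("event_params", ArrayType(StructType([
        StructField("key", StringType()),
        StructField("value", value_type),
    ]))),
])

rows = [([
    ("firebase_screen_class", ("LoginVC", None, None)),
    ("engagement_time_msec", (None, 3600000, None)),
    ("item_name", ("app_entered_background", None, None)),
],)]

df = spark.createDataFrame(rows, schema)
df.printSchema()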

How do I create a new column at the same row level, holding value.string_value from the element where key == "item_name"? I do not want to filter out rows, since I want to repeat the process for two more keys.

So I want a new schema something like this:

root 
|-- item_name_string_value: string (nullable = true)
|-- firebase_screen_class_string_value: string (nullable = true)
|-- event_params: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- key: string (nullable = true) 
| | |-- value: struct (nullable = true) 
| | | |-- string_value: string (nullable = true) 
| | | |-- int_value: long (nullable = true) 
| | | |-- float_value: double (nullable = true)

I want to achieve this using PySpark.

  • What have you tried? What problem did you encounter? – mck Apr 27 '21 at 19:21
  • @mck I was trying to do something similar to https://stackoverflow.com/questions/43756940/spark-sql-nested-array-conditional-select but I could not convert it to PySpark since the solution was in Scala. – akpatch Apr 27 '21 at 19:24
  • @mck I also looked at the explode option but couldn't extract the value based on a condition. – akpatch Apr 27 '21 at 19:27
  • your target schema doesn't make sense. You can't have both a string and struct inside an array – mck Apr 27 '21 at 19:35
  • @mck I want event_params and item_name_string_value at the same level. I want to extract the value based on the key condition from the array of structs and bring it to root level. – akpatch Apr 27 '21 at 19:37
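
The explode route mentioned in the comments above could look roughly like this. It is only a sketch: row_id is a synthetic key added here because the sample rows have no natural identifier, and the key list is illustrative:

from pyspark.sql.functions import col, explode, first, monotonically_increasing_id

# tag each row so the extracted values can be joined back later
with_id = df.withColumn("row_id", monotonically_increasing_id())

extracted = (with_id
    .select("row_id", explode("event_params").alias("p"))
    .select("row_id",
            col("p.key").alias("key"),
            col("p.value.string_value").alias("string_value"))
    .where(col("key").isin("item_name", "firebase_screen_class"))
    .groupBy("row_id")
    .pivot("key", ["item_name", "firebase_screen_class"])
    .agg(first("string_value")))

# a left join keeps every original row, with nulls where a key was absent
result = with_id.join(extracted, "row_id", "left").drop("row_id")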

1 Answer


This seems to be working for me, adapted from "pyspark: filtering and extract struct through ArrayType column":

from pyspark.sql.functions import col, expr

curDFfil = spark.sql("select event_params from temp_data_l1")
# keep the array elements whose key is 'item_name' and take the first match
df = curDFfil.select(expr("filter(event_params, s -> s.key == 'item_name')").getItem(0).alias('item_name'))
# pull the nested string_value up to the top level
newDf = df.select(col("item_name.value.string_value").alias('item_name_string_value'))
newDf.show(10, False)
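
To keep event_params in place and repeat the extraction for the other keys, as the question asks, the same filter expression also works inside withColumn. This is only a sketch; the key list is illustrative, and indexing the filtered array with [0] yields null when a key is absent:

from pyspark.sql.functions import expr

keys = ["item_name", "firebase_screen_class"]  # add the remaining keys here

result = curDFfil
for k in keys:
    # first element whose key matches, or null if there is none
    result = result.withColumn(
        f"{k}_string_value",
        expr(f"filter(event_params, s -> s.key == '{k}')[0].value.string_value"))

result.printSchema()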