Removing columns in a nested struct in a spark dataframe using PySpark (details in text)

Question

I know that I've asked a similar question here but that was for row filtering. This time I am trying to drop columns instead. I tried to implement Higher Order Functions such as FILTER and others for a while but could not get it to work. I think what I need is a SELECT Higher Order Function but it doesn't seem to exist. Thank you for your help!

I am using pyspark and I have a dataframe object df and this is what the output of df.printSchema() looks like

root
 |-- M_MRN: string (nullable = true)
 |-- measurements: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Observation_ID: string (nullable = true)
 |    |    |-- Observation_Name: string (nullable = true)
 |    |    |-- Observation_Result: string (nullable = true)

I would like to keep only the 'Observation_ID' or 'Observation_Result' columns in 'measurements'. So currently when I run df.select('measurements').take(2) I get

[Row(measurements=[Row(Observation_ID='5', Observation_Name='ABC', Observation_Result='108/72'),
                   Row(Observation_ID='11', Observation_Name='ABC', Observation_Result='70'),
                   Row(Observation_ID='10', Observation_Name='ABC', Observation_Result='73.029'),
                   Row(Observation_ID='14', Observation_Name='XYZ', Observation_Result='23.1')]),
 Row(measurements=[Row(Observation_ID='2', Observation_Name='ZZZ', Observation_Result='3/4'),
                   Row(Observation_ID='5', Observation_Name='ABC', Observation_Result='7')])]

I would like that after I do the above filtering and run df.select('measurements').take(2) I get

[Row(measurements=[Row(Observation_ID='5', Observation_Result='108/72'),
                   Row(Observation_ID='11', Observation_Result='70'),
                   Row(Observation_ID='10', Observation_Result='73.029'),
                   Row(Observation_ID='14', Observation_Result='23.1')]),
 Row(measurements=[Row(Observation_ID='2', Observation_Result='3/4'),
                   Row(Observation_ID='5', Observation_Result='7')])]

Is there a way to do this in pyspark? Thank you in anticipation for your help!

score 3 · Accepted Answer · answered Apr 29 '20 at 22:56

You could use higher order function transform to select your desired fields and put them in a struct.

from pyspark.sql import functions as F
df.withColumn("measurements",F.expr("""transform(measurements\
,x-> struct(x.Observation_ID as Observation_ID,\
             x.Observation_Result as Observation_Result))""")).printSchema()

#root
 #|-- measurements: array (nullable = true)
 #|    |-- element: struct (containsNull = false)
 #|    |    |-- Observation_ID: string (nullable = true)
 #|    |    |-- Observation_Result: string (nullable = true)

whatif instead of selecting, i need to drop one/two columns out of 100s of columns. is there any way to do so? — martian_rover, Oct 05 '21 at 10:37

score 0 · Answer 2 · answered Aug 31 '21 at 09:01

For anyone looking for an answer that works with older pyspark versions, here is one using udfs:

import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

_measurement_type = ArrayType(StructType([
    StructField('Observation_ID', StringType(), True),
    StructField('Observation_Result', StringType(), True)
]))

@f.udf(returnType=_measurement_type)
def higher_order_select(measurements):
    return [(m.Observation_ID, m.Observation_Result) for m in measurements]

df.select(higher_order_select('measurements').alias('measurements')).printSchema()

prints

root
 |-- measurements: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Observation_ID: string (nullable = true)
 |    |    |-- Observation_Result: string (nullable = true)

Removing columns in a nested struct in a spark dataframe using PySpark (details in text)

2 Answers2