I have the following Spark DataFrame:
id | country
------------------
1 | Null
2 | {"date": null, "value": "BRA", "context": "nationality", "state": null}
3 | {"date": null, "value": "ITA", "context": "residence", "state": null}
4 | {"date": null, "value": null, "context": null, "state": null}
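For reference, this is roughly how the frame can be reconstructed (the country column is a struct; this is only my approximation of the real schema, which may differ slightly):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Approximate reconstruction of the example data shown above
df = spark.createDataFrame(
    [
        (1, None),
        (2, {"date": None, "value": "BRA", "context": "nationality", "state": None}),
        (3, {"date": None, "value": "ITA", "context": "residence", "state": None}),
        (4, {"date": None, "value": None, "context": None, "state": None}),
    ],
    "id int, country struct<date:string, value:string, context:string, state:string>",
)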
I want to create a pandas user-defined function that, when called as shown below, produces the output underneath. (I'm working in Databricks notebooks; the display function simply renders the result of the expression inside the parentheses.)
display(df.withColumn("country_context", get_country_context(col("country"))))
would output
id | country | country_context
-----------------------------------
1 | Null | null
2 | {"date": n...| nationality
3 | {"date": n...| residence
4 | {"date": n...| null
The pandas_udf I created is the following:
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

@pandas_udf("string")
def get_country_context(country_series: pd.Series) -> pd.Series:
    return country_series.map(
        lambda d: d.get("context", "Null") if d else "Null"
    )

display(df
    .withColumn("country_context", get_country_context(col("country"))))
I get the following error:
PythonException: 'AttributeError: 'DataFrame' object has no attribute 'map''
I know I don't need a udf, nor a pandas_udf, for this (a plain struct-field access like the snippet below does the job), but I would like to understand why my function doesn't work.
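For completeness, the UDF-free version I have in mind is simply:

from pyspark.sql.functions import col

# Read the struct field directly; returns null when the struct or the field is null
display(df.withColumn("country_context", col("country.context")))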