I have the following Spark DataFrame:

id | country
------------------
1  | Null
2  | {"date": null, "value": "BRA", "context": "nationality", "state": null}
3  | {"date": null, "value": "ITA", "context": "residence", "state": null}
4  | {"date": null, "value": null, "context": null, "state": null}

I want to create a pandas user-defined function that, when called as shown below, produces the output shown after it.

(I'm working in Databricks notebooks; the display function simply prints the output of the command in parentheses.)

display(df.withColumn("country_context", get_country_context(col("country"))))

The expected output is:

id | country      | country_context
-----------------------------------
1  | Null         | null
2  | {"date": n...| nationality 
3  | {"date": n...| residence
4  | {"date": n...| null

The pandas_udf I created is the following:

from pyspark.sql.functions import pandas_udf, col
import pandas as pd

@pandas_udf("string")
def get_country_context(country_series: pd.Series) -> pd.Series:
  return country_series.map(lambda d:
                            d.get("context", "Null") 
                            if d else "Null")

display(df
        .withColumn("country_context", get_country_context(col("country"))))

I get the following error:

PythonException: 'AttributeError: 'DataFrame' object has no attribute 'map''

I know I don't need a UDF or a pandas_udf for this, but I would like to understand why my function doesn't work.

Tytire Recubans

1 Answer

I changed the type hints from Series -> Series to Iterator[Series] -> Iterator[Series] and it works. I'm not sure why, but it does.

from typing import Iterator

@pandas_udf('string')
def my_udf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    return map(lambda d: d.get("context", "Null"), iterator)
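
A likely explanation, assuming `country` is a struct column: Spark passes struct columns to a pandas UDF as a `pandas.DataFrame` (one column per struct field) rather than a `pandas.Series`, which matches the error in the question. In the iterator version, the built-in `map` calls the lambda once per batch, and `DataFrame.get("context", "Null")` returns the whole `context` column. The sketch below reproduces this with plain pandas; the `batch` DataFrame is a hypothetical stand-in for what Spark would hand to the function.

```python
import pandas as pd

# Hypothetical stand-in for one batch of the struct column `country`:
# one pandas column per struct field, one row per Spark row.
batch = pd.DataFrame(
    {
        "date": [None, None, None, None],
        "value": [None, "BRA", "ITA", None],
        "context": [None, "nationality", "residence", None],
        "state": [None, None, None, None],
    }
)

# The built-in map() applies the lambda once per batch (a DataFrame),
# not once per row.
batches = iter([batch])
results = list(map(lambda d: d.get("context", "Null"), batches))

# DataFrame.get("context", default) returns the "context" column as a Series.
# The "Null" default would only be used if that column were missing entirely.
context = results[0]
print(type(context).__name__)
print(context.dropna().tolist())
```

Rows with a missing context stay missing in the returned Series, so the `"Null"` string never appears as a per-row fallback in this version.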
Tytire Recubans
    Changing the type hint was not the only change. In the question, the `country_series.map` method is called, while in the answer you call [Python's built-in `map` function](https://docs.python.org/3/library/functions.html#map); that is the important part of the solution. – Marek Oct 18 '21 at 15:44
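
Following up on the comment above, a non-iterator variant should also be possible if the argument is treated as a DataFrame. This is a minimal sketch, assuming `country` is a struct column; `get_country_context_batch` and `sample` are hypothetical names, and the Spark registration is shown only as a comment because it is not run here.

```python
import pandas as pd


def get_country_context_batch(country: pd.DataFrame) -> pd.Series:
    # A struct column reaches a pandas UDF as a DataFrame,
    # so the field is selected as a column instead of calling .map per row.
    return country["context"]


# With PySpark available, the function could be registered like this
# (assumed usage, not executed in this sketch):
#   get_country_context = pandas_udf(get_country_context_batch, "string")

# Plain-pandas check with a small hypothetical batch.
sample = pd.DataFrame({"value": ["BRA", None], "context": ["nationality", None]})
out = get_country_context_batch(sample)
print(out.dropna().tolist())
```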