
I am a beginner in Databricks and PySpark. Currently, I have a PySpark DataFrame which contains 3 columns:

  • Date
  • amount
  • Currency

I would like the amount column converted to EUR, calculated with the exchange rate of the day. For that purpose, I am using the exchange rates API to find the exchange rate, taking the date and currency as parameters.

First, I defined a function which makes the API call to find the exchange rate.

Here is my code:

import requests
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

def API(val1, currency, date):
  # Query the historical endpoint for the given date, restricted to one currency
  r = requests.get('https://api.exchangeratesapi.io/' + date, params={'symbols': currency})
  # Parse the JSON response into a Spark DataFrame (note: uses spark and sc)
  df = spark.read.json(sc.parallelize([r.json()]))
  df_value = df.select(F.col("rates." + currency))
  value = df_value.collect()[0][0]  # first row, first (only) selected column
  val = val1 * (1 / value)

  return float(val)
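For example, a call like `API(100.0, 'USD', '2020-11-09')` should return the EUR equivalent of 100 USD at that day's rate (the values here are just an illustration).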

Then, I defined a UDF to call this function on my DataFrame:

API_Convert = F.udf(lambda x, y, z: API(x, y, z) if y != 'EUR' else x, FloatType())

When I try to execute this part, I get a pickling error which I absolutely don't understand:

df = df.withColumn('amount', API_Convert('amount', 'currency', 'date'))

PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Could you please help me fix this issue?

Many thanks,

Jeg
  • I think it's the `spark.read` within the `API` function that's the reason. You can only use plain Python functions within a `udf`. I faced a similar issue due to this and learned the hard way. – samkart Nov 08 '20 at 16:19
  • Thanks for the tip, I will dig deeper into this, but I think you are correct! Well, now I don't know how to tackle the issue ^^ – Jeg Nov 08 '20 at 17:12
  • Can a join not help? It appears you could melt the exchange rates and then join based on the currency. – samkart Nov 09 '20 at 07:16
  • Yes, I am exactly on this track... but the DataFrame returned is a nested struct, which is not very comfortable to handle. So much effort for such a small transformation ^^ – Jeg Nov 09 '20 at 18:32
  • In pandas, I'd have done `pd.DataFrame.from_dict(requests.get('https://api.exchangeratesapi.io/'+'2020-11-09').json()['rates'], orient='index').reset_index()`. This can be converted to a Spark DataFrame after transformations of choice (see the sketch after this list). – samkart Nov 10 '20 at 05:26
  • Wonderful idea @samkart! I think I can now move smoothly. Thanks a lot for your help. – Jeg Nov 10 '20 at 10:08
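For readers following along, here is a rough sketch of the join-based approach from the comments above: fetch each day's full rates table once on the driver, build a small (date, currency, rate) DataFrame, and join it onto the data, so that nothing running inside a UDF ever touches `spark` or `sc`. The helper `rates_for_date` is a made-up name for illustration, and the sketch assumes the date column holds 'YYYY-MM-DD' strings and that the API quotes rates against a EUR base, as in the question's own code:

import requests
from pyspark.sql import functions as F

# Hypothetical helper: fetch all rates for one date as (date, currency, rate) tuples
def rates_for_date(date):
    rates = requests.get('https://api.exchangeratesapi.io/' + date).json()['rates']
    return [(date, cur, float(rate)) for cur, rate in rates.items()]

# Build the lookup table on the driver: one API call per distinct date
dates = [row[0] for row in df.select('date').distinct().collect()]
rates_df = spark.createDataFrame(
    [r for d in dates for r in rates_for_date(d)],
    ['date', 'currency', 'rate'])

# Join and convert; EUR rows keep their amount, mirroring the y != 'EUR' guard above
df = (df.join(rates_df, on=['date', 'currency'], how='left')
        .withColumn('amount',
                    F.when(F.col('currency') == 'EUR', F.col('amount'))
                     .otherwise(F.col('amount') / F.col('rate'))))

Dividing by the rate mirrors `val1 * (1 / value)` in the original function, since the API quotes how many units of each currency one EUR buys.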

0 Answers