PySpark execute SQL from within a UDF

Question

I have a PySpark DF like below (all in Databricks Notebook):

+-------+--------------------------+------+----------+
|city   |brand                     |weight|date      |
+-------+--------------------------+------+----------+
|Dallas |['BMW', 'Ford', 'Chevy']  |0.94  |2021-05-05|
|Chicago|['Ford', 'VW', 'Toyota']  |0.92  |2021-05-07|
|Boston |['BMW', 'Toyota', 'Chevy']|0.78  |2021-05-08|
|Atlanta|['Toyota', 'VW', 'Chevy'] |0.83  |2021-05-09|
|Phoenix|['BMW', 'Honda', 'Toyota']|0.89  |2021-05-12|
+-------+--------------------------+------+----------+

Each row is distinct...in the sense that even if two rows have same date (and same brand values), the city will be different.

I want to do this: For each row,

I want to get the list of car-brands.
For each of those car-brands, I want to formulate a sql-query to a databricks table and fetch some text. My SQL query looks like this:

sql_stmt = f"SELECT text FROM {data_src_tbl} where to_date(date) == '{dt}' and city == '{city}' and text LIKE '%{car-brand}%';"

where dt is from date-column, city is from city-column and car-brand is each element from the list (for eg: for row-0: ['BMW', 'Ford', 'Chevy']). The intent is to fetch (using sql query to a table in databricks) all text related to ('BMW', , ), concatenate them as a single long string. Likewise, do the same for other elements of this list in first row. I have this parameterized query perfected and work independently. The real question is down below..

The desired output (for first row):

+-------+--------------------------------------------------------+------+----------+
|city   |texts                                                   |weight|date      |
+-------+--------------------------------------------------------+------+----------+
|Dallas |[{'BMW': "....", 'Ford': "....", 'Chevy': "..."} ]      |0.94  |2021-05-05|
|Chicago|[{'Ford': "....", 'VW': "...", 'Toyota':"...."}]        |0.92  |2021-05-07|
|Boston |['BMW': "....", 'Toyota': "....", 'Chevy': "...."}]     |0.78  |2021-05-08|
|Atlanta|['Toyota': "....", 'VW': "....", 'Chevy': "...."]       |0.83  |2021-05-09|
|Phoenix|['BMW': "....", 'Honda': "....", 'Toyota': "...."]      |0.89  |2021-05-12|
+-------+--------------------------------------------------------+------+----------+

OR it could be like this also (for first row):

+-------+--------------------------------------------------------+------+----------+
|city   |texts                                                   |weight|date      |
+-------+--------------------------------------------------------+------+----------+
|Dallas |{"BMW": "...."}                                         |0.94  |2021-05-05|
|Dallas |{"Ford": "...."}                                        |0.94  |2021-05-05|
|Dallas |{"Chevy": "...."}                                       |0.94  |2021-05-05|

OR the brand and text could be in separate columns. The point is I want to retrieve texts corresponding to each brand, city and date somehow.

This is what I have done so far:

created a UDF (that accepts dt, city, brand-list as input args) and for each dt, city and brand...
creates a sql-stmt (parameterized) like the one I mentioned above
Execute the sql like this: df = spark.sql(sql_stmt)

Problem I faced: Since I am executing a spark.sql stmt from within an UDF, it throws this error:

RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Question: Is there a way to achieve what I am looking to do? Either using UDF or without (I don't care). Any suggestions (with code examples) would be highly appreciated.

Some SO references I have looked so far:

score -1 · Answer 1 · answered Jun 03 '23 at 17:01

I got around this problem by not using an UDF. I instead extracted each column value as a list, put them all in a list of tuples. And for each item in the tuple (date, city, weight, brand_list). For each date, city: and for each of the brands in brand-list, I formulated a sql-stmt and executed it and collected in a a larger tuple and made a DF out of it.

Note that this doesn't look like an elegant solution in the sense that we are using only the drive-node and not all the worker nodes. My data is small and I do not care at this point. But, I will have to revisit...if things become large.

For you use case, there is nothing that prevents you from using sql joins, all the brands and texts value are struct columns in your tables which can be accessed using dot notation. Look into explode, posexplode, lateral views etc. Also, you are thinking in terms of loops, you need to think in terms of [set operations](https://youtu.be/DBinD6H0i68?t=100) — adrien, Aug 10 '23 at 06:34

PySpark execute SQL from within a UDF

1 Answers1