I have a PySpark DF like below (all in Databricks Notebook):
+-------+--------------------------+------+----------+
|city |brand |weight|date |
+-------+--------------------------+------+----------+
|Dallas |['BMW', 'Ford', 'Chevy'] |0.94 |2021-05-05|
|Chicago|['Ford', 'VW', 'Toyota'] |0.92 |2021-05-07|
|Boston |['BMW', 'Toyota', 'Chevy']|0.78 |2021-05-08|
|Atlanta|['Toyota', 'VW', 'Chevy'] |0.83 |2021-05-09|
|Phoenix|['BMW', 'Honda', 'Toyota']|0.89 |2021-05-12|
+-------+--------------------------+------+----------+
Each row is distinct...in the sense that even if two rows have same date (and same brand values), the city will be different.
I want to do this: For each row,
I want to get the list of car-brands.
For each of those car-brands, I want to formulate a sql-query to a databricks table and fetch some text. My SQL query looks like this:
sql_stmt = f"SELECT text FROM {data_src_tbl} where to_date(date) == '{dt}' and city == '{city}' and text LIKE '%{car-brand}%';"
where dt is from date-column, city is from city-column and car-brand is each element from the list (for eg: for row-0: ['BMW', 'Ford', 'Chevy']). The intent is to fetch (using sql query to a table in databricks) all text related to ('BMW', , ), concatenate them as a single long string. Likewise, do the same for other elements of this list in first row. I have this parameterized query perfected and work independently. The real question is down below..
The desired output (for first row):
+-------+--------------------------------------------------------+------+----------+
|city |texts |weight|date |
+-------+--------------------------------------------------------+------+----------+
|Dallas |[{'BMW': "....", 'Ford': "....", 'Chevy': "..."} ] |0.94 |2021-05-05|
|Chicago|[{'Ford': "....", 'VW': "...", 'Toyota':"...."}] |0.92 |2021-05-07|
|Boston |['BMW': "....", 'Toyota': "....", 'Chevy': "...."}] |0.78 |2021-05-08|
|Atlanta|['Toyota': "....", 'VW': "....", 'Chevy': "...."] |0.83 |2021-05-09|
|Phoenix|['BMW': "....", 'Honda': "....", 'Toyota': "...."] |0.89 |2021-05-12|
+-------+--------------------------------------------------------+------+----------+
OR it could be like this also (for first row):
+-------+--------------------------------------------------------+------+----------+
|city |texts |weight|date |
+-------+--------------------------------------------------------+------+----------+
|Dallas |{"BMW": "...."} |0.94 |2021-05-05|
|Dallas |{"Ford": "...."} |0.94 |2021-05-05|
|Dallas |{"Chevy": "...."} |0.94 |2021-05-05|
OR the brand and text could be in separate columns. The point is I want to retrieve texts corresponding to each brand, city and date somehow.
This is what I have done so far:
- created a UDF (that accepts dt, city, brand-list as input args) and for each dt, city and brand...
- creates a sql-stmt (parameterized) like the one I mentioned above
- Execute the sql like this:
df = spark.sql(sql_stmt)
Problem I faced:
Since I am executing a spark.sql
stmt from within an UDF, it throws this error:
RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Question: Is there a way to achieve what I am looking to do? Either using UDF or without (I don't care). Any suggestions (with code examples) would be highly appreciated.
Some SO references I have looked so far: