Good Afternoon.
I am trying to perform a join in PySpark that uses a complex set of conditions to produce a single value.
A minimal example of what I am trying to achieve looks like the following. Imagine a set of events that can occur at discrete times (between t=0 and t=40). Each event has a set of three independent boolean properties that describe the nature of the event. There is some time-dependent value associated with the triggering of each property, contained in a lookup table. For each event, I would like to determine the sum of all the relevant values for that event.
My first dataframe, df_1, lists the events, the time at which each event occurred, and the boolean properties associated with it:
+-------------+------------+------------+------------+------------+
| EVENT_INDEX | EVENT_TIME | PROPERTY_1 | PROPERTY_2 | PROPERTY_3 |
+-------------+------------+------------+------------+------------+
| Event_1 | 13 | 1 | 0 | 1 |
| Event_2 | 24 | 0 | 1 | 1 |
| Event_3 | 35 | 1 | 0 | 0 |
+-------------+------------+------------+------------+------------+
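For reproducibility, a minimal version of df_1 can be built like this (assuming an existing SparkSession named spark):

df_1 = spark.createDataFrame(
    [
        ("Event_1", 13, 1, 0, 1),
        ("Event_2", 24, 0, 1, 1),
        ("Event_3", 35, 1, 0, 0),
    ],
    ["EVENT_INDEX", "EVENT_TIME", "PROPERTY_1", "PROPERTY_2", "PROPERTY_3"],
)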
The second dataframe, df_2, is the lookup table describing the value associated with a particular property being TRUE at a particular time. Since many values repeat across time buckets, each row of this dataframe gives an inclusive range of times for which the property takes a specific value. The time ranges are not consistently sized and can vary wildly between different properties:
+------------+----------+---------------+-------+
| START_TIME | END_TIME | PROPERTY_NAME | VALUE |
+------------+----------+---------------+-------+
| 0 | 18 | PROPERTY_1 | 0.1 |
| 19 | 40 | PROPERTY_1 | 0.8 |
| 0 | 20 | PROPERTY_2 | 0.7 |
| 20 | 24 | PROPERTY_2 | 0.3 |
| 25 | 40 | PROPERTY_2 | 0.7 |
| 0 | 40 | PROPERTY_3 | 0.5 |
+------------+----------+---------------+-------+
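And the corresponding minimal version of the lookup table:

df_2 = spark.createDataFrame(
    [
        (0, 18, "PROPERTY_1", 0.1),
        (19, 40, "PROPERTY_1", 0.8),
        (0, 20, "PROPERTY_2", 0.7),
        (20, 24, "PROPERTY_2", 0.3),
        (25, 40, "PROPERTY_2", 0.7),
        (0, 40, "PROPERTY_3", 0.5),
    ],
    ["START_TIME", "END_TIME", "PROPERTY_NAME", "VALUE"],
)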
Desired Output:
Since Event_1 occurred at time t=13 with PROPERTY_1 and PROPERTY_3 triggered, the expected sum of the values according to df_2 should be 0.1 (from the PROPERTY_1 0-18 bucket) + 0.5 (from the PROPERTY_3 0-40 bucket) = 0.6. In the same way, Event_2 should have a value of 0.3 (remember that bucket start/end times are inclusive, so this comes from the 20-24 bucket) + 0.5 = 0.8. Finally, Event_3 = 0.8.
+-------------+------------+------------+------------+------------+-------------+
| EVENT_INDEX | EVENT_TIME | PROPERTY_1 | PROPERTY_2 | PROPERTY_3 | TOTAL_VALUE |
+-------------+------------+------------+------------+------------+-------------+
| Event_1 | 13 | 1 | 0 | 1 | 0.6 |
| Event_2 | 24 | 0 | 1 | 1 | 0.8 |
| Event_3 | 35 | 1 | 0 | 0 | 0.8 |
+-------------+------------+------------+------------+------------+-------------+
For my initial test dataset, the event dataframe df_1 contains ~20,000 events spread over 2000 time buckets, each event has ~44 properties, and the lookup table df_2 has ~53,000 rows. As I would like to expand this process to significantly more data (potentially a couple of orders of magnitude), I am very interested in a parallelisable solution to this problem. For instance, I suspect that summarising df_2 as a Python dictionary and broadcasting it to my executors will not be feasible given the volume of data.
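To be explicit, the broadcast approach I have in mind (and am worried about) would be an untested sketch along these lines, which requires collecting the whole of df_2 onto the driver first:

lookup = {}
for row in df_2.collect():  # pulls the entire lookup table onto the driver
    lookup.setdefault(row["PROPERTY_NAME"], []).append(
        (row["START_TIME"], row["END_TIME"], row["VALUE"])
    )
bc_lookup = spark.sparkContext.broadcast(lookup)
# Each executor would then scan the broadcast ranges in memory for every event,
# which might work at ~53,000 rows but seems unlikely to hold up at 100x that size.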
Since I'm trying to add a single column to each row in df_1
, I have tried to accomplish the task using a nested map that looks similar to the following:
import pyspark.sql.functions


def calculate_value(df_2):
    def _calculate_value(row):
        row_dict = row.asDict()
        rolling_value = 0.0
        # Only properties that actually fired for this event should contribute.
        for property_name in [key for key in row_dict.keys() if "PROPERTY" in key]:
            if not row_dict[property_name]:
                continue
            # Look up the value for this property in the time bucket that
            # contains the event time (bucket boundaries are inclusive).
            additional_value = (
                df_2
                .filter(
                    (pyspark.sql.functions.col("PROPERTY_NAME") == property_name)
                    & (pyspark.sql.functions.col("START_TIME") <= row_dict["EVENT_TIME"])
                    & (pyspark.sql.functions.col("END_TIME") >= row_dict["EVENT_TIME"])
                )
                .select("VALUE")
                .collect()
            )[0][0]
            rolling_value += additional_value
        row_dict["TOTAL_VALUE"] = rolling_value
        return pyspark.sql.Row(**row_dict)
    return _calculate_value
This code is able to perform the join on the driver (by running calculate_value(df_2)(df_1.rdd.take(1)[0])); however, when I try to perform the parallelised map:
(
    df_1
    .rdd
    .map(calculate_value(df_2))
)
I receive a Py4JError indicating that it could not serialize the dataframe object df_2. This is confirmed elsewhere on Stack Overflow, e.g. Pyspark: PicklingError: Could not serialize object:.
I opted to use a map rather than a join because I am only adding a single column to each row in df_1, and because of the difficulty of encoding the complex logic required to identify the correct rows in df_2 to add up for each event: first, check which properties fired (are TRUE) in df_1; then select those properties in df_2; then downselect to only the rows that are relevant given the event time; and finally add up all the matching values.
I am also trying to think of a way to reconfigure df_2 in a sustainable, scalable manner that would allow for a simpler join/map, but I am not sure how best to go about it.
Any advice would be greatly appreciated.