Good Afternoon.
I am trying to perform a join in PySpark that uses a complex set of conditions to produce a single value.
A minimal example of what I am trying to achieve looks like the following. Imagine a set of events that can occur at discrete times (between t=0 and t=40). Each event has a set of three independent boolean properties that describe the nature of the event. There is some time-dependent value associated with the triggering of each property, contained in a lookup table. For each event, I would like to determine the sum of all the relevant values for that event.
My first dataframe, df_1, lists the events, the time at which each event occurred, and the boolean properties associated with it:
+-------------+------------+------------+------------+------------+
| EVENT_INDEX | EVENT_TIME | PROPERTY_1 | PROPERTY_2 | PROPERTY_3 |
+-------------+------------+------------+------------+------------+
| Event_1 | 13 | 1 | 0 | 1 |
| Event_2 | 24 | 0 | 1 | 1 |
| Event_3 | 35 | 1 | 0 | 0 |
+-------------+------------+------------+------------+------------+
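For reproducibility, a minimal version of df_1 can be built like this (assuming an existing SparkSession named spark):

df_1 = spark.createDataFrame(
    [
        ("Event_1", 13, 1, 0, 1),
        ("Event_2", 24, 0, 1, 1),
        ("Event_3", 35, 1, 0, 0),
    ],
    ["EVENT_INDEX", "EVENT_TIME", "PROPERTY_1", "PROPERTY_2", "PROPERTY_3"],
)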
The second dataframe, df_2, is the lookup table describing the value associated with a particular property being TRUE at a particular time. Since many values repeat across time buckets, each row of this dataframe gives an inclusive range of times for which the property takes a specific value. The time ranges are not consistently sized and can vary wildly between different properties:
+------------+----------+---------------+-------+
| START_TIME | END_TIME | PROPERTY_NAME | VALUE |
+------------+----------+---------------+-------+
| 0 | 18 | PROPERTY_1 | 0.1 |
| 19 | 40 | PROPERTY_1 | 0.8 |
| 0 | 20 | PROPERTY_2 | 0.7 |
| 20 | 24 | PROPERTY_2 | 0.3 |
| 25 | 40 | PROPERTY_2 | 0.7 |
| 0 | 40 | PROPERTY_3 | 0.5 |
+------------+----------+---------------+-------+
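And the corresponding minimal version of the lookup table:

df_2 = spark.createDataFrame(
    [
        (0, 18, "PROPERTY_1", 0.1),
        (19, 40, "PROPERTY_1", 0.8),
        (0, 20, "PROPERTY_2", 0.7),
        (20, 24, "PROPERTY_2", 0.3),
        (25, 40, "PROPERTY_2", 0.7),
        (0, 40, "PROPERTY_3", 0.5),
    ],
    ["START_TIME", "END_TIME", "PROPERTY_NAME", "VALUE"],
)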
Desired Output:
Since Event_1 occurred at time t=13 with PROPERTY_1 and PROPERTY_3 triggered, the expected sum of the values according to df_2 should be 0.1 (from the PROPERTY_1 0-18 bucket) + 0.5 (from the PROPERTY_3 0-40 bucket) = 0.6. In the same way, Event_2 should have a value of 0.3 (remember that bucket start/end times are inclusive, so this comes from the 20-24 bucket) + 0.5 = 0.8. Finally, Event_3 = 0.8.
+-------------+------------+------------+------------+------------+-------------+
| EVENT_INDEX | EVENT_TIME | PROPERTY_1 | PROPERTY_2 | PROPERTY_3 | TOTAL_VALUE |
+-------------+------------+------------+------------+------------+-------------+
| Event_1 | 13 | 1 | 0 | 1 | 0.6 |
| Event_2 | 24 | 0 | 1 | 1 | 0.8 |
| Event_3 | 35 | 1 | 0 | 0 | 0.8 |
+-------------+------------+------------+------------+------------+-------------+
For my initial test dataset, the event dataframe df_1 contains ~20,000 events spread over 2000 time buckets, each event has ~44 properties, and the lookup table df_2 has ~53,000 rows. As I would like to expand this process to significantly more data (potentially a couple of orders of magnitude), I am very interested in a parallelisable solution to this problem. For instance, I suspect that summarising df_2 as a Python dictionary and broadcasting it to my executors will not be feasible given the volume of data.
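To be explicit, the broadcast approach I have in mind (and am worried about) would be an untested sketch along these lines, which requires collecting the whole of df_2 onto the driver first:

lookup = {}
for row in df_2.collect():  # pulls the entire lookup table onto the driver
    lookup.setdefault(row["PROPERTY_NAME"], []).append(
        (row["START_TIME"], row["END_TIME"], row["VALUE"])
    )
bc_lookup = spark.sparkContext.broadcast(lookup)
# Each executor would then scan the broadcast ranges in memory for every event,
# which might work at ~53,000 rows but seems unlikely to hold up at 100x that size.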
Since I'm trying to add a single column to each row in df_1
, I have tried to accomplish the task using a nested map that looks similar to the following:
import pyspark.sql.functions


def calculate_value(df_2):
    def _calculate_value(row):
        row_dict = row.asDict()
        rolling_value = 0.0
        # Only properties that actually fired for this event should contribute.
        for property_name in [key for key in row_dict.keys() if "PROPERTY" in key]:
            if not row_dict[property_name]:
                continue
            # Look up the value for this property in the time bucket that
            # contains the event time (bucket boundaries are inclusive).
            additional_value = (
                df_2
                .filter(
                    (pyspark.sql.functions.col("PROPERTY_NAME") == property_name)
                    & (pyspark.sql.functions.col("START_TIME") <= row_dict["EVENT_TIME"])
                    & (pyspark.sql.functions.col("END_TIME") >= row_dict["EVENT_TIME"])
                )
                .select("VALUE")
                .collect()
            )[0][0]
            rolling_value += additional_value
        row_dict["TOTAL_VALUE"] = rolling_value
        return pyspark.sql.Row(**row_dict)
    return _calculate_value
This code is able to perform the join on the driver (by running calculate_value(df_2)(df_1.rdd.take(1)[0])); however, when I try to perform the parallelised map:
(
    df_1
    .rdd
    .map(calculate_value(df_2))
)
I receive a Py4JError indicating that it could not serialize the dataframe object df_2. This is confirmed elsewhere on Stack Overflow, e.g. Pyspark: PicklingError: Could not serialize object:.
I opted to use a map rather than a join because I am only adding a single column to each row in df_1, and because of the difficulty of encoding the complex logic required to identify the correct rows in df_2 to add up for each event: first, check which properties fired (are TRUE) in df_1; then select those properties in df_2; then downselect to only the rows that are relevant given the event time; and finally add up all the matching values.
I am also trying to think of a way to reconfigure df_2 in a sustainable, scalable manner that would allow for a simpler join/map, but I am not sure how best to go about it.
Any advice would be greatly appreciated.