
I have the following problem. I have a dataset that keeps track of status changes:

id  valid  eventdate
 1  False 2020-05-01
 1   True 2020-05-06
 2   True 2020-05-04
 2  False 2020-05-07
 2   True 2020-05-09
 3  False 2020-05-11

Goal:

SELECT valid FROM table WHERE id = 1 AND eventdate = '2020-05-05'

I need to know, for any given date between the first event date and today, what the status is for a given id. For example, for id 1, valid was still False on the 5th of May.

In Pandas I have a solution like this, where I use pivot and ffill to fill in the empty values, and then melt to turn it back into a three-column dataframe.

from datetime import datetime
import pandas as pd

test_data = [
  [1,"False","2020-05-01"],
  [1,"True","2020-05-06"],
  [2,"True","2020-05-04"],
  [2,"False","2020-05-07"],
  [2,"True","2020-05-09"],
  [3,"False","2020-05-11"]
]
# Create inputframe
df = pd.DataFrame(test_data, columns=['id', 'valid', 'eventdate'])
df['id'] = df['id'].astype(str)
df['valid'] = df['valid'] == "True"
df['eventdate'] = pd.to_datetime(df['eventdate'])
print(df.head(6))
#   id  valid  eventdate
# 0  1  False 2020-05-01
# 1  1   True 2020-05-06
# 2  2   True 2020-05-04
# 3  2  False 2020-05-07
# 4  2   True 2020-05-09
# 5  3  False 2020-05-11

# Create full time range as frame
timeframe = pd.date_range(start=min(df['eventdate']),
                          end=datetime.now().date()).to_frame().reset_index(drop=True).rename(columns={0: 'eventdate'})
print(timeframe.head())
#    eventdate
# 0 2020-05-01
# 1 2020-05-02
# 2 2020-05-03
# 3 2020-05-04
# 4 2020-05-05

# Merge timeframe into original frame
df = df.merge(timeframe,
              left_on='eventdate',
              right_on='eventdate',
              how='right')
print(df.sort_values('eventdate').head())
#     id  valid  eventdate
# 0    1  False 2020-05-01
# 6  NaN    NaN 2020-05-02
# 7  NaN    NaN 2020-05-03
# 2    2   True 2020-05-04
# 8  NaN    NaN 2020-05-05

# 1. Pivot to get dates on rows and ids as columns
# 2. Forward fill values per id
# 3. Fill remaining NaNs with False
df = df.pivot(index='eventdate',
              columns='id',
              values='valid')\
       .ffill()\
       .fillna(False)
print(df.head())
# id            NaN      1      2      3
# eventdate                             
# 2020-05-01  False  False  False  False
# 2020-05-02  False  False  False  False
# 2020-05-03  False  False  False  False
# 2020-05-04  False  False   True  False
# 2020-05-05  False  False   True  False

# Drop NaN column and reset the index
df = df.loc[:, df.columns.notnull()].reset_index()
# Melt the columns back
out = pd.melt(df,
              id_vars='eventdate',
              value_name='valid')
print(out.head(10))
#    eventdate id  valid
# 0 2020-05-01  1  False
# 1 2020-05-02  1  False
# 2 2020-05-03  1  False
# 3 2020-05-04  1  False
# 4 2020-05-05  1  False
# 5 2020-05-06  1   True
# 6 2020-05-07  1   True
# 7 2020-05-08  1   True
# 8 2020-05-09  1   True
# 9 2020-05-10  1   True

I am trying to achieve the same in Spark, but there is no built-in forward fill. I do know how to get the latest status per id:

w = Window().partitionBy("id").orderBy(F.col("eventdate").desc())
df.withColumn("rn", F.row_number().over(w)) \
  .where(F.col("rn") == 1) \
  .selectExpr("id", "valid", "eventdate AS last_change") \
  .dropna() \
  .show()

Pivoting can be done with:

df\
.select(["id", "valid", "eventdate"])\
.groupBy(["eventdate"])\
.pivot("id")\
.agg(F.min("valid"))\
.drop('null')\
.sort('eventdate')\
.show()

To do the forward fill I got this far by limiting the dataset to only one id:

from datetime import datetime
import pyspark.sql.functions as F
from pyspark.sql import Window

test_data = [
  [1,"False","2020-05-01"],
  [1,"True","2020-05-06"],
  [2,"True","2020-05-04"],
  [2,"False","2020-05-07"],
  [2,"True","2020-05-09"],
  [3,"False","2020-05-11"]
]
# Create dataframe
df = sc\
  .parallelize(test_data)\
  .toDF(("id", "valid", "eventdate"))\
  .withColumn("eventdate", F.to_date(F.to_timestamp("eventdate")))\
  .withColumn("valid", F.when(F.col("valid") == "True", 1).otherwise(0))
df.createOrReplaceTempView("df")
# Create event frame
event_dates = spark.sql("SELECT sequence(min(eventdate), CURRENT_DATE(), interval 1 day) as eventdate FROM df")\
                   .withColumn("eventdate",
                               F.explode(F.col("eventdate")))
# Join dates and data
df = df.join(event_dates, on='eventdate', how='right')

df2 = df.where(df.id == 1)\
  .join(event_dates, on='eventdate', how='right')\
  .withColumn('id', F.lit(1))
#df2.sort('eventdate').show()
# +----------+---+-----+
# | eventdate| id|valid|
# +----------+---+-----+
# |2020-05-01|  1|    0|
# |2020-05-02|  1| null|
# |2020-05-03|  1| null|
# |2020-05-04|  1| null|
# |2020-05-05|  1| null|
# |2020-05-06|  1|    1|
# |2020-05-07|  1| null|
# |2020-05-08|  1| null|
# |2020-05-09|  1| null|
# |2020-05-10|  1| null|
# |2020-05-11|  1| null|
# |2020-05-12|  1| null|
# |2020-05-13|  1| null|
# +----------+---+-----+

# Forward fill
window = Window.partitionBy('id')\
               .orderBy('eventdate')\
               .rowsBetween(Window.unboundedPreceding, 0)
# Last non-null value up to and including the current row
read_last = F.last(df2['valid'], ignorenulls=True).over(window)
df2.withColumn("ffill", read_last).show()
# +----------+---+-----+-----+
# | eventdate| id|valid|ffill|
# +----------+---+-----+-----+
# |2020-05-01|  1|    0|    0|
# |2020-05-02|  1| null|    0|
# |2020-05-03|  1| null|    0|
# |2020-05-04|  1| null|    0|
# |2020-05-05|  1| null|    0|
# |2020-05-06|  1|    1|    1|
# |2020-05-07|  1| null|    1|
# |2020-05-08|  1| null|    1|
# |2020-05-09|  1| null|    1|
# |2020-05-10|  1| null|    1|
# |2020-05-11|  1| null|    1|
# |2020-05-12|  1| null|    1|
# |2020-05-13|  1| null|    1|
# +----------+---+-----+-----+

My first question is whether this is the right approach at all. The pivot-and-melt approach produces a long table with only a few columns, in which a lot of redundant data is stored. Maybe Spark is not the right tool for the problem, or rather, the problem does not lend itself well to Spark. I know that ideally you would want to exploit the parallelism, perhaps by broadcasting the timeframe to all nodes and calculating the forward fill per id on each node?
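
Something like the following rough sketch is what I have in mind for that per-id forward fill over the full grid, without pivoting. It assumes `df` is the sparse events dataframe built above (before it is right-joined with `event_dates`) and `event_dates` is the exploded calendar range:

import pyspark.sql.functions as F
from pyspark.sql import Window

# Full (id, eventdate) grid: one row per id per calendar day
grid = df.select("id").distinct().crossJoin(event_dates)

# Attach the known status changes; days without a change stay null
sparse = grid.join(df, on=["id", "eventdate"], how="left")

# Carry the last known status forward within each id
w = Window.partitionBy("id")\
          .orderBy("eventdate")\
          .rowsBetween(Window.unboundedPreceding, 0)

filled = sparse.withColumn("valid", F.last("valid", ignorenulls=True).over(w))\
               .fillna({"valid": 0})  # before the first event, default to 0 (False)

filled.sort("id", "eventdate").show()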

Would it be better to use a different approach, for example storing the enddate of each event, like this:

id  valid  eventdate enddate
 1  False 2020-05-01 2020-05-06
 1   True 2020-05-06 2999-12-31
 2   True 2020-05-04 2020-05-07
 2  False 2020-05-07 2020-05-09
 2   True 2020-05-09 2999-12-31
 3  False 2020-05-11 2999-12-31

and then query with something like:

SELECT valid FROM table WHERE id = 1 AND '2020-05-05' BETWEEN eventdate AND enddate
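
In Spark I imagine the enddate could be derived with a window lead. A rough sketch, assuming `df` is the sparse events dataframe from above (eventdate already a date, valid as 0/1):

import pyspark.sql.functions as F
from pyspark.sql import Window

w = Window.partitionBy("id").orderBy("eventdate")

# enddate = date of the next change for the same id, or a far-future
# sentinel (2999-12-31, as in the table above) for the current status
with_enddate = df.withColumn(
    "enddate",
    F.coalesce(F.lead("eventdate").over(w), F.lit("2999-12-31").cast("date"))
)

# Point-in-time lookup; a half-open interval avoids matching two rows
# on the day the status changes
with_enddate.where(
    (F.col("id") == 1) &
    (F.lit("2020-05-05").cast("date") >= F.col("eventdate")) &
    (F.lit("2020-05-05").cast("date") < F.col("enddate"))
).select("valid").show()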

Please let me know if the Spark approach makes sense at all, and what the best way is to find the status on any given calendar date for such a sparse dataset.

Thank you.

JQadrad
1 Answer


For Spark 2.4+ you can use sequence and then explode it to forward fill. I assumed your date is in the format yyyy-MM-dd.

df.show() #sample dataframe

#+---+-----+----------+
#| id|valid| eventdate|
#+---+-----+----------+
#|  1|false|2020-05-01|
#|  1| true|2020-05-06|
#|  2| true|2020-05-04|
#|  2|false|2020-05-07|
#|  2| true|2020-05-09|
#|  3|false|2020-05-11|
#+---+-----+----------+

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w=Window().partitionBy("id").orderBy(F.to_date("eventdate","yyyy-MM-dd"))

df.withColumn("lead", F.lead("eventdate").over(w))\
  .withColumn("sequence", F.when(F.col("lead").isNotNull(),
                                 F.expr("""sequence(to_date(eventdate),date_sub(to_date(lead),1), interval 1 day)"""))\
                                 .otherwise(F.array("eventdate")))\
 .select("id","valid",F.explode("sequence").alias("eventdate"))\
 .show(truncate=False)


#+---+-----+----------+
#|id |valid|eventdate |
#+---+-----+----------+
#|1  |false|2020-05-01|
#|1  |false|2020-05-02|
#|1  |false|2020-05-03|
#|1  |false|2020-05-04|
#|1  |false|2020-05-05|
#|1  |true |2020-05-06|
#|3  |false|2020-05-11|
#|2  |true |2020-05-04|
#|2  |true |2020-05-05|
#|2  |true |2020-05-06|
#|2  |false|2020-05-07|
#|2  |false|2020-05-08|
#|2  |true |2020-05-09|
#+---+-----+----------+
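
With the exploded result above stored in a dataframe (say `filled`, a hypothetical name), the point-in-time lookup from the question becomes a plain equality filter:

filled.where(
    (F.col("id") == 1) &
    (F.col("eventdate") == F.lit("2020-05-05").cast("date"))
).select("valid").show()

#+-----+
#|valid|
#+-----+
#|false|
#+-----+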
murtihash
  • Great answer, thank you. I modified it slightly to make the sequence run until `current_date()` by replacing the empty `lead` with the current date (see the sketch after these comments). Would you know how expensive this would get for 10,000 ids with a date range of more than a year? – JQadrad May 13 '20 at 21:43
  • Using explode on sequence at that scale will definitely not be a small operation, but I think it will outperform the alternative, which is to join a date range against a separate df of dates. I think it's your best bet. – murtihash May 13 '20 at 21:48
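
A rough sketch of the modification mentioned in the first comment, replacing the missing lead with the current date so that every id is filled up to current_date() (same sample df as in the answer):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("id").orderBy(F.to_date("eventdate", "yyyy-MM-dd"))

df.withColumn("lead",
              # next change for this id; if there is none, pretend it happens
              # tomorrow so the generated sequence runs up to current_date()
              F.coalesce(F.lead(F.to_date("eventdate", "yyyy-MM-dd")).over(w),
                         F.expr("date_add(current_date(), 1)")))\
  .withColumn("sequence",
              F.expr("sequence(to_date(eventdate), date_sub(lead, 1), interval 1 day)"))\
  .select("id", "valid", F.explode("sequence").alias("eventdate"))\
  .show(truncate=False)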