I have the following problem: a dataset that keeps track of status changes over time.
id  valid  eventdate
1   False  2020-05-01
1   True   2020-05-06
2   True   2020-05-04
2   False  2020-05-07
2   True   2020-05-09
3   False  2020-05-11
Goal:
SELECT valid FROM table WHERE id = 1 AND eventdate = "2020-05-05"
I need to know, for a given id, what the status is at any given date (between the first event date and today). For example, for id 1, valid was still False on the 5th of May.
In Pandas I have a solution like this, where I use pivot and ffill to fill in the empty values, and then melt to turn it back into a three-column dataframe.
from datetime import datetime
import pandas as pd
test_data = [
    [1, "False", "2020-05-01"],
    [1, "True", "2020-05-06"],
    [2, "True", "2020-05-04"],
    [2, "False", "2020-05-07"],
    [2, "True", "2020-05-09"],
    [3, "False", "2020-05-11"]
]
# Create inputframe
df = pd.DataFrame(test_data, columns=['id', 'valid', 'eventdate'])
df['id'] = df['id'].astype(str)
df['valid'] = df['valid'] == "True"
df['eventdate'] = pd.to_datetime(df['eventdate'])
print(df.head(6))
# id valid eventdate
# 0 1 False 2020-05-01
# 1 1 True 2020-05-06
# 2 2 True 2020-05-04
# 3 2 False 2020-05-07
# 4 2 True 2020-05-09
# 5 3 False 2020-05-11
# Create full time range as frame
timeframe = pd.date_range(start=min(df['eventdate']),
                          end=datetime.now().date())\
    .to_frame()\
    .reset_index(drop=True)\
    .rename(columns={0: 'eventdate'})
print(timeframe.head())
# eventdate
# 0 2020-05-01
# 1 2020-05-02
# 2 2020-05-03
# 3 2020-05-04
# 4 2020-05-05
# Merge timeframe into original frame
df = df.merge(timeframe,
              left_on='eventdate',
              right_on='eventdate',
              how='right')
print(df.sort_values('eventdate').head())
# id valid eventdate
# 0 1 False 2020-05-01
# 6 NaN NaN 2020-05-02
# 7 NaN NaN 2020-05-03
# 2 2 True 2020-05-04
# 8 NaN NaN 2020-05-05
# 1. Pivot to get dates on rows and ids as columns
# 2. Forward fill values per id
# 3. Fill remaining NaNs with False
df = df.pivot(index='eventdate',
              columns='id',
              values='valid')\
    .fillna(method='ffill')\
    .fillna(False)
print(df.head())
# id NaN 1 2 3
# eventdate
# 2020-05-01 False False False False
# 2020-05-02 False False False False
# 2020-05-03 False False False False
# 2020-05-04 False False True False
# 2020-05-05 False False True False
# Drop NaN column and reset the index
df = df.loc[:, df.columns.notnull()].reset_index()
# Melt the columns back
out = pd.melt(df,
              id_vars='eventdate',
              value_name='valid')
print(out.head(10))
# eventdate id valid
# 0 2020-05-01 1 False
# 1 2020-05-02 1 False
# 2 2020-05-03 1 False
# 3 2020-05-04 1 False
# 4 2020-05-05 1 False
# 5 2020-05-06 1 True
# 6 2020-05-07 1 True
# 7 2020-05-08 1 True
# 8 2020-05-09 1 True
# 9 2020-05-10 1 True
I am trying to achieve the same in Spark, but a forward fill function does not exist there. I know how to get the latest status per id:
w = Window().partitionBy("id").orderBy(F.col("eventdate").desc())
df.withColumn("rn", F.row_number().over(w)) \
    .where(F.col("rn") == 1) \
    .selectExpr("id", "valid", "eventdate AS last_change") \
    .dropna() \
    .show()
Pivoting can be done with:
df\
    .select(["id", "valid", "eventdate"])\
    .groupBy(["eventdate"])\
    .pivot("id")\
    .agg(F.min("valid"))\
    .drop('null')\
    .sort('eventdate')\
    .show()
To do the forward fill, I got this far by limiting the dataset to only one id:
import sys
from datetime import datetime
import pyspark.sql.functions as F
from pyspark.sql import Window
test_data = [
    [1, "False", "2020-05-01"],
    [1, "True", "2020-05-06"],
    [2, "True", "2020-05-04"],
    [2, "False", "2020-05-07"],
    [2, "True", "2020-05-09"],
    [3, "False", "2020-05-11"]
]
# Create dataframe
df = sc\
    .parallelize(test_data)\
    .toDF(("id", "valid", "eventdate"))\
    .withColumn("eventdate", F.to_date(F.to_timestamp("eventdate")))\
    .withColumn("valid", F.when(F.col("valid") == "True", 1).otherwise(0))
df.createOrReplaceTempView("df")
# Create event frame
event_dates = spark.sql("SELECT sequence(min(eventdate), CURRENT_DATE(), interval 1 day) as eventdate FROM df")\
    .withColumn("eventdate", F.explode(F.col("eventdate")))
# Join dates and data
df = df.join(event_dates, on='eventdate', how='right')
df2 = df.where(df.id == 1)\
    .join(event_dates, on='eventdate', how='right')\
    .withColumn('id', F.lit(1))
#df2.sort('eventdate').show()
# +----------+---+-----+
# | eventdate| id|valid|
# +----------+---+-----+
# |2020-05-01| 1| 0|
# |2020-05-02| 1| null|
# |2020-05-03| 1| null|
# |2020-05-04| 1| null|
# |2020-05-05| 1| null|
# |2020-05-06| 1| 1|
# |2020-05-07| 1| null|
# |2020-05-08| 1| null|
# |2020-05-09| 1| null|
# |2020-05-10| 1| null|
# |2020-05-11| 1| null|
# |2020-05-12| 1| null|
# |2020-05-13| 1| null|
# +----------+---+-----+
# Forward fill
window = Window.partitionBy('id')\
    .orderBy('eventdate')\
    .rowsBetween(-sys.maxsize, 0)
# Set filter
read_last = F.last(df2['valid'], ignorenulls=True).over(window)
df2.withColumn("ffill", read_last).show()
# +----------+---+-----+-----+
# | eventdate| id|valid|ffill|
# +----------+---+-----+-----+
# |2020-05-01| 1| 0| 0|
# |2020-05-02| 1| null| 0|
# |2020-05-03| 1| null| 0|
# |2020-05-04| 1| null| 0|
# |2020-05-05| 1| null| 0|
# |2020-05-06| 1| 1| 1|
# |2020-05-07| 1| null| 1|
# |2020-05-08| 1| null| 1|
# |2020-05-09| 1| null| 1|
# |2020-05-10| 1| null| 1|
# |2020-05-11| 1| null| 1|
# |2020-05-12| 1| null| 1|
# |2020-05-13| 1| null| 1|
# +----------+---+-----+-----+
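(As a side note, I believe the unbounded frame above can equivalently be written with the built-in Window constants instead of -sys.maxsize:)
# Presumably equivalent frame specification using the Window constants
window = Window.partitionBy('id')\
    .orderBy('eventdate')\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)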
I think the first question is whether this approach is the right one at all. The pivot creates a long table with one column per id, in which a lot of redundant data is stored. Perhaps Spark is not the right tool for the problem, or rather the problem does not lend itself to Spark. I know that ideally you would want to exploit the parallel processing, perhaps by broadcasting the timeframe to all nodes and calculating the forward fill per id on each node?
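Generalizing my single-id attempt, I imagine it would look something like this (an untested sketch; here events stands for the original six-row dataframe built from test_data, i.e. before the right join above, and event_dates is the exploded date range):
# Build every (id, eventdate) combination, left-join the actual events,
# then forward fill per id.
all_combinations = event_dates.crossJoin(events.select("id").distinct())

fill_window = Window.partitionBy("id")\
    .orderBy("eventdate")\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

filled = all_combinations\
    .join(events, on=["id", "eventdate"], how="left")\
    .withColumn("valid", F.last("valid", ignorenulls=True).over(fill_window))\
    .fillna({"valid": 0})

filled.orderBy("id", "eventdate").show()
The final fillna defaults the days before an id's first recorded event to 0 (False).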
Would it be better to use a different approach, for example storing the enddate of each event and then querying like this:
id  valid  eventdate   enddate
1   False  2020-05-01  2020-05-06
1   True   2020-05-06  2999-12-31
2   True   2020-05-04  2020-05-07
2   False  2020-05-07  2020-05-08
2   True   2020-05-09  2999-12-31
3   False  2020-05-11  2999-12-31
and
SELECT valid FROM table WHERE id = 1 AND "2020-05-05" between eventdate and enddate
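For completeness, I think the enddate column itself could be derived from the raw events with a lead window, roughly like this (again an untested sketch on the same events dataframe; the 2999-12-31 sentinel for the open-ended interval is taken from the table above, and because lead() makes the intervals half-open the lookup uses >= and < rather than BETWEEN):
# Derive each status interval: it ends where the next event for the same id begins.
interval_window = Window.partitionBy("id").orderBy("eventdate")

intervals = events.withColumn(
    "enddate",
    F.coalesce(F.lead("eventdate").over(interval_window),
               F.lit("2999-12-31").cast("date")))

intervals.createOrReplaceTempView("status_intervals")
spark.sql("""
    SELECT valid
    FROM status_intervals
    WHERE id = 1
      AND eventdate <= to_date('2020-05-05')
      AND enddate   >  to_date('2020-05-05')
""").show()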
Please let me know whether the Spark approach makes sense at all, and what the best way is to find the status at any given calendar date for such a sparse dataset.
Thank you.