We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic readable by expressing it in native Python; Fugue then ports it to Spark for you with a single function call.
First, we start with a test Pandas DataFrame (we'll port it to Spark later):
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "2020-01-03"] * 3,
    "period": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "val": [4, 5, 2] * 3,
})
Then we write a Pandas-based function. Notice it is meant to be applied per group; we will handle the partitioning later.
def rolling(df: pd.DataFrame) -> pd.DataFrame:
    # Rolling sum over a 2-row window; the first row of each group has no
    # preceding row, so its NaN is filled with the row's own "val".
    df["cum_sum"] = df["val"].rolling(2).sum().fillna(df["val"])
    return df
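To make "applied per group" concrete, here is a rough plain-Pandas equivalent of what Fugue will do for us below (the groupby-apply and the name preview are our own illustration, not part of the Fugue workflow):

# Sort by date, then run the function on each period group separately.
# The copy() keeps this sketch from mutating the original df.
preview = (df.sort_values("date")
             .groupby("period", group_keys=False)
             .apply(lambda g: rolling(g.copy())))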
Now we can use Fugue's transform function to test on Pandas. This function also handles the partitioning and presort.
from fugue import transform

transform(
    df,
    rolling,
    schema="*, cum_sum:float",
    partition={"by": "period", "presort": "date asc"},
)
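Since no engine is given, transform runs on Pandas and returns a Pandas DataFrame, so we can inspect the result directly (a minimal sketch; result is just the name we chose):

# With a Pandas input and no engine, the output is a Pandas DataFrame.
result = transform(
    df,
    rolling,
    schema="*, cum_sum:float",
    partition={"by": "period", "presort": "date asc"},
)
print(result)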
Because this works, we can bring it to Spark just by specifying the engine:
import fugue_spark

transform(
    df,
    rolling,
    schema="*, cum_sum:float",
    partition={"by": "period", "presort": "date asc"},
    engine="spark",
).show()
+----------+------+---+-------+
| date|period|val|cum_sum|
+----------+------+---+-------+
|2020-01-01| 0| 4| 4.0|
|2020-01-02| 0| 5| 9.0|
|2020-01-03| 0| 2| 7.0|
|2020-01-01| 1| 4| 4.0|
|2020-01-02| 1| 5| 9.0|
|2020-01-03| 1| 2| 7.0|
|2020-01-01| 2| 4| 4.0|
|2020-01-02| 2| 5| 9.0|
|2020-01-03| 2| 2| 7.0|
+----------+------+---+-------+
Notice you need .show() now because of Spark's lazy evaluation. The Fugue transform function can take in both Pandas and Spark DataFrames and will output a DataFrame that matches the engine: with engine="spark", the result is a Spark DataFrame.
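The engine can also be an existing SparkSession rather than the "spark" string. Here is a hedged sketch (spark and sdf are our own names):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Passing the session as the engine makes transform return a Spark DataFrame.
sdf = transform(
    df,
    rolling,
    schema="*, cum_sum:float",
    partition={"by": "period", "presort": "date asc"},
    engine=spark,
)
sdf.show()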