
I want to collect a sample based on age, with a condition on the failure status. I am interested in 3 days of data per serial number. However, I don't need healthy serial numbers that are less than 3 days old, but I do want to include all failed serial numbers that are less than or exactly 3 days old. For example, C failed on January 3rd, so I need to include January 1st and 2nd for serial C in my new sample. Serial D failed on January 4th, so I need the January 3rd, 2nd, and 1st data for D. For A and B, I need the January 5th, 4th, and 3rd data, which is 3 days in total. I don't need E and F, as they are healthy observations younger than 3 days. In summary, I need failed samples with the 3 days before the actual failure, plus the most recent 3 days of healthy observations.

url="https://gist.githubusercontent.com/JishanAhmed2019/6625009b71ade22493c256e77e1fdaf3/raw/8b51625b76a06f7d5c76b81a116ded8f9f790820/FailureSample.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
df=spark.read.csv(SparkFiles.get("FailureSample.csv"), header=True,sep='\t')
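The date column is read as a string here; converting it to a real date (assuming an MM/dd/yyyy format, as in the answer below) would look like:

import pyspark.sql.functions as F

# Assumption: dates are formatted like 01/03/2022 (MM/dd/yyyy)
df = df.withColumn("date", F.to_date("date", "MM/dd/yyyy"))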

Current format:

[image: current data format]

Expected sample:

[image: expected sample]

– ForestGump

1 Answer


The description is a bit tricky for me, and I am not sure if I understood it correctly.

I tried to do it with window functions and was able to get similar results, but I am not sure if this code is good enough.

import pyspark.sql.functions as F
from pyspark.sql import Window

# Use the legacy (Spark 2.x) date parser so the MM/dd/yyyy pattern parses without strict-mode errors
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

inputDf = spark.read.csv(
    "dbfs:/FileStore/shared_uploads/****.com/Observations_stackoverflow.txt",
    header=True,
    sep="\t",
).withColumn("date", F.to_date("date", "MM/dd/yyyy"))

# Per-serial aggregates: first and last observation date, and whether the serial ever failed
window = Window.partitionBy("serial_number")
aggregatedDf = (
    inputDf.withColumn("max_date", F.max("date").over(window))
    .withColumn("min_date", F.min("date").over(window))
    .withColumn("withFailure", F.max("Failure").over(window))
)

# For failed serials: keep up to the last 4 records (the failure day plus the 3 days before it)
# For healthy serials: skip those with entries spanning fewer than 3 days, and keep up to the last 3 records
filteredDf = aggregatedDf.filter(
    ((F.col("withFailure") == "1") & (F.datediff(F.col("max_date"), F.col("date")) < 4))
    | (
        (F.col("withFailure") == "0")
        & (F.datediff(F.col("max_date"), F.col("min_date")) > 3)
        & (F.datediff(F.col("max_date"), F.col("date")) < 3)
    )
)

filteredDf.drop("max_date", "min_date", "withFailure").show()

Output is:

+----------+-------------+---------+---------+-------+
|      date|serial_number|Feature 1|Feature 2|Failure|
+----------+-------------+---------+---------+-------+
|2022-01-03|            A|      171|       76|      0|
|2022-01-04|            A|      241|      100|      0|
|2022-01-05|            A|      311|      124|      0|
|2022-01-03|            B|      188|       82|      0|
|2022-01-04|            B|      258|      106|      0|
|2022-01-05|            B|      328|      130|      0|
|2022-01-01|            C|       83|       10|      0|
|2022-01-02|            C|      136|       64|      0|
|2022-01-03|            C|      223|       94|      1|
|2022-01-01|            D|       80|       47|      0|
|2022-01-02|            D|      153|       70|      0|
|2022-01-03|            D|      206|       88|      0|
|2022-01-04|            D|      293|      118|      1|
+----------+-------------+---------+---------+-------+

In the output only the ordering differs; the records are the same as in the sample output.
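If the ordering should match the expected sample exactly, an explicit sort can be appended (a minimal sketch reusing the columns above):

filteredDf.drop("max_date", "min_date", "withFailure").orderBy("serial_number", "date").show()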

– M_S
  • It seems to work. What did you mean here? (F.col("withFailure") == "0") & (F.datediff(F.col("max_date"), F.col("min_date")) > 3) & (F.datediff(F.col("max_date"), F.col("date")) < 3) – ForestGump Jan 30 '23 at 00:54
  • Here, for successful records, I am restricting the number of records based on the difference between the max and min dates (so take serials which have 3 or more records and skip those with, for example, 2), and then I am checking max_date against date (take records which are not older than 3 days). See the datediff check after these comments. – M_S Jan 30 '23 at 09:26
  • ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it. I got this error when running the code on 72 million observations! – ForestGump Jan 31 '23 at 05:27
  • Do you get this error right after the job starts, or later during computation? – M_S Jan 31 '23 at 10:43
  • During computation. It works fine with this toy data but not with the real data. (A possible scale-oriented rewrite is sketched below.) – ForestGump Jan 31 '23 at 23:46
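For reference, a minimal check of the datediff boundary described in the comments above (toy dates assumed):

import pyspark.sql.functions as F

# datediff(Jan 5, Jan 1) = 4, so a healthy serial observed from Jan 1 to Jan 5 passes the "> 3" condition
spark.createDataFrame([("2022-01-05", "2022-01-01")], ["max_date", "date"]).select(
    F.datediff(F.col("max_date"), F.col("date")).alias("diff")
).show()  # -> 4

On the scale issue: one possible direction, a sketch only and not a tested fix for the connection error, is to replace the unordered window with a groupBy plus join, which computes the per-serial aggregates once per serial instead of carrying window state per row. Column names are assumed to match the answer's inputDf:

# Sketch: same aggregates as the window version, computed via groupBy + join
perSerial = inputDf.groupBy("serial_number").agg(
    F.max("date").alias("max_date"),
    F.min("date").alias("min_date"),
    F.max("Failure").alias("withFailure"),
)
aggregatedDf = inputDf.join(perSerial, "serial_number")
# ...then apply the same filter as above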