
I need to pick data at an hourly timestamp frequency. Since a row with the exact hour is sometimes not available, I need to pick the row whose timestamp is nearest to that time. This is the dataframe I have:

| job_id| timestamp                   |item_values   |
|:----  |:---------------------------:|:------------:|
| job1  | 2022-02-15T23:40:00.000+0000|[0.2,3.4,13.2]|       
| job1  | 2022-02-15T23:10:00.000+0000|[0.1,2.9,11.2]|
| job2  | 2022-02-15T23:40:00.000+0000|[1.2,3.1,16.0]|
| job1  | 2022-02-15T23:20:00.000+0000|[0.4,0.4,16.2]|       
| job2  | 2022-02-15T23:10:00.000+0000|[0.7,8.4,11.2]|
| job2  | 2022-02-15T23:20:00.000+0000|[0.3,1.5,19.1]|  
| job3  | 2022-02-15T23:20:00.000+0000|[0.7,7.4,13.2]|
| job3  | 2022-02-15T23:40:00.000+0000|[0.7,7.4,13.2]|
| job3  | 2022-02-15T23:10:00.000+0000|[0.7,7.4,13.2]|  

For example,
if I need to pick this time below

2023-04-25T22:00:00.000+0000

and it's not available in the table, but 2023-04-25T22:10:00.000+0000 is the closest time to it, then I want to pick the data with the timestamp 2023-04-25T22:10:00.000+0000.

The output I want is:

| job_id| timestamp                   |item_values   |
|:----  |:---------------------------:|:------------:|       
| job1  | 2022-02-15T23:10:00.000+0000|[0.1,2.9,11.2]|       
| job2  | 2022-02-15T23:10:00.000+0000|[0.7,8.4,11.2]|
| job3  | 2022-02-15T23:10:00.000+0000|[0.7,7.4,13.2]|

How can I achieve this using PySpark? Any help is greatly appreciated!

2 Answers


You want to obtain the rows with the closest timestamp to a predefined timestamp of your choice. There are two ways I can think of to solve this task, although both share the same core idea: create a column with the absolute difference between the timestamp column and the timestamp you are looking for, then select the row(s) with the minimum value in this new column.

Edit: Obtaining timestamp difference is explained here.

Approach 1

Identify the closest timestamp value in the dataset to your target timestamp, then filter the dataframe for equality with the timestamp you found (a sketch follows the steps below):

  1. Select only your timestamp column and apply a distinct(): we are only interested in finding one existing value.
  2. Add a column "difference" as the absolute value of col('timestamp') - lit(PREDEFINED_TIMESTAMP_VALUE). This new column represents the difference between the timestamp of each row and the target timestamp value.
  3. Select the row with the minimum of this new column. You just found the closest timestamp present in your dataset to the one you were looking for.
  4. Filter the original dataframe imposing timestamp == "TIMESTAMP_FOUND"
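
A minimal sketch of Approach 1 in PySpark (the target value, the format string, and the dataframe name df are assumptions here; adjust them to your data):

from pyspark.sql import functions as F

# Hypothetical target timestamp and a parse pattern matching the sample strings
target = "2022-02-15T23:00:00.000+0000"
fmt = "yyyy-MM-dd'T'HH:mm:ss.SSSZ"

# 1. Distinct timestamp values only
ts_df = df.select("timestamp").distinct()

# 2. Absolute difference in seconds between each timestamp and the target
ts_df = ts_df.withColumn(
    "difference",
    F.abs(F.unix_timestamp("timestamp", fmt) - F.unix_timestamp(F.lit(target), fmt))
)

# 3. Closest existing timestamp to the target
closest_ts = ts_df.orderBy("difference").first()["timestamp"]

# 4. Filter the original dataframe on that timestamp
df.filter(F.col("timestamp") == closest_ts).show(truncate=False)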

Approach 2

Pretty similar to the first approach, but working directly on your original dataset (a sketch follows the steps below):

  1. Directly add a column "difference" as the absolute value of col('timestamp') - lit(PREDEFINED_TIMESTAMP_VALUE). This new column represents the difference between the timestamp of each row and the target timestamp value.
  2. Select the row(s) with the minimum of this new column. You just found the closest timestamp present in your dataset to the one you were looking for.
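
And a corresponding sketch of Approach 2, under the same assumptions about the target value, format string, and dataframe name:

from pyspark.sql import functions as F

# Same hypothetical target and parse pattern as above
target = "2022-02-15T23:00:00.000+0000"
fmt = "yyyy-MM-dd'T'HH:mm:ss.SSSZ"

# 1. Difference column added directly to the original dataframe
df2 = df.withColumn(
    "difference",
    F.abs(F.unix_timestamp("timestamp", fmt) - F.unix_timestamp(F.lit(target), fmt))
)

# 2. Keep the row(s) whose difference equals the overall minimum
min_diff = df2.agg(F.min("difference")).first()[0]
df2.filter(F.col("difference") == min_diff).drop("difference").show(truncate=False)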

I think approach 1 should be more efficient, but approach 2 is more direct to implement. Let me know if it works.

Sergio

You can use a PySpark UDF in the following way to get the desired result, similar to the approach suggested by @Sergio. The initial dataframe I have is shown below:

data = [["job1", "2022-02-15T23:40:00.000+0000", [0.2,3.4,13.2]],["job1", "2022-02-15T23:10:00.000+0000", [0.1,2.9,11.2]],["job2", "2022-02-15T23:40:00.000+0000", [1.2,3.1,16.0]],["job1", "2022-02-15T23:20:00.000+0000", [0.4,0.4,16.2]],["job2", "2022-02-15T23:10:00.000+0000", [0.7,8.4,11.2]],["job2", "2022-02-15T23:20:00.000+0000", [0.3,1.5,19.1]],["job3", "2022-02-15T23:20:00.000+0000", [0.7,7.4,13.2]],["job3", "2022-02-15T23:40:00.000+0000", [0.7,7.4,13.2]],["job3", "2022-02-15T23:10:00.000+0000", [0.7,7.4,13.2]]]
columns = ["job_id", "timestamp", "item_values"]
df = spark.createDataFrame(data, columns)
df.show(truncate=False)


Now, use the code shown below to apply the UDF and filter as required:

key = "2022-02-15T23:18:00.000+0000"
frmt = "%Y-%m-%dT%H:%M:%S.%f%z"

def diff_ts(ip):
    from datetime import datetime

    t1 = datetime.strptime(key, frmt).timestamp()
    t2 = datetime.strptime(ip, frmt).timestamp()
    return t1-t2

from pyspark.sql.functions import col, udf,abs,min
from pyspark.sql.types import *

convertUDF = udf(lambda z: diff_ts(z),DoubleType())


df1=df.withColumn("new",abs(convertUDF(col("timestamp"))))
min_val = df1.select(min("new")).first()[0]
df1.filter(col("new")==min_val).select("job_id", "timestamp", "item_values").show(truncate=False)

This returns the rows with timestamp 2022-02-15T23:20:00.000+0000 for each job_id, since that is the closest timestamp in the data to the chosen key.

Saideep Arikontham
  • Hi, thank you for your answer! Do you know how to get the same result when we don't have a key? I want to use the date_trunc('hour', df.timestamp) command as the key value. – MMV May 05 '23 at 21:14