I need to pick a data with a timestamp hourly frequency. But since sometimes timestamp is not available with the exact hour so I need to pick the data with timestamp nearest to the time. This is the dataframe I have below.
| job_id| timestamp |item_values |
|:---- |:---------------------------:|:------------:|
| job1 | 2022-02-15T23:40:00.000+0000|[0.2,3.4,13.2]|
| job1 | 2022-02-15T23:10:00.000+0000|[0.1,2.9,11.2]|
| job2 | 2022-02-15T23:40:00.000+0000|[1.2,3.1,16.0]|
| job1 | 2022-02-15T23:20:00.000+0000|[0.4,0.4,16.2]|
| job2 | 2022-02-15T23:10:00.000+0000|[0.7,8.4,11.2]|
| job2 | 2022-02-15T23:20:00.000+0000|[0.3,1.5,19.1]|
| job3 | 2022-02-15T23:20:00.000+0000|[0.7,7.4,13.2]|
| job3 | 2022-02-15T23:40:00.000+0000|[0.7,7.4,13.2]|
| job3 | 2022-02-15T23:10:00.000+0000|[0.7,7.4,13.2]|
For example,
if I need to pick this time below
2023-04-25T22:00:00.000+0000
and it's not available in the table but 2023-04-25T22:10:00.000+0000
is the closet time to 2023-04-25T22:00:00.000+0000
, I want to pick the data with the timestamp 2023-04-25T22:10:00.000+0000
.
the output I want is,
| job_id| timestamp |item_values |
|:---- |:---------------------------:|:------------:|
| job1 | 2022-02-15T23:10:00.000+0000|[0.1,2.9,11.2]|
| job2 | 2022-02-15T23:10:00.000+0000|[0.7,8.4,11.2]|
| job3 | 2022-02-15T23:10:00.000+0000|[0.7,7.4,13.2]|
How can I achieve this using Pyspark? Any help is greatly appreciated!