I am using Databricks Delta with Tempo to aggregate timeseries data. The system records data every second, and I need to aggregate it to show hour-level values for each day, as well as day-level values.
I have a dataset in the following format, with one row per second. Below is my current code to retrieve the hour-level aggregate.
| Time (String) | Pressure |
|---|---|
| 02-01-2023 00:00:00 | 2720.78 |
| 02-01-2023 00:00:01 | 2720.7 |
The time is in string format and Pressure is a double. I want to aggregate the data at the hour level and the day level using Databricks Delta - Tempo with PySpark.
The approach I used was as follows: since a TSDF needs a partition column and a timeseries column, I converted the Time (String) column to a timestamp and, using concat, created a day+hour partition column.
csvdata_df = (
    spark.read.csv("resources\PTGauge.csv", inferSchema=True, header=True)
    .withColumn("TimeStampUTC", to_timestamp("Time (Utc)", "MM-dd-yyyy HH:mm:ss"))
    .withColumn(
        "PartitionDayHourMin",
        concat(
            lpad(dayofmonth("TimeStampUTC"), 2, "0"),
            lpad(hour("TimeStampUTC"), 2, "0").cast("string"),
        ),
    )
)
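As a sanity check on the format string, here is how I understand "MM-dd-yyyy HH:mm:ss" to parse my Time strings, illustrated with the plain-Python stdlib equivalent (this is not Spark or Tempo code, just my reading of the format):

```python
from datetime import datetime

# Stdlib equivalent of to_timestamp(..., "MM-dd-yyyy HH:mm:ss"):
# %m is the month and %d is the day.
ts = datetime.strptime("02-01-2023 00:00:01", "%m-%d-%Y %H:%M:%S")
print(ts)

# The day+hour partition key I build with concat(lpad(dayofmonth(...)), lpad(hour(...))):
partition_key = f"{ts.day:02d}{ts.hour:02d}"
print(partition_key)
```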
This is the resultant data frame:
+-------------------+--------+-------------------+-------------------+
|           Time_Utc|Pressure|       TimeStampUTC|PartitionDayHourMin|
+-------------------+--------+-------------------+-------------------+
|02-01-2023 00:00:00| 2720.78|2023-01-02 00:00:00|               0200|
|02-01-2023 00:00:01|  2720.7|2023-01-02 00:00:01|               0200|
+-------------------+--------+-------------------+-------------------+
tempo_tsdf = TSDF(csvdata_df, ts_col="TimeStampUTC", partition_cols=["PartitionDayHourMin"])
interpolated_tsdf1 = tempo_tsdf.interpolate(
    freq="day",
    func="mean",
    target_cols=["Pressure"],
    method="linear",
)
print(">>> Using TSDF interpolate show")
interpolated_tsdf1.df.show(100)
+-------------------+-------------------+------------------+
|PartitionDayHourMin|       TimeStampUTC|          Pressure|
+-------------------+-------------------+------------------+
|               0100|2023-01-31 05:30:00|2700.7904972222236|
|               0101|2023-01-31 05:30:00|2730.1101388888983|
|               0102|2023-01-31 05:30:00|2772.0431388888937|
|               0103|2023-01-31 05:30:00| 2715.572372222218|
|               0104|2023-01-31 05:30:00|2759.0048597611717|
|               0105|2023-01-31 05:30:00|2797.0757055555596|
|               0105|2023-02-01 05:30:00|2778.9690166666683|
|               0106|2023-02-01 05:30:00| 2754.678494026128|
+-------------------+-------------------+------------------+
So PartitionDayHourMin shows the split by day and hour, but TimeStampUTC does not carry a value matching the respective hour and minute of the data, so I cannot plot the data against time. Can someone guide me on what I am missing here? I am also not sure why freq="day" produces this result rather than freq="hr", since I am looking for hourly data for each day.
Also, can we do the same using resample? I tried the following, but the result was not as expected:
start_ts = '2023-02-01T00:00:00.000+0000'
end_ts = '2023-02-01T23:59:59.000+0000'
interval_inclusive = tempo_tsdf.between(start_ts, end_ts)
resampled_sdf = interval_inclusive.resample(freq='hr', func='mean')
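For reference, this is the kind of hourly mean I am expecting, sketched here with plain Python over toy per-second readings (toy values and stdlib grouping only, not my real dataset or the Tempo API):

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Toy per-second pressure readings spanning two hours (values are made up).
start = datetime(2023, 2, 1, 0, 0, 0)
readings = [(start + timedelta(seconds=i), 2700.0 + i * 0.01) for i in range(7200)]

# Group readings by (day, hour) and average the pressure in each bucket --
# one mean pressure value per hour of each day is what I am after.
buckets = defaultdict(list)
for ts, pressure in readings:
    buckets[(ts.day, ts.hour)].append(pressure)

hourly_mean = {key: sum(vals) / len(vals) for key, vals in buckets.items()}
for (day, hour), mean in sorted(hourly_mean.items()):
    print(f"day={day:02d} hour={hour:02d} mean_pressure={mean:.2f}")
```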