
TL;DR: Is there a function/package in R which allows plotting/comparing two time-series data sets of different lengths? (Not plotting two time-series lines over time, but using the values of one dataset as x and the other as y.) This is what the resulting plot could look like:

Thermocouple plot

For this problem, let's imagine I have two sensors in a lab, simultaneously measuring the temperature and voltage of a thermocouple. This data is stored as time-series data. But because the data is measured by two different devices, the timestamps don't match and the sampling frequencies differ too. Now I want to match this data in order to plot voltage against temperature (a standard plot when evaluating thermocouples). How would I do this?

I created some sample data:

# temperature: one reading every 4 seconds (15 values)
x_datetime <- as.POSIXct(paste("2022-10-21 10:00:", sprintf("%02.0f", seq(0, 59, 4)), sep = ""))
x_values <- seq(3, 17, 1)
# voltage: one reading every 3 seconds (20 values), offset by 1 second
y_datetime <- as.POSIXct(paste("2022-10-21 10:00:", sprintf("%02.0f", seq(1, 59, 3)), sep = ""))
y_values <- seq(0.4, 4.2, 0.2)

x <- data.frame(x_datetime, x_values)
y <- data.frame(y_datetime, y_values)

A solution I could think of is calculating the mean over a given timeframe (e.g. 10 seconds) and using that to plot the data. To calculate the average, I found a neat solution here. For x, the code looks like this:

# Variant 1, using dplyr
library("dplyr")
# round each timestamp to the nearest 10-second bucket, then average per bucket
x$time_bucket <- as.POSIXct(round(as.numeric(x$x_datetime) / 10) * 10, origin = '1970-01-01')
x_means <- x %>% 
  group_by(time_bucket) %>%
  summarize(mean(x_values))

# Variant 2, using data.table and lubridate
library("data.table")
library("lubridate")
x_dat <- as.data.table(x)
# round timestamps to 10-second buckets and take the mean of every column per bucket
x_dat <- x_dat[, lapply(.SD, mean), .(x_datetime = round_date(x_datetime, "10 seconds"))]

This solution requires manually choosing the timeframe (in this example 10 seconds) over which the mean is calculated.

Since this problem looks like it should be a quite common one, I was thinking there should be an easier, more easily automatable and maybe less resource-intensive way to compare those datasets.

As always, thank you for your help!

Jan
  • I don't know if it makes sense in your example, but perhaps have a look at `rolling joins` from `data.table`. With this you could merge the two data sets based on the time variable even if the timestamps do not match 100%, but it would duplicate some of the `x_values`. – Gilean0709 Oct 21 '22 at 13:19
  • @Gilean0709 That's a really interesting function! To avoid the duplication, I would then preferably roll y up to x, which would drop the surplus y values. This is a really good solution to maximize the resolution of the resulting data set, at the cost of sacrificing some of its accuracy, because values are dropped rather than averaged. Thank you, I will look further into that :) – Jan Oct 21 '22 at 13:26
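
For reference, a rolling join along the lines of Gilean0709's suggestion could look roughly like this. This is only a sketch using the sample data frames x and y defined above; `roll = "nearest"` is just one possible choice, rolling forward or backward would also work:

library(data.table)

x_dt <- as.data.table(x)
y_dt <- as.data.table(y)

# for every x timestamp, pick the y row with the nearest timestamp;
# surplus y rows are dropped rather than averaged
matched <- y_dt[x_dt, on = .(y_datetime = x_datetime), roll = "nearest"]

plot(matched$x_values, matched$y_values,
     xlab = "Temperature", ylab = "Voltage")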

1 Answer


When dealing with irregular time series, xts is your friend!

library(xts)

# 10-second means of the temperature series, aligned to a common 10-second grid
x_xts <- xts(x$x_values, order.by = x$x_datetime)
period.apply(x_xts, endpoints(x_xts, 'seconds', 10), mean) |> 
  align.time(10) |> 
  print()

# same for the voltage series
y_xts <- xts(y$y_values, order.by = y$y_datetime)
period.apply(y_xts, endpoints(y_xts, 'seconds', 10), mean) |> 
  align.time(10) |> 
  print()
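
To get the plot from the question, the two aligned series can then be merged on the common 10-second grid and plotted against each other. A sketch building on the code above (column order in the merge result is assumed to be temperature first, voltage second):

x_10s <- align.time(period.apply(x_xts, endpoints(x_xts, 'seconds', 10), mean), 10)
y_10s <- align.time(period.apply(y_xts, endpoints(y_xts, 'seconds', 10), mean), 10)

# keep only the 10-second buckets present in both series
xy <- merge(x_10s, y_10s, join = "inner")
plot(as.numeric(xy[, 1]), as.numeric(xy[, 2]),
     xlab = "Temperature", ylab = "Voltage")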

br00t
  • This looks clean! For calculating the mean I prefer this version over the two others I've seen. Do you know how resource-intensive xts is compared to data.frame and data.table? If someone records data at 1000 Hz or similar, I might have to deal with files of more than a million rows. – Jan Oct 21 '22 at 14:15
  • I haven't benchmarked it, but my guess is that while the `xts` version looks cleaner, `data.table` would likely be the fastest. `xts` very nicely abstracts away a lot of the misery associated with irregular time series, but when working with very large datasets you may want to go with `data.table`. – br00t Oct 21 '22 at 14:17
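
One way to check this for your own data sizes is a quick benchmark of the two aggregation approaches. This is only a sketch under the assumption that the `x` and `x_xts` objects and the 10-second bucketing from above are in scope, and it uses the `microbenchmark` package:

library(microbenchmark)
library(data.table)
library(lubridate)
library(xts)

x_dt <- as.data.table(x)

# compare 10-second bucket means via data.table vs. xts
microbenchmark(
  data.table = x_dt[, .(mean_val = mean(x_values)),
                    by = .(bucket = round_date(x_datetime, "10 seconds"))],
  xts        = period.apply(x_xts, endpoints(x_xts, 'seconds', 10), mean),
  times      = 100
)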