I've got timestamped data recording how many observations of a certain type were made at a certain time. For the sake of illustration, suppose I'm a birdwatcher observing flocks of duck and geese, recording -- each time I observe a flock -- the time of observation, the number of birds in the flock, and the type of bird.
What I would like to do is aggregate this data and record how many birds of each type were observed, in total, in each 30-minute period. I've found the tsibble
package from tidyverts
(not to be confused with tidyverse
); tsibbles are (basically) time-series tibbles and can be grouped (e.g. by bird) as well as re-indexed.
Indices in tsibbles must be unique (that's timestamps in my case); that is understandable. However, this is enforced before it's possible to use group_by()
. If I happen to observe both ducks and geese at exactly the same time, I'm apparently out of luck.
Here is some sample code showing the problem:
library(tidyverse) # includes lubridate
library(tsibble)
N <- 100
set.seed(42)
# suppose we're observing ducks and geese between 8:00 and 18:00.
d <- as_datetime("2023-03-08 08:00:00")
times <- d + seconds(unique(round(sort(runif(N, min = 0, max = 36e3)))))
nObs <- 1 + rpois(length(times), lambda = 1)
birdIdx <- 1 + round(runif(length(times)))
birds <- c("Duck", "Goose")[birdIdx]
# Tibble of observations
waterfowl <- tibble(Timestamp = times, Count = nObs, Bird = as_factor(birds))
# Convert to tsibble (time series tibble) and aggregate on a 30-minute basis
waterfowl |>
as_tsibble(index = Timestamp) |>
group_by(Bird) |>
index_by(~ floor_date(., "30 minute")) |>
summarize(`Total birds` = sum(Count)) |>
print(n = Inf)
# Let's create a collision by observing both ducks and geese at the same time.
waterfowl |>
bind_rows(tibble(Timestamp = times[[1]], Count = 1, Bird = c("Duck", "Goose")[3 - birdIdx[[1]]])) |>
as_tsibble(index = Timestamp) |>
group_by(Bird) |>
index_by(~ floor_date(., "30 minute")) |>
summarize(`Total birds` = sum(Count)) |>
print(n = Inf)
The first conversion runs fine since there are, in fact, no duplicates in the observation timestamps; the unique()
when creating the times
vector guarantees this.
The second conversion where I deliberately introduce a duplicate timestamp does not, and instead produces
Error in `validate_tsibble()`:
! A valid tsibble must have distinct rows identified by key and index.
ℹ Please use `duplicates()` to check the duplicated rows.
Is there a way around this? How do I handle this in an efficient manner, one that I can use when I have tens of millions of rows and several grouping variables with many levels (and with many combinations of levels not occurring)?
Thank you!