
I've got timestamped data recording how many observations of a certain type were made at a certain time. For the sake of illustration, suppose I'm a birdwatcher observing flocks of ducks and geese, recording -- each time I observe a flock -- the time of observation, the number of birds in the flock, and the type of bird.

What I would like to do is aggregate this data and record how many birds of each type were observed, in total, in each 30-minute period. I've found the tsibble package from tidyverts (not to be confused with tidyverse); tsibbles are (basically) time-series tibbles and can be grouped (e.g. by bird) as well as re-indexed.

Indices in tsibbles (timestamps, in my case) must be unique; that is understandable. However, uniqueness is enforced before it's possible to use group_by(). If I happen to observe both ducks and geese at exactly the same time, I'm apparently out of luck.

Here is some sample code showing the problem:

library(tidyverse) # includes lubridate
library(tsibble)

N <- 100
set.seed(42)

# suppose we're observing ducks and geese between 8:00 and 18:00.
d       <- as_datetime("2023-03-08 08:00:00")
times   <- d + seconds(unique(round(sort(runif(N, min = 0, max = 36e3)))))
nObs    <- 1 + rpois(length(times), lambda = 1)
birdIdx <- 1 + round(runif(length(times)))
birds   <- c("Duck", "Goose")[birdIdx]

# Tibble of observations
waterfowl <- tibble(Timestamp = times, Count = nObs, Bird = as_factor(birds))

# Convert to tsibble (time series tibble) and aggregate on a 30-minute basis
waterfowl |>
    as_tsibble(index = Timestamp) |>
    group_by(Bird) |>
    index_by(~ floor_date(., "30 minute")) |>
    summarize(`Total birds` = sum(Count)) |>
    print(n = Inf)

# Let's create a collision by observing both ducks and geese at the same time.
waterfowl |>
    bind_rows(tibble(Timestamp = times[[1]], Count = 1, Bird = c("Duck", "Goose")[3 - birdIdx[[1]]])) |>
    as_tsibble(index = Timestamp) |>
    group_by(Bird) |>
    index_by(~ floor_date(., "30 minute")) |>
    summarize(`Total birds` = sum(Count)) |>
    print(n = Inf)

The first conversion runs fine since there are, in fact, no duplicates in the observation timestamps; the unique() when creating the times vector guarantees this.

The second conversion where I deliberately introduce a duplicate timestamp does not, and instead produces

Error in `validate_tsibble()`:
! A valid tsibble must have distinct rows identified by key and index.
ℹ Please use `duplicates()` to check the duplicated rows.

Is there a way around this? How do I handle this in an efficient manner, one that I can use when I have tens of millions of rows and several grouping variables with many levels (and with many combinations of levels not occurring)?

Thank you!

BestGirl

1 Answer


The {tsibble} package is strict when it comes to measurements at points in time. Each series should be uniquely identified in time: if there are duplicate time points, it is like saying that at exactly 2023-03-08 08:04:44 there were both 2 geese and 3 geese. It is unclear which count is accurate, but it cannot be both 2 and 3 geese at that point in time. Your specific example is relatively easy to fix, as you are observing two different series -- the number of geese and the number of ducks. You can identify each series with key variables.

# as above from MRE

# Let's create a collision by observing both ducks and geese at the same time.
waterfowl_tsbl <- waterfowl |>
  bind_rows(tibble(Timestamp = times[[1]], Count = 1, Bird = c("Duck", "Goose")[3 - birdIdx[[1]]])) |> 
  as_tsibble(index = Timestamp, regular = FALSE, key = Bird)
waterfowl_tsbl
#> # A tsibble: 100 x 3 [!] <UTC>
#> # Key:       Bird [2]
#>    Timestamp           Count Bird 
#>    <dttm>              <dbl> <chr>
#>  1 2023-03-08 08:00:09     1 Duck 
#>  2 2023-03-08 08:04:44     4 Duck 
#>  3 2023-03-08 08:22:28     4 Duck 
#>  4 2023-03-08 08:23:22     3 Duck 
#>  5 2023-03-08 08:49:28     2 Duck 
#>  6 2023-03-08 08:51:22     1 Duck 
#>  7 2023-03-08 08:53:59     2 Duck 
#>  8 2023-03-08 09:23:14     2 Duck 
#>  9 2023-03-08 09:24:17     2 Duck 
#> 10 2023-03-08 09:34:45     2 Duck 
#> # ... with 90 more rows

Created on 2023-03-09 with reprex v2.0.2

Note that I've used the Bird column as the key variable, and I've set the time interval to be irregular as you have irregular observations (important for handling implicit missing values and producing forecasts).
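The error message also points at tsibble's `duplicates()` helper, which (together with `is_duplicated()`) lets you check collisions before conversion. A minimal sketch with toy data mirroring the MRE (not the MRE's actual vectors):

```r
library(dplyr)
library(tsibble)

# Toy data mirroring the MRE: two birds share one timestamp
t0 <- as.POSIXct("2023-03-08 08:00:09", tz = "UTC")
obs <- tibble(
  Timestamp = c(t0, t0, t0 + 60),
  Count     = c(1, 2, 3),
  Bird      = c("Duck", "Goose", "Duck")
)

# Without a key, the two rows at t0 collide:
duplicates(obs, index = Timestamp)

# With Bird as the key, every (Bird, Timestamp) pair is unique,
# so as_tsibble() will succeed:
is_duplicated(obs, key = Bird, index = Timestamp)
#> [1] FALSE
```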

If you are more interested in the total geese and ducks over 30-minute intervals, you can now summarise the tsibble as you have done above:

waterfowl_tsbl |>
  group_by(Bird) |>
  index_by(Time = floor_date(Timestamp, "30 minute")) |>
  summarize(`Total birds` = sum(Count)) |>
  print(n = Inf)
#> # A tsibble: 36 x 3 [30m] <UTC>
#> # Key:       Bird [2]
#>    Bird  Time                `Total birds`
#>    <chr> <dttm>                      <dbl>
#>  1 Duck  2023-03-08 08:00:00            12
#>  2 Duck  2023-03-08 08:30:00             5
#>  3 Duck  2023-03-08 09:00:00             4
#>  4 Duck  2023-03-08 09:30:00             4
#>  5 Duck  2023-03-08 10:00:00             4
#>  6 Duck  2023-03-08 10:30:00             6
#>  7 Duck  2023-03-08 11:00:00             8
#>  8 Duck  2023-03-08 11:30:00            10
#>  9 Duck  2023-03-08 12:00:00             1
#> 10 Duck  2023-03-08 12:30:00             6
#> 11 Duck  2023-03-08 13:00:00             6
#> 12 Duck  2023-03-08 13:30:00             5
#> 13 Duck  2023-03-08 14:00:00             6
#> 14 Duck  2023-03-08 14:30:00             3
#> 15 Duck  2023-03-08 15:00:00             5
#> 16 Duck  2023-03-08 15:30:00             4
#> 17 Duck  2023-03-08 16:00:00            17
#> 18 Duck  2023-03-08 16:30:00             1
#> 19 Duck  2023-03-08 17:00:00            13
#> 20 Duck  2023-03-08 17:30:00             5
#> 21 Goose 2023-03-08 08:00:00             8
#> 22 Goose 2023-03-08 09:00:00             6
#> 23 Goose 2023-03-08 09:30:00             2
#> 24 Goose 2023-03-08 10:00:00             6
#> 25 Goose 2023-03-08 10:30:00             5
#> 26 Goose 2023-03-08 11:30:00             5
#> 27 Goose 2023-03-08 12:00:00             5
#> 28 Goose 2023-03-08 12:30:00             2
#> 29 Goose 2023-03-08 13:00:00             3
#> 30 Goose 2023-03-08 13:30:00             2
#> 31 Goose 2023-03-08 14:00:00             4
#> 32 Goose 2023-03-08 14:30:00            10
#> 33 Goose 2023-03-08 15:00:00             7
#> 34 Goose 2023-03-08 15:30:00             1
#> 35 Goose 2023-03-08 17:00:00             7
#> 36 Goose 2023-03-08 17:30:00             4

Created on 2023-03-09 with reprex v2.0.2

If you do indeed have an invalid tsibble (different counts of geese at the same point in time for instance), you should carefully fix this issue before converting the data to a tsibble.
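If the duplicates turn out to be genuine repeated observations whose counts should be combined, one way to fix them is to collapse the duplicate (key, index) pairs with plain dplyr before conversion. A sketch under that assumption, again with toy data rather than the MRE's:

```r
library(dplyr)
library(tsibble)

# Toy data: two Duck observations at the exact same instant
t0 <- as.POSIXct("2023-03-08 08:00:09", tz = "UTC")
obs <- tibble(
  Timestamp = c(t0, t0, t0 + 60),
  Count     = c(1, 2, 3),
  Bird      = c("Duck", "Duck", "Goose")
)

# Collapse duplicate (Bird, Timestamp) pairs by summing the counts...
deduped <- obs |>
  group_by(Bird, Timestamp) |>
  summarise(Count = sum(Count), .groups = "drop")

# ...after which every (key, index) pair is unique and the conversion is valid
deduped |>
  as_tsibble(index = Timestamp, key = Bird, regular = FALSE)
```

Whether summing is the right fix depends on your data; if the duplicates are contradictory measurements rather than separate counts, they need investigating instead.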

  • Thanks. Thinking about it, I cannot even rule out that there might be more than one observation of the same type of bird at the same time: think observations made at different sites. I suppose that would be easy to fix by using the site as an additional key? – BestGirl Mar 09 '23 at 08:05
  • Another issue is the granularity of the timestamp. In my data it's one second, but observations can in principle be made at a finer granularity. I cannot a priori rule out that there will be two observations of the same type of bird at the same site that occur within, say, 0.1 seconds of one another and that would end up getting the same timestamp; nor can I control the process that generates this data to ensure that timestamps are more granular. So `tsibble`'s restriction may remain a genuine issue that can't easily be resolved with additional keys here. – BestGirl Mar 09 '23 at 08:08
  • I was able to get around the problem of duplicated timestamps within each group (bird/site) by grouping by timestamp, bird and site and summing the numbers of birds observed (`summarize()`) before converting to a tsibble. With a large number of groups (most of which have size 1 because timestamps aren't duplicated) this takes a long time using dplyr though. – BestGirl Mar 09 '23 at 09:59
  • Yes, you can have multiple keys, and `site` should be one in your case. As for timestamps, you can use sub-second precision and tsibble will be okay with it - as long as there aren't multiple observations of the same series (site, bird) at the exact same point in time. Whether it is reasonable to add up birds counted a mere 0.1 seconds apart as if they were different birds is up to you. Adding up duplicates is one way of fixing the issue, but it is strange to have multiple observations of the same bird at the same site at the exact same second - these might need more investigation than adding up. – Mitchell O'Hara-Wild Mar 09 '23 at 22:56
  • you're right, it would be strange if you were actually observing waterfowl (but this is just a made-up example). Adding up works, and I hope I won't have to do it too often --- that, or I'll have to investigate just why it's taking dplyr so long (think 20-minute operations, where pretty much everything else I do with my data takes a few seconds at most), or look for an alternative. People keep recommending `data.table`, but I've shied away from it so far because I'm kinda used to the tidyverse approach by now. – BestGirl Mar 10 '23 at 08:03
  • It could be faster to split the dataset into two parts and recombine afterwards: split it into the (key, timestamp) combinations with 2+ rows and those with exactly 1 row, add up the duplicates in the 2+ rows part, then `bind_rows()` back onto the 1-row part and sort the data. – Mitchell O'Hara-Wild Mar 10 '23 at 22:38
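The split-and-recombine idea from the last comment can be sketched with plain dplyr: `add_count()` flags the colliding (Bird, Timestamp) pairs, so only the (typically small) duplicated subset goes through `summarise()`. A sketch on toy data, assuming summing duplicates is appropriate:

```r
library(dplyr)

# Toy data: one duplicated (Bird, Timestamp) pair among singletons
t0 <- as.POSIXct("2023-03-08 08:00:00", tz = "UTC")
obs <- tibble(
  Timestamp = c(t0, t0, t0 + 30, t0 + 60),
  Count     = c(1, 2, 3, 4),
  Bird      = c("Duck", "Duck", "Goose", "Duck")
)

# Flag rows whose (Bird, Timestamp) pair occurs more than once
flagged <- obs |>
  add_count(Bird, Timestamp) |>
  mutate(is_dup = n > 1) |>
  select(-n)

# Rows that are already unique pass through untouched
singles <- flagged |> filter(!is_dup) |> select(-is_dup)

# Only the duplicated subset is grouped and summed
collapsed <- flagged |>
  filter(is_dup) |>
  group_by(Bird, Timestamp) |>
  summarise(Count = sum(Count), .groups = "drop")

result <- bind_rows(singles, collapsed) |> arrange(Timestamp, Bird)
```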