Aggregating by multiple variables including time

Question

I'm new to programming in R (and for that matter, programming at all...) and I'm trying to do some data analysis for a project for my class. I have some data that looks like this:

Id	Time	Heartrate
1341231	2016-04-12 07:23:30	95
1341231	2016-04-12 07:23:40	101
1341231	2016-04-12 07:23:50	92
1341231	2016-04-12 07:24:00	87
2342383	2016-04-12 07:23:30	60

This is data from wearable fitness trackers, in 5 or 10 second intervals. It's a fairly large dataset, with more than 2 million entries. What I would like to do is: for each Id (aka each user), summarize the seconds data by hour, returning the average heart rate for each hour. So I’d like output that looks something like this:

Id	Time	Heartrate
1341231	2016-04-12 07:00	95
1341231	2016-04-12 08:00	82
1341231	2016-04-12 09:00	80
1341231	2016-04-12 10:00	100
2342383	2016-04-12 07:00	65

The dates were originally strings, so I parsed them with lubridate. But after that, things started to go awry.

So, I turn to my best technique: copy-pasting half-understood code.

First, I tried test_df <- aggregate(Heartrate ~ format(as.POSIXct(sechr$Time), "%m-%d-%y %H"), data=sechr, mean)

but that was no good. As I quickly realized, that dropped the Id out completely, summarizing my data in a more or less useless way.

So, next I tried various formulations of aggregate, which doesn't seem to take another argument for another variable, and then experimented with summarize and group_by, such as below:

testdf3 <- sechr %>% group_by(c(Time ~ format(as.POSIXct(sechr$Time))), "%m-%d-%y %H", Id) %>% summarise(avg_hr=sum(Heartrate))

Needless to say, basically guessing didn't work at all. I produced a lot of errors and several goofy, useless dataframes.

Basically, what I need is a way to say "for each distinct Id, give me the mean of each hour." I think using xts is the way to go? Maybe? but I'm puzzled about how to do what I'm trying to do.

use `aggregate(Heartrate~Id+h, transform(sechr, h = format(Tme, "%F %H")), mean)` — Onyambu, Apr 14 '23 at 04:02
Check the class of column `Time` and if it already is of class `"POSIXct"`, try `aggregate(Heartrate ~ Id + format(Time, "%Y-%m-%d %H"), data = sechr, mean)`. — Rui Barradas, Apr 14 '23 at 04:32
Thank you both for your commentary! I couldn't quite get aggregate to these ways, but it may well be user error. The groupby + summarise answer from below works well for this sort of problem. — stargazer lily, Apr 14 '23 at 05:20

NicChr · Answer 1 · 2023-04-14T12:36:11.863

An approach using my package timeplyr. As the name suggests, it is a time extension to dplyr and so works naturally with tidyverse syntax.

# remotes::install_github("NicChr/timeplyr")
library(timeplyr)
library(dplyr)
library(lubridate)
sechr %>%
  mutate(Time = ymd_hms(Time)) %>%
  group_by(Id) %>%
  time_summarise(avg_hr = mean(Heartrate), 
                 time = Time, by = "hour")
#> # A tibble: 2 x 3
#> # Groups:   Id [2]
#>        Id Time                avg_hr
#>     <int> <dttm>               <dbl>
#> 1 1341231 2016-04-12 07:23:30   93.8
#> 2 2342383 2016-04-12 07:23:30   60

^{Created on 2023-04-14 with reprex v2.0.2}

time_summarise() is more flexible as you can aggregate by any time unit.

sechr2 <- sechr %>%
  mutate(Time = ymd_hms(Time)) %>%
  group_by(Id)
sechr2 %>%
  time_summarise(avg_hr = mean(Heartrate), 
                 time = Time, by = "30 seconds")
#> # A tibble: 3 x 3
#> # Groups:   Id [2]
#>        Id Time                avg_hr
#>     <int> <dttm>               <dbl>
#> 1 1341231 2016-04-12 07:23:30     96
#> 2 1341231 2016-04-12 07:24:00     87
#> 3 2342383 2016-04-12 07:23:30     60
sechr2 %>%
  time_summarise(avg_hr = mean(Heartrate), 
                 time = Time, by = "minute")
#> # A tibble: 2 x 3
#> # Groups:   Id [2]
#>        Id Time                avg_hr
#>     <int> <dttm>               <dbl>
#> 1 1341231 2016-04-12 07:23:30   93.8
#> 2 2342383 2016-04-12 07:23:30   60
sechr2 %>%
  time_summarise(avg_hr = mean(Heartrate), 
                 time = Time, by = "30 minutes")
#> # A tibble: 2 x 3
#> # Groups:   Id [2]
#>        Id Time                avg_hr
#>     <int> <dttm>               <dbl>
#> 1 1341231 2016-04-12 07:23:30   93.8
#> 2 2342383 2016-04-12 07:23:30   60

You can also complete missing gaps in time if for some reason you didn't have data for certain intervals.

sechr2 %>%
  time_complete(time = Time, to = min(Time) + years(1))
#> Assuming a time granularity of 10 seconds
#> # A tibble: 6,307,202 x 3
#> # Groups:   Id [2]
#>         Id Time                Heartrate
#>  *   <int> <dttm>                  <int>
#>  1 1341231 2016-04-12 07:23:30        95
#>  2 1341231 2016-04-12 07:23:40       101
#>  3 1341231 2016-04-12 07:23:50        92
#>  4 1341231 2016-04-12 07:24:00        87
#>  5 1341231 2016-04-12 07:24:10        NA
#>  6 1341231 2016-04-12 07:24:20        NA
#>  7 1341231 2016-04-12 07:24:30        NA
#>  8 1341231 2016-04-12 07:24:40        NA
#>  9 1341231 2016-04-12 07:24:50        NA
#> 10 1341231 2016-04-12 07:25:00        NA
#> # ... with 6,307,192 more rows

score 0 · Accepted Answer · answered Apr 14 '23 at 04:37

You weren’t too far off with your group_by() %>% summarize() attempt; I cleaned it up a bit, and specified timezone in format() using lubridate::tz().

library(dplyr)
library(lubridate)

sechr %>% 
  group_by(Id, Time = format(Time, "%m-%d-%y %H", tz = tz(Time))) %>%
  summarise(avg_hr = mean(Heartrate)) %>%
  ungroup()

# A tibble: 2 × 3
       Id Time        avg_hr
    <dbl> <chr>        <dbl>
1 1341231 04-12-16 07   93.8
2 2342383 04-12-16 07   60

An alternative to format(), if you’d like to keep Time as a datetime object, would be to round down to the hour using lubridate::floor_date().

sechr %>% 
  group_by(Id, Time = floor_date(Time, "hour")) %>%
  summarise(avg_hr = mean(Heartrate)) %>%
  ungroup()

# A tibble: 2 × 3
       Id Time                avg_hr
    <dbl> <dttm>               <dbl>
1 1341231 2016-04-12 07:00:00   93.8
2 2342383 2016-04-12 07:00:00   60

Thank you so much! I'm glad I wasn't barking up the wrong tree entirely. This a) works, and b) makes sense so that I understand how the group by and summarise functions work together a bit more. — stargazer lily, Apr 14 '23 at 05:14

score 0 · Answer 3 · answered Apr 15 '23 at 10:24

This is a rather makeshift solution using {data.table}.

I have replicated the first few rows to test it out.

sechr <- data.table(Id = c("1341231", "1341231", "1341231", "1341231", "2342383"),
                    Time = c("2016-04-12 07:23:30", "2016-04-12 07:23:40",
                             "2016-04-12 07:23:50", "2016-04-12 07:24:00",
                             "2016-04-12 07:23:30"),
                    Heartrate = c(95, 101, 92, 87, 60))

Here I determine the hour by referring to the number right before the first occurrence of colon (and the date, of course).

sechr[, TimeH := gsub(":.*", ":00:00", Time)]
sechr_ByIDHour <- sechr[, .(Heartrate = mean(Heartrate)), .(Id, TimeH)]

Aggregating by multiple variables including time

3 Answers3