0

I have a variant of this question: Count values in a data set that exceed a threshold in R:

I have some temperature measurement at almost random time intervals. I want to find out on which (how many) days a specific threshold value had been exceeded (in some time frame). Obviously without aggregating I could get multiple hits for the same day (if the threshold was exceeded multiple times). However I don't want that.

A short sample of the original data frame looks like this:

                 Time Temp Humidity Notes
1 2015-05-18 16:00:00 26.5       NA  <NA>
2 2015-06-01 15:00:00 26.5       NA  <NA>
3 2015-06-02 16:00:00 28.0       NA  <NA>
4 2015-06-03 16:00:00 28.0       NA  <NA>
5 2015-06-03 17:00:00 30.0       60  <NA>
6 2015-06-05 07:00:00 23.0       NA  <NA>

So I computed a Day variable (POSIXlt):

1 2015-05-18 16:00:00 26.5       NA  <NA> 2015-05-18
2 2015-06-01 15:00:00 26.5       NA  <NA> 2015-06-01
3 2015-06-02 16:00:00 28.0       NA  <NA> 2015-06-02
4 2015-06-03 16:00:00 28.0       NA  <NA> 2015-06-03
5 2015-06-03 17:00:00 30.0       60  <NA> 2015-06-03
6 2015-06-05 07:00:00 23.0       NA  <NA> 2015-06-05

I almost despaired trying to aggregate by day (I'm not showing all the variants I had tried):

with(t, aggregate(Temp ~ Day, data=t, FUN=max))
Error in model.frame.default(formula = Temp ~ Day, data = t) : 
  invalid type (list) for variable 'Day'

Only if I explicitly convert POSIXlt to POSIXct, it works (why have a class POSTXlt that is treated as a list by aggregate?):

> with(t, aggregate(Temp ~ as.POSIXct(Day), data=t, FUN=max))
  as.POSIXct(Day) Temp
1      2015-05-18 26.5
2      2015-06-01 26.5
3      2015-06-02 28.0
4      2015-06-03 30.0
5      2015-06-05 23.0

Unfortunately I've lost the other columns during aggregate. How can I preserve them?

Also I don't understand this:

> tt <-with(t, aggregate(Temp ~ as.POSIXct(Day), data=t, FUN=max))
> tt
  as.POSIXct(Day) Temp
1      2015-05-18 26.5
2      2015-06-01 26.5
3      2015-06-02 28.0
4      2015-06-03 30.0
5      2015-06-05 23.0
> str(tt)
'data.frame':   5 obs. of  2 variables:
 $ as.POSIXct(Day): POSIXct, format: "2015-05-18" "2015-06-01" ...
 $ Temp           : num  26.5 26.5 28 30 23
> tt$Temp > 25
[1]  TRUE  TRUE  TRUE  TRUE FALSE
> tt[tt$Temp > 25]
Error in `[.data.frame`(tt, tt$Temp > 25) : undefined columns selected
> tt[tt$Temp > 25,]
  as.POSIXct(Day) Temp
1      2015-05-18 26.5
2      2015-06-01 26.5
3      2015-06-02 28.0
4      2015-06-03 30.0
> t$Temp > 25
[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
> t[t$Temp > 25]
                 Time Temp Humidity Notes        Day
1 2015-05-18 16:00:00 26.5       NA  <NA> 2015-05-18
2 2015-06-01 15:00:00 26.5       NA  <NA> 2015-06-01
3 2015-06-02 16:00:00 28.0       NA  <NA> 2015-06-02
4 2015-06-03 16:00:00 28.0       NA  <NA> 2015-06-03
5 2015-06-03 17:00:00 30.0       60  <NA> 2015-06-03
6 2015-06-05 07:00:00 23.0       NA  <NA> 2015-06-05

Why did aggregate() change the structure of t? Can someone explain what I'm missing?

For reference, there's the sample data set (that has another variable Tim (difftime) holding the measurement offset from the start of day) in dput() format:

> dput(t)
structure(list(Time = structure(list(sec = c(0, 0, 0, 0, 0, 0
), min = c(0L, 0L, 0L, 0L, 0L, 0L), hour = c(16L, 15L, 16L, 16L, 
17L, 7L), mday = c(18L, 1L, 2L, 3L, 3L, 5L), mon = c(4L, 5L, 
5L, 5L, 5L, 5L), year = c(115L, 115L, 115L, 115L, 115L, 115L), 
    wday = c(1L, 1L, 2L, 3L, 3L, 5L), yday = c(137L, 151L, 152L, 
    153L, 153L, 155L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L), zone = c("CEST", 
    "CEST", "CEST", "CEST", "CEST", "CEST"), gmtoff = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
    )), .Names = c("sec", "min", "hour", "mday", "mon", "year", 
"wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt", 
"POSIXt")), Temp = c(26.5, 26.5, 28, 28, 30, 23), Humidity = c(NA, 
NA, NA, NA, 60, NA), Notes = c(NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_), 
    Day = structure(list(sec = c(0, 0, 0, 0, 0, 0), min = c(0L, 
    0L, 0L, 0L, 0L, 0L), hour = c(0L, 0L, 0L, 0L, 0L, 0L), mday = c(18L, 
    1L, 2L, 3L, 3L, 5L), mon = c(4L, 5L, 5L, 5L, 5L, 5L), year = c(115L, 
    115L, 115L, 115L, 115L, 115L), wday = c(1L, 1L, 2L, 3L, 3L, 
    5L), yday = c(137L, 151L, 152L, 153L, 153L, 155L), isdst = c(-1L, 
    -1L, -1L, -1L, -1L, -1L), zone = c("CEST", "CEST", "CEST", 
    "CEST", "CEST", "CEST"), gmtoff = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec", 
    "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst", 
    "zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), Tim = structure(c(16, 
    15, 16, 16, 17, 7), class = "difftime", units = "hours")), .Names = c("Time", 
"Temp", "Humidity", "Notes", "Day", "Tim"), row.names = c(NA, 
6L), class = "data.frame")
U. Windl
  • 3,480
  • 26
  • 54

1 Answers1

1

Here are three solutions using dplyr, data.table and a split-lapply combination without any package:

In every case convert Time to POSIXct and Day to Date (I called the dataset you gave sample here and choose 27 as a cutoff, so we have one Day with two matching rows.):

sample$Time <- as.POSIXct(sample$Time)
sample$Day  <- as.Date(sample$Day)

With dplyr, the verbs speak for themselves, this is why this is my favourite solution:

require(dplyr)

result <- sample %>% 
          group_by(Day) %>% 
          summarise(greater27=max(Temp > 27))

result
# # A tibble: 5 x 2
#          Day greater27
#       <date>     <int>
# 1 2015-05-18         0
# 2 2015-06-01         0
# 3 2015-06-02         1
# 4 2015-06-03         1
# 5 2015-06-05         0

With data.table, leave the first argument empty to select all lines, do the calculation in the second argument, and specify by as named argument:

require(data.table)

sample <- data.table(sample)
result <- sample[, .(greater27=max(Temp > 27)), by="Day"]

result
#           Day greater27
# 1: 2015-05-18         0
# 2: 2015-06-01         0
# 3: 2015-06-02         1
# 4: 2015-06-03         1
# 5: 2015-06-05         0

And with just base R, split the dataset by Day this gives you a list of data.frames then apply the anonymous function, finally rbind everything back together to one data.frame:

result <- do.call(rbind, 
                  lapply(split(sample, sample$Day),
                         function(x){
                           data.frame(
                             Day = x$Day[1],
                             greater27 = max(x$Temp > 27)
                           )
                         }
                    )
            )

result
#                   Day greater27
# 2015-05-18 2015-05-18         0
# 2015-06-01 2015-06-01         0
# 2015-06-02 2015-06-02         1
# 2015-06-03 2015-06-03         1
# 2015-06-05 2015-06-05         0
Richard Telford
  • 9,558
  • 6
  • 38
  • 51
snaut
  • 2,261
  • 18
  • 37
  • Could you please give at least one line and the number of rows of the corresponding results? – U. Windl Aug 09 '17 at 09:35
  • Edited to include output. – snaut Aug 09 '17 at 09:54
  • I also changed the `data.table` example, so that the column of the output is named correctly. – snaut Aug 09 '17 at 09:58
  • Thanks! So you produce a table with the time and an Boolean-type "selected" column, while my approach tried a row-select of the original data (in case I want to do more with those). – U. Windl Aug 09 '17 at 13:25
  • In the `dplyr` version you can just add a `filter` to the chain to select only the right rows. In the other solutions the best way to select the rows is probably to do it in two steps, create the boolean and then select by it. – snaut Aug 10 '17 at 07:30