I have a variant of this question: Count values in a data set that exceed a threshold in R:
I have some temperature measurement at almost random time intervals. I want to find out on which (how many) days a specific threshold value had been exceeded (in some time frame). Obviously without aggregating I could get multiple hits for the same day (if the threshold was exceeded multiple times). However I don't want that.
A short sample of the original data frame looks like this:
Time Temp Humidity Notes
1 2015-05-18 16:00:00 26.5 NA <NA>
2 2015-06-01 15:00:00 26.5 NA <NA>
3 2015-06-02 16:00:00 28.0 NA <NA>
4 2015-06-03 16:00:00 28.0 NA <NA>
5 2015-06-03 17:00:00 30.0 60 <NA>
6 2015-06-05 07:00:00 23.0 NA <NA>
So I computed a Day
variable (POSIXlt
):
1 2015-05-18 16:00:00 26.5 NA <NA> 2015-05-18
2 2015-06-01 15:00:00 26.5 NA <NA> 2015-06-01
3 2015-06-02 16:00:00 28.0 NA <NA> 2015-06-02
4 2015-06-03 16:00:00 28.0 NA <NA> 2015-06-03
5 2015-06-03 17:00:00 30.0 60 <NA> 2015-06-03
6 2015-06-05 07:00:00 23.0 NA <NA> 2015-06-05
I almost despaired trying to aggregate by day (I'm not showing all the variants I had tried):
with(t, aggregate(Temp ~ Day, data=t, FUN=max))
Error in model.frame.default(formula = Temp ~ Day, data = t) :
invalid type (list) for variable 'Day'
Only if I explicitly convert POSIXlt
to POSIXct
, it works (why have a class POSTXlt
that is treated as a list by aggregate?):
> with(t, aggregate(Temp ~ as.POSIXct(Day), data=t, FUN=max))
as.POSIXct(Day) Temp
1 2015-05-18 26.5
2 2015-06-01 26.5
3 2015-06-02 28.0
4 2015-06-03 30.0
5 2015-06-05 23.0
Unfortunately I've lost the other columns during aggregate. How can I preserve them?
Also I don't understand this:
> tt <-with(t, aggregate(Temp ~ as.POSIXct(Day), data=t, FUN=max))
> tt
as.POSIXct(Day) Temp
1 2015-05-18 26.5
2 2015-06-01 26.5
3 2015-06-02 28.0
4 2015-06-03 30.0
5 2015-06-05 23.0
> str(tt)
'data.frame': 5 obs. of 2 variables:
$ as.POSIXct(Day): POSIXct, format: "2015-05-18" "2015-06-01" ...
$ Temp : num 26.5 26.5 28 30 23
> tt$Temp > 25
[1] TRUE TRUE TRUE TRUE FALSE
> tt[tt$Temp > 25]
Error in `[.data.frame`(tt, tt$Temp > 25) : undefined columns selected
> tt[tt$Temp > 25,]
as.POSIXct(Day) Temp
1 2015-05-18 26.5
2 2015-06-01 26.5
3 2015-06-02 28.0
4 2015-06-03 30.0
> t$Temp > 25
[1] TRUE TRUE TRUE TRUE TRUE FALSE
> t[t$Temp > 25]
Time Temp Humidity Notes Day
1 2015-05-18 16:00:00 26.5 NA <NA> 2015-05-18
2 2015-06-01 15:00:00 26.5 NA <NA> 2015-06-01
3 2015-06-02 16:00:00 28.0 NA <NA> 2015-06-02
4 2015-06-03 16:00:00 28.0 NA <NA> 2015-06-03
5 2015-06-03 17:00:00 30.0 60 <NA> 2015-06-03
6 2015-06-05 07:00:00 23.0 NA <NA> 2015-06-05
Why did aggregate()
change the structure of t
? Can someone explain what I'm missing?
For reference, there's the sample data set (that has another variable Tim
(difftime
) holding the measurement offset from the start of day) in dput()
format:
> dput(t)
structure(list(Time = structure(list(sec = c(0, 0, 0, 0, 0, 0
), min = c(0L, 0L, 0L, 0L, 0L, 0L), hour = c(16L, 15L, 16L, 16L,
17L, 7L), mday = c(18L, 1L, 2L, 3L, 3L, 5L), mon = c(4L, 5L,
5L, 5L, 5L, 5L), year = c(115L, 115L, 115L, 115L, 115L, 115L),
wday = c(1L, 1L, 2L, 3L, 3L, 5L), yday = c(137L, 151L, 152L,
153L, 153L, 155L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L), zone = c("CEST",
"CEST", "CEST", "CEST", "CEST", "CEST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
)), .Names = c("sec", "min", "hour", "mday", "mon", "year",
"wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt",
"POSIXt")), Temp = c(26.5, 26.5, 28, 28, 30, 23), Humidity = c(NA,
NA, NA, NA, 60, NA), Notes = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_),
Day = structure(list(sec = c(0, 0, 0, 0, 0, 0), min = c(0L,
0L, 0L, 0L, 0L, 0L), hour = c(0L, 0L, 0L, 0L, 0L, 0L), mday = c(18L,
1L, 2L, 3L, 3L, 5L), mon = c(4L, 5L, 5L, 5L, 5L, 5L), year = c(115L,
115L, 115L, 115L, 115L, 115L), wday = c(1L, 1L, 2L, 3L, 3L,
5L), yday = c(137L, 151L, 152L, 153L, 153L, 155L), isdst = c(-1L,
-1L, -1L, -1L, -1L, -1L), zone = c("CEST", "CEST", "CEST",
"CEST", "CEST", "CEST"), gmtoff = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst",
"zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), Tim = structure(c(16,
15, 16, 16, 17, 7), class = "difftime", units = "hours")), .Names = c("Time",
"Temp", "Humidity", "Notes", "Day", "Tim"), row.names = c(NA,
6L), class = "data.frame")