
While answering the question "Temperature curve in R", I came across a weird behavior of a dplyr::filter / lubridate::minute combination.

See the test data dta below. dta$Time is converted to a lubridate Period (hours and minutes) with hm():

library(lubridate)
library(dplyr)

dta$Time <- hm(dta$Time)

To get only rows with full hours (i.e. 0 minutes) one can subset using lubridate::minute like this:

dta[minute(dta$Time) == 0,]
#        Time    Temp1    Temp2
# 1        0S 18.62800 18.54458
# 7  1H 0M 0S 18.45733 18.22625
# 13 2H 0M 0S 18.33258 18.04142

However, when using dplyr's filter, like this

dta %>% filter(minute(Time) == 0)
#     Time    Temp1    Temp2
# 1     0S 18.62800 18.54458
# 2 10M 0S 18.45733 18.22625
# 3 20M 0S 18.33258 18.04142

the result does not match the expectation: the Time column shows 0S, 10M 0S and 20M 0S instead of the full hours. (UPDATE: The values of Temp1 and Temp2 are correct; only Time is corrupt. Thanks to @Brian for the hint.)

Additionally this warning is returned:

Warning message: In format.data.frame(x, digits = digits, na.encode = FALSE) : corrupt data frame: columns will be truncated or padded with NAs

This was also reported and somehow solved here, but only by coercion, which seems to remove the fun (and very readable) part of lubridate.

Question: Is there any way (to date) to dplyr::filter on lubridate hm()/hms() Period columns without coercing them to character etc.?

Update:

It seems that the vector created by

minute(dta$Time)
# [1]  0 10 20 30 40 50  0 10 20 30 40 50  0

looks like a numeric vector, yet seems to have some mysterious characteristics.

Furthermore, as @Lyngbakr pointed out, even the result of the == comparison does not seem to behave like a "normal" logical vector.

tst <- minute(dta$Time) == 0 
dta %>% filter(tst)

will result in the same strange Time column.
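To check whether the index itself is unusual, one can inspect its class and attributes. This is only a minimal sketch on a three-element stand-in vector (not the full dta); if the checks pass, it would suggest that the logical index is ordinary and that the problem lies in how the Period column gets subset.

```r
library(lubridate)

# Small stand-in for dta$Time (assumption: three values are enough to illustrate)
Time <- hm(c("00:00", "00:10", "01:00"))

m   <- minute(Time)   # minute component of each Period
tst <- m == 0         # comparison used as the filter condition

class(Time)           # the column itself is an S4 Period
is.numeric(m)         # is the minute vector plain numeric?
is.logical(tst)       # is the comparison result plain logical?
attributes(tst)       # any hidden attributes on the index?
```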

Sample data:

dta <- read.table(text = "     Time        Temp1       Temp2
                           1  00:00     18.62800    18.54458
                           2   00:10     18.60025    18.48283
                           3   00:20     18.57250    18.36767
                           4   00:30     18.54667    18.36950
                           5   00:40     18.51483    18.36550
                           6   00:50     18.48325    18.34783
                           7   01:00     18.45733    18.22625
                           8   01:10     18.43767    18.19067
                           9   01:20     18.41583    18.22042
                           10  01:30     18.39608    18.21225
                           11  01:40     18.37625    18.18658
                           12  01:50     18.35633    18.05942
                           13  02:00     18.33258    18.04142", header = TRUE)
loki
  • Interesting. It seems that `filter` cannot deal with `Formal class 'Period'`. If you gather first and try to filter (`dta %>% gather(var, val, -Time) %>% filter(minute(Time) == 0)`) it throws an error `Error in filter_impl(.data, quo) : Result must have length 26, not 13` – Sotos Aug 21 '17 at 16:30
  • What I find particularly strange is that even if you use an intermediate variable, it persists. For example, `tst <- minute(dta$Time) == 0` and then `dta %>% filter(tst)`. As far as I can see, `tst` is just a regular logical vector. – Dan Aug 21 '17 at 16:42
  • All the columns **except** `Time` are being correctly `filter`ed. I tried making an intermediate variable `mutate(mins = minutes(Time))` and filtering on that, and the correct rows were returned, but not for the `Time` column. – Brian Aug 21 '17 at 16:57

1 Answer


I don't know why this works, but it does: The Time column needs to be of type datetime, not Period.

dta %>% 
  mutate(Time = as_datetime(hm(Time))) %>% 
  filter(minute(Time) == 0) 
                 Time    Temp1    Temp2
1 1970-01-01 00:00:00 18.62800 18.54458
2 1970-01-01 01:00:00 18.45733 18.22625
3 1970-01-01 02:00:00 18.33258 18.04142

This has the side effect of anchoring the times in the Time column to the Unix epoch (1970-01-01), so I would advise always including an actual date when you're using time-only data.

If these were minutes elapsed since the start of an experiment, it wouldn't matter much; you don't have to display the 1970-01-01 part.
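A minimal sketch of the "include an actual date" advice, assuming the measurements were taken on a known day. The date `measurement_date` is hypothetical (it is not in the original data), and only three rows are used for brevity.

```r
library(lubridate)
library(dplyr)

# Three rows of the sample data, with Time still stored as character
dta_small <- data.frame(
  Time  = c("00:00", "00:10", "01:00"),
  Temp1 = c(18.62800, 18.60025, 18.45733),
  stringsAsFactors = FALSE
)

measurement_date <- "2017-08-21"  # hypothetical date, chosen for illustration

res <- dta_small %>%
  # Build a full date-time instead of a Period
  mutate(Time = ymd_hm(paste(measurement_date, Time))) %>%
  # Keep only rows on the full hour
  filter(minute(Time) == 0)

res
```

The Time column is then a POSIXct vector, so no Period column ever passes through filter.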

Brian
  • Thanks for your answer to this post. However, since there is still some coercion to a datetime format necessary, I will not yet mark it as accepted, to keep the discussion going. – loki Aug 21 '17 at 19:24
  • @loki, I think S4 classes, like Period, just don't play well with dataframes. They have six slots, instead of being a single-value in a vector. From looking at `?lubridate::\`Period-class\``, I gather that any arithmetic on them is done under the hood by coercing to seconds first and then back-coercing. – Brian Aug 21 '17 at 19:31
  • Yes, it seems so. But as @Lyngbakr pointed out, even though `str(minute(dta$Time))` gives `num [1:13] 0 10 20 30 40 50 0 10 20 30 ..`, i.e. a numeric vector, it does not seem to be handled like one (nor like a logical when comparing it via `==`). – loki Aug 21 '17 at 20:26
  • Yes, and when I tried that, it returned the correct rows of the data (Temp1 and Temp2), but not the correct rows of Time. – Brian Aug 21 '17 at 20:28
  • Now, as you point that out, it is getting even more confusing. I will update the Q, to see if anyone has an idea on that... Thanks for pointing that out. – loki Aug 21 '17 at 20:30
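The slot structure mentioned in the comments above can be inspected directly. This is a rough sketch on a three-element stand-in vector. It only shows that a Period keeps its components in separate parallel vectors; whether dplyr's filter subsets just one of them and leaves the others at full length is an assumption here, not something this thread confirms.

```r
library(methods)
library(lubridate)

# Small stand-in for dta$Time
p <- hm(c("00:00", "00:10", "01:00"))

isS4(p)        # Period is an S4 object
slotNames(p)   # one slot per component, plus the underlying data

p@hour         # hours stored in their own vector
p@minute       # minutes stored in their own vector
p@.Data        # seconds live in the underlying numeric data

# lubridate's own `[` method keeps all slots in step
full_hours <- p[p@minute == 0]
full_hours
```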