1

It's probably really simple. In the first case, using presidential data, I can filter by either years or years 2. And I get the same result.

However, when I use posixct data and try to filter in a similar way I run into problems.

When I write

school_hours2<-as.character(c(07:18))

I can see the values in school_hours 2 are "7", "8","9" etc whereas in school_hours they are "07" "08" "09" etc

EDIT: I think this explains that difference then? EDIT: I can see the problem comparing integer:character, and even when I write the vector as.character the values in the vector do not match what I want.

What I'd like is to be able to filter by school_hours2. As that would mean I could think "i'd like to filter between these two times" and put the upper and lower bounds in. Rather than having to write all the interval points in between. How do I get this?

Why is filtering by "Y" easier than filtering by "H"?

library (tidyverse)
#some data - filtering works
data(presidential)
head(presidential)
str(presidential)
presidential%>%filter(format(as.Date(start),"%Y")<=2005)
years<-c('1979', '1980', '1981', '1982', 
                '1983', '1984', '1985', '1986',
                '1987', '1988', '1989', '1990'
)
years2<-c(1950:1990)
presidential%>%filter(format(as.Date(start),"%Y")%in% years2)
presidential%>%filter(format(as.Date(start),"%Y")%in% years)


#some date time data - filtering.
test_data<-sample(seq(as.POSIXct('2013/01/01'), as.POSIXct('2017/05/01'), by="day"), 1000)
td<-as.data.frame(test_data)%>%mutate(id = row_number())

school_hours<-c('07', '08', '09', '10', 
                '11', '12', '13', '14',
                '15', '16', '17', '18'
                  )
school_hours2<-c(07:18)
school_years<-c(2015,2016,2017)
school_years2<-c(2015:2017)

str(td)
test1<-td%>%
  filter(id >=79)

schools<-td%>%
  filter(format(test_data,'%H') %in% school_hours)

schools2<-td%>%
  filter(format(test_data,'%H') %in% school_hours2)

schools3<-td%>%
  filter(format(test_data,'%Y')==2017)

schools4<-td%>%
  filter(format(test_data,'%Y') %in% school_years)

schools5<-td%>%
  filter(format(test_data,'%Y') %in% school_years2)

Here's my question: In the code above, when I try to filter td (which contains posixct data) using school_hours or school_hours2 I get zero data returned. Why?

What I'd like to be able to do is instead of writing

school_hours<-c('07', '08', '09', '10', 
                    '11', '12', '13', '14',
                    '15', '16', '17', '18'
                      )

I'd write

school_hours2<-c(07:18)

Just like I have for school_years and the filtering would work. This doesn't work

schools2<-td%>%
  filter(format(test_data,'%H') %in% school_hours2)

This does work

schools5<-td%>%
  filter(format(test_data,'%Y') %in% school_years2)

WHY?

I ask because: I've used something similar to filter my real data, which I can't share, and I get a discrepancy.

When I use school_hours (which is a character) I generate 993 records and the first time is 07:00. When I use school_hours2 (which is an integer) I generate 895 records and the first time is 10:00. I know - "without the data we can't make any evaluation" but what I can't work out is why the two different vector filters work differently. Is it because school_hours contains characters and school_hours2 integers?

EDIT: I changed the test_data line to:

#some date time data - filtering.
test_data<-as.POSIXct(sample(seq(1127056501, 1127056501), 1000),origin = "1899-12-31",tz="UTC")

it's still problematic:

schools<-td%>%
  filter(format(test_data,'%H') %in% school_hours)

generates 510 rows

schools2<-td%>%
  filter(format(test_data,'%H') %in% school_hours2)

generates 379 rows

All of the data I'm really interested looks like this 1899-12-31 23:59:00

(where the last 6 digits represent a 24 hr clock time)

All I'm really trying to do is convert the time from this 1899-12-31 07:59:00 to the hour (7)

and then

use

school_hours2<-c(07:18)

as a filter. But will the hour generated by the conversion of 1899-12-31 07:59:00

be 07 or 7

Because if it's 07, then school_hours2<-c(07:18) generates 7 and school_hours2<-as.character(c(07:18)) generates '7'

How do I get around this?

EDIT: LIKE THIS: R: how to filter a timestamp by hour and minute?

td1<-td%>%mutate(timestamp_utc = ymd_hms(test_data,tz="UTC"))%>%
  mutate(hour = hour(timestamp_utc))%>%
filter(hour(timestamp_utc) %in% school_hours)

td2<-td%>%mutate(timestamp_utc = ymd_hms(test_data,tz="UTC"))%>%
  mutate(hour = hour(timestamp_utc))%>%
  filter(hour(timestamp_utc) %in% school_hours2)

td3<-td%>%
  mutate(hour = hour(test_data))%>%
  filter(hour(test_data) %in% school_hours2)
damo
  • 463
  • 4
  • 14
  • Ideally, you should not compare character values with integer values. Comparing `"07"` with `7` is not correct. – Ronak Shah Oct 01 '19 at 10:48
  • Ah. I see. It's not clear in my question that i've worked out comparing the integer:character isn't working! sorry. I'll make that clear. – damo Oct 01 '19 at 11:24
  • Sorry, what is the question here? – Ronak Shah Oct 01 '19 at 11:31
  • see edit. It's why filtering by "Y" does work but filtering by "H" doens't when I'm looking at posixct data. – damo Oct 01 '19 at 11:48
  • Thank you for an elaborated question. The reason of the troubles on the test data is that it just do not contain any school hours. The `test_data` were generated with a `"day"` step, which means there is only 00 and 23 hours, as you may check with `unique(format(test_data, "%H"))`. Indeed, I can't guess what is wrong with your original data, but probably there are some similar issues with data content or format? – Ekatef Oct 01 '19 at 13:02
  • thanks for the explanation. I tried to change my question, and I appreciate the input! any advice on the edits would be great. thank you. I'm wondering if it would be simpler to somehow return the hour number and then filter based on that? – damo Oct 01 '19 at 13:47

1 Answers1

0

After a lot of mucking around and talking to myself in my question

I found this thread: filtering a dataset by time stamp

and it helped me to realise how to isolate the hour in the time stamp and then use that to filter the data properly.

the final answer is to isolate the hour by this

filter(hour(timestamp_utc) %in% school_hours2)
damo
  • 463
  • 4
  • 14