
I am trying to subset, or filter, data into a defined time interval. Can you help me subset the following data into 2-minute time intervals? I have looked at lubridate, split(), and cut() but cannot figure out how to do this properly.

I've looked at this post on SO; however, it doesn't seem to be what I need.

Note that columns 1 and 2 are character classes, column 3 is a POSIXct class. If possible I'd like to have the solution use the datetime column (POSIXct).

         date  time            datetime use..kW.     gen..kW. Grid..kW.
120 12/31/2013 21:59 2013-12-31 21:59:00 1.495833 -0.003083333  1.495833
121 12/31/2013 21:58 2013-12-31 21:58:00 1.829583 -0.003400000  1.829583
122 12/31/2013 21:57 2013-12-31 21:57:00 1.977283 -0.003450000  1.977283
123 12/31/2013 21:56 2013-12-31 21:56:00 2.494750 -0.003350000  2.494750
124 12/31/2013 21:55 2013-12-31 21:55:00 2.218283 -0.003500000  2.218283
125 12/31/2013 21:54 2013-12-31 21:54:00 2.008283 -0.003566667  2.008283
126 12/31/2013 21:53 2013-12-31 21:53:00 2.010917 -0.003600000  2.010917
127 12/31/2013 21:52 2013-12-31 21:52:00 2.011867 -0.003583333  2.011867
128 12/31/2013 21:51 2013-12-31 21:51:00 2.015033 -0.003600000  2.015033
129 12/31/2013 21:50 2013-12-31 21:50:00 2.096550 -0.003850000  2.096550

The new subset would just take the data at every two-minute interval and look like:

      date  time            datetime use..kW.     gen..kW. Grid..kW.
121 12/31/2013 21:58 2013-12-31 21:58:00 1.829583 -0.003400000  1.829583
123 12/31/2013 21:56 2013-12-31 21:56:00 2.494750 -0.003350000  2.494750
125 12/31/2013 21:54 2013-12-31 21:54:00 2.008283 -0.003566667  2.008283
127 12/31/2013 21:52 2013-12-31 21:52:00 2.011867 -0.003583333  2.011867
129 12/31/2013 21:50 2013-12-31 21:50:00 2.096550 -0.003850000  2.096550

For my actual data, I will be using 5- and 15-minute intervals. But given a working solution for the data above with a 2-minute interval, I should be able to adjust the code to fit my needs.

stokeinfo
  • It's not clear what you're trying to do, why it's time-related, or why `Data[c(FALSE,TRUE),]` doesn't work. – Joshua Ulrich Feb 10 '14 at 23:47
  • I would like to filter my data based on a time interval. In the example data I provided above, I only want to take the observations every 2 minutes - not compute any function every two minutes, just subset the observation at the time interval. Does that help? – stokeinfo Feb 10 '14 at 23:53
  • 1
    Simple subsetting rules, combined with regex, might suffice here. E.g. for a data.frame `d`: `d[grep('[02468]$', d$time), ]`. The pattern matches against any strings (times) ending in 0, 2, 4, 6, or 8. – jbaums Feb 10 '14 at 23:56
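The regex trick in the comment above can be sketched end-to-end; the tiny data frame below is made up for illustration and only mimics the `time` column of the question:

```r
# Toy data mimicking the question's time column (strings, descending order)
df <- data.frame(
  time = c("21:59", "21:58", "21:57", "21:56", "21:55", "21:54"),
  stringsAsFactors = FALSE
)

# Keep rows whose time string ends in an even digit, i.e. every 2nd minute
even <- df[grep("[02468]$", df$time), , drop = FALSE]
even$time
# "21:58" "21:56" "21:54"
```

Note this pattern only works for a 2-minute interval; 5- or 15-minute intervals would need a different pattern (or a numeric approach on the POSIXct column).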

1 Answer


Using `cut` and `plyr::ddply`:

library(plyr)

groups <- cut(as.POSIXct(df$datetime), breaks = "2 min")
ddply(df, "groups", tail, 1)[, -1]
#         date  time            datetime use..kW.     gen..kW. Grid..kW.
# 1 12/31/2013 21:50 2013-12-31 21:50:00 2.096550 -0.003850000  2.096550
# 2 12/31/2013 21:52 2013-12-31 21:52:00 2.011867 -0.003583333  2.011867
# 3 12/31/2013 21:54 2013-12-31 21:54:00 2.008283 -0.003566667  2.008283
# 4 12/31/2013 21:56 2013-12-31 21:56:00 2.494750 -0.003350000  2.494750
# 5 12/31/2013 21:58 2013-12-31 21:58:00 1.829583 -0.003400000  1.829583

Or

arrange(ddply(df, "groups", tail, 1)[, -1], datetime, decreasing=TRUE)

if you want to sort it the other way round.
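As an aside not taken from the answer: since the intervals the asker ultimately needs (5 and 15 minutes) divide an hour evenly, a base-R sketch without plyr could filter on the minute component of the POSIXct column directly. The data frame and the `interval` name here are made up for illustration:

```r
# Stand-in data: ten one-minute POSIXct observations, as in the question
df <- data.frame(
  datetime = as.POSIXct("2013-12-31 21:50:00", tz = "UTC") + 60 * 0:9
)

interval <- 2  # minutes; change to 5 or 15 as needed
keep <- as.integer(format(df$datetime, "%M")) %% interval == 0
sub <- df[keep, , drop = FALSE]
format(sub$datetime, "%H:%M")
# "21:50" "21:52" "21:54" "21:56" "21:58"
```

Unlike the `cut`/`ddply` approach, this keeps rows whose minute is an exact multiple of the interval rather than the last observation in each bin.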

lukeA
  • Thanks @lukeA! This appears to be close to what I need. However when I call ddply() it fills in the first two rows (21:50 and 21:52) with NAs. Do you have any idea why it's doing this? Also can you walk me through the syntax of your ddply function? – stokeinfo Feb 11 '14 at 18:01
  • NVM, the issue was that my df only had six rows (head of original data). Code works now for the small example. I am testing it with the original dataframe which is 299,466 observations, taking a while... Can you still walk me through the syntax of the ddply function? – stokeinfo Feb 11 '14 at 18:22
  • @stokeinfo well first, `ddply` splits `df` into sub data frames by `groups`, which contains the time interval for each observation. Then `ddply` applies the `tail` function with a parameter of 1 to each sub data frame. This returns only the 1 last row of each. After that, `ddply` binds the results together and returns a data frame. `groups` will be the first column, and the `[, -1]` excludes it. Dunno if this is really what you needed, but it was a try. – lukeA Feb 11 '14 at 18:32
  • thanks for the explanation @lukeA. It works great. Thanks to you I was able to whittle my data down from 478,000 rows (minute-level observations) to only 20,899 (15-minute intervals). This will make my analysis much faster. For speed's sake, using a regex to subset like jbaums alluded to could be faster, but I wanted to learn a solution that uses the POSIXct format - so your answer is much appreciated. Thanks! – stokeinfo Feb 11 '14 at 18:40
  • @stokeinfo glad it works. PS: The author of `plyr` lately released `dplyr`, which is much faster. Maybe you want to check that out, too: http://blog.rstudio.org/2014/01/30/dplyr-0-1-1/ . – lukeA Feb 11 '14 at 18:57
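Following up on the `dplyr` pointer in the last comment, a rough modern-dplyr (1.0+) equivalent of the accepted `ddply` answer might look like the sketch below; the stand-in data mirrors the question's descending minute-level rows:

```r
library(dplyr)

# Stand-in data: ten one-minute observations in descending order,
# like the question's data frame
df <- data.frame(
  datetime = as.POSIXct("2013-12-31 21:59:00", tz = "UTC") - 60 * 0:9,
  use..kW. = runif(10)
)

res <- df %>%
  mutate(groups = cut(datetime, breaks = "2 min")) %>%  # 2-minute bins
  group_by(groups) %>%
  slice_tail(n = 1) %>%   # last row of each bin, as tail(x, 1) did
  ungroup() %>%
  select(-groups)
```

`slice_tail()` only exists in dplyr 1.0.0 and later; in the dplyr of 2014 the equivalent inside `group_by` was `slice(n())`.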