
I am trying to filter a dataset of depth data with a corresponding date. The Time column is in POSIXct format = "%Y%m%d%H%M%S". This is what my data looks like:

Depth Time        
0.1   2018-06-24 01:26:40  
0.2   2018-06-24 01:26:41  
0.2   2018-06-24 01:26:56  
0.1   2018-06-24 01:26:57  
0.1   2018-06-24 01:26:58  
0.1   2018-06-24 01:26:59  
0.1   2018-06-24 01:27:14  
0.1   2018-06-24 01:27:15  
0.1   2018-06-24 01:27:16  
0.1   2018-06-24 01:27:17  
0.1   2018-06-24 01:27:30  

I would like to create a data frame containing the same data, but with a new entry only every 15 seconds. My data is sometimes continuous, and sometimes there are gaps in the recorded time.

This would be my desired output:

Depth Time
0.2 2018-06-24 01:26:41
0.2 2018-06-24 01:26:56
0.1 2018-06-24 01:27:14
0.1 2018-06-24 01:27:30
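
In case anyone wants to reproduce this, a minimal version of the sample data (timezone assumed UTC here, since it is not given above) can be built like so:

```r
# Minimal reproducible version of the sample data (timezone assumed UTC)
d_cor <- data.frame(
  Depth = c(0.1, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1),
  Time  = as.POSIXct(
    c("2018-06-24 01:26:40", "2018-06-24 01:26:41", "2018-06-24 01:26:56",
      "2018-06-24 01:26:57", "2018-06-24 01:26:58", "2018-06-24 01:26:59",
      "2018-06-24 01:27:14", "2018-06-24 01:27:15", "2018-06-24 01:27:16",
      "2018-06-24 01:27:17", "2018-06-24 01:27:30"),
    tz = "UTC")
)
```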

I tried using the time difference between rows, which works for the parts of the dataset that are consecutive:

    dt_filter <- d_cor %>%
      mutate(diff = Time - lag(Time, default = first(Time)))

    if (all(dt_filter$diff < 2)) {
      dt_filter_1 <- dt_filter[seq(1, nrow(dt_filter), 15), ]
    }

but that gives me a problem once I try to add in the rows that are not consecutive:

    dt_filter_15 <- dt_filter %>%
      filter(diff >= 15)

After that I no longer have a regular 15-second interval, so those rows obviously aren't counted in.

So far I could not find a function that can filter my Time column this way. I am quite new to R, so I am not too familiar with writing my own loops, which I guess is necessary... and time data does not make it easier.

Thanks for any help!

EDIT

@Ben Thanks for the quick replies!

This is some of the output that I've got:

        Depth Time                diff cumdiff x
        0.1   2018-06-23 23:59:44 1    1030    0
        0.0   2018-06-24 00:01:02 78   1035    5
        0.0   2018-06-24 00:01:03 1    1036    1

Between the last two lines there is only 1 s of difference, but it is still added to cumdiff and therefore counted in the x column.

Hobo

2 Answers


Thanks to your links I found this function, which seems to work perfectly. Just in case anyone has the same question:

    issecsApart <- function(d_cor) {
      secs <- 0      # time of the last kept row (0 = nothing kept yet)
      keeps <- c()   # logical mask to return
      for (d in d_cor) {
        if (d >= secs + 15) {   # at least 15 s since the last kept row?
          secs <- d
          keeps <- c(keeps, TRUE)
        } else {
          keeps <- c(keeps, FALSE)
        }
      }
      keeps
    }

    d_cor2 <- d_cor %>%
      arrange(Time) %>%
      filter(issecsApart(Time))
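
For what it's worth, the same keep-or-skip logic can also be sketched without growing a vector inside a loop, using Reduce() to carry the time of the last kept row along (function name is my own; assumes Time is sorted and strictly increasing):

```r
# Same running-threshold idea as issecsApart(), but via Reduce():
# carry the time of the last kept row through the vector and keep a
# row exactly when it is at least 15 s after that time.
keep_every_15s <- function(times) {
  t <- as.numeric(times)
  last_kept <- Reduce(
    function(prev, cur) if (cur >= prev + 15) cur else prev,
    t, accumulate = TRUE)
  t == last_kept   # TRUE exactly where the row updated the running threshold
}
```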

Appreciate all your help!

Ben
  • Looks good - I think you want to arrange by Time before filtering and calling the issecsApart function correct? (just edited) – Ben Sep 04 '19 at 03:19

One potential solution is with dplyr, though I am sure there may be better options available, especially with data.table. As suggested by @42- and demonstrated by @Maurits Evers, you can do the following:

    library(dplyr)

    d_cor %>%
      arrange(Time) %>%
      mutate(
        # gap to the previous row, explicitly in seconds (plain subtraction
        # of POSIXct values can silently switch to minutes for larger gaps)
        diff = as.numeric(difftime(Time, lag(Time), units = "secs")),
        diff = ifelse(is.na(diff), 0, diff),
        cumdiff = cumsum(diff) %/% 15,
        x = abs(lag(cumdiff) - cumdiff)) %>%
      filter(is.na(x) | x > 0) %>%
      select(Depth, Time)

      Depth                Time
    1   0.1 2018-06-24 01:26:40
    2   0.2 2018-06-24 01:26:56
    3   0.1 2018-06-24 01:27:14
    4   0.1 2018-06-24 01:27:30

diff holds the difference in seconds between consecutive rows' times. The first row would be NA (changed to 0).

cumdiff is the cumulative sum of diff after integer division by 15, so cumdiff increases by 1 each time at least another 15 seconds have accumulated.

The filter keeps the first row (where x is NA) and every row where cumdiff changes, i.e. where the cumulative time has crossed another 15-second boundary.
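
To see the arithmetic on the sample data, here is the same computation spelled out in base R, with the gaps hard-coded from the table in the question (base diff() stands in for dplyr's lag() here):

```r
# Gaps in seconds between consecutive sample rows (first row -> 0)
gaps    <- c(0, 1, 15, 1, 1, 1, 15, 1, 1, 1, 13)
cumdiff <- cumsum(gaps) %/% 15
cumdiff
# 0 0 1 1 1 1 2 2 2 2 3
# keep the first row plus every row where cumdiff steps up
keep <- c(TRUE, diff(cumdiff) > 0)
which(keep)
# 1 3 7 11  (the four rows shown in the output above)
```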

Other examples that may be helpful and include data.table:

Filter rows by a time threshold

Subset observations that differ by at least 30 minutes time

Subset time series so that selected rows differs by a certain minimum time

Edit: This solution looks for times within fixed 15-second windows. That causes problems when a gap is greater than 15 seconds: the window does not 'reset' at the kept row, so the next time is included no matter where it falls within its window. Because of this, two kept times can end up close to each other, especially right after a large gap.
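
A minimal illustration of that edge case, using the same fixed-window arithmetic on gaps I chose to trigger it:

```r
# Three rows at t = 0 s, 29 s and 30 s: the 29 s gap crosses one window
# boundary, and the following 1 s gap immediately crosses the next one
gaps    <- c(0, 29, 1)
cumdiff <- cumsum(gaps) %/% 15   # 0 1 2 -- every row lands in a new window
keep    <- c(TRUE, diff(cumdiff) > 0)
keep
# TRUE TRUE TRUE: rows 2 and 3 are both kept even though they are only
# 1 s apart; a running threshold like issecsApart above would drop row 3
```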

Ben
  • Thank you guys so much! I was thinking about cumsum but couldn't work out how to incorporate the 15! That worked pretty well on most of the data but it seems that the higher diffs are troubling it. – Ninaninanska Sep 02 '19 at 19:01
  • @Ninaninanska can you share example data where this may not be working? What results are you getting instead? – Ben Sep 02 '19 at 19:02
  • I added some to the original question – Ninaninanska Sep 03 '19 at 07:24
  • @Ninaninanska Thanks for additional info - I will follow up shortly (still editing). Glad you found a solution! – Ben Sep 03 '19 at 20:08