0

I have a data frame df. It has several columns, two of them are dates and serial_day, corresponding to the date an observation was taken and MATLAB's serial day. I would like to restrict my time series such that the increment (in days) between two consecutive observations is 3 or 4 and separate such blocks by a NA row.

It is known that consecutive daily observations never occur and the case of 2 day separation followed by 2 day separation is rare, so it can be ignored.

In the example, increment is shown for convenience, but it is easily generated using the diff function. So, if the data frame is

   serial_day   increment 
1      4           NA
2      7           3 
3      10          3
4      12          2
5      17          5
6      19          2
7      22          3
8      25          3
9      29          4
10     34          5 

I would hope to get a new data frame as:

    serial_day                      increment
1      4                                NA
2      7                                3
3      10                               3
4      NA ## Entire row of NAs          NA
5      19                               NA
6      22                               3
7      25                               3
8      29                               4 
9      NA ## Entire row of NAs          NA

I can't figure out a way to do this without looping, which is bad idea in R.

Yair Daon
  • 1,043
  • 2
  • 15
  • 27
  • Shouldn't row 6 in the first `df` be deleted too? The increment in that row is smaller than 3. If I understand your description correctly, you wish to keep only rows where the increment was either 3 or 4, right? – KenHBS Aug 24 '17 at 04:25
  • @KenS. I'd like to keep row 6 in the original data frame, since the time increment from 6 to 7 is 3. – Yair Daon Aug 24 '17 at 17:52

1 Answers1

1

First you check in which rows the increment is not equal to 3 or 4. Then you'd replace these rows with a row of NAs:

inds <- which( df$increment > 4 | df$increment < 3 )
df[inds, ] <- rep(NA, ncol(df))

#    serial_day increment
# 1           4        NA
# 2           7         3
# 3          10         3
# 4          NA        NA
# 5          NA        NA
# 6          NA        NA
# 7          22         3
# 8          25         3
# 9          29         4
# 10         NA        NA

This may result in multiple consecutive rows of NAs. In order to reduce these consecutive NA-rows to a single NA-row, you'd check where the NA-rows are located with which() and then see whether these locations are consecutive with diff() and remove these rows from df:

NArows <- which(rowSums(is.na(df)) == ncol(df))       # c(4, 5, 6, 10)
inds2  <- NArows[c(FALSE, diff(NArows) == 1)]         # c(5, 6)
df     <- df[-inds2, ]

#    serial_day increment
# 1           4        NA
# 2           7         3
# 3          10         3
# 4          NA        NA
# 7          22         3
# 8          25         3
# 9          29         4
# 10         NA        NA
KenHBS
  • 6,756
  • 6
  • 37
  • 52