2

I have a data frame in which the column foo contains running sequences of NA values. For example:

> test
   id  foo                time
1   1 <NA> 2018-11-19 00:00:48
2   1 <NA> 2018-11-19 00:10:51
3   1 <NA> 2018-11-19 00:21:15
4   1 <NA> 2018-11-19 00:31:02
5   1    x 2018-11-19 00:40:59
6   1    x 2018-11-19 00:50:49
7   1    x 2018-11-19 01:01:15
8   1 <NA> 2018-11-19 01:11:07
9   1 <NA> 2018-11-19 01:20:49
10  2 <NA> 2018-11-19 01:30:50
11  2 <NA> 2018-11-19 01:40:43
12  2    x 2018-11-19 01:50:46
13  2    x 2018-11-19 02:01:02
14  2    x 2018-11-19 02:10:44
15  2 <NA> 2018-11-19 02:20:51
16  2 <NA> 2018-11-19 02:31:06
17  2 <NA> 2018-11-19 02:40:42
18  2 <NA> 2018-11-19 02:50:45
19  3 <NA> 2018-11-19 03:01:00
20  3 <NA> 2018-11-19 03:10:42
21  3 <NA> 2018-11-19 03:21:10
22  3 <NA> 2018-11-19 03:31:10
23  3    x 2018-11-19 03:40:44
24  3 <NA> 2018-11-19 03:50:46
25  3 <NA> 2018-11-19 04:00:46

My objective is to mark where each sequence begins by id and time for example - the above dataset would have an extra column called index which marks where the starts and ends of these NA values are. However, the last NA in the id series should be ignored, and a single NA value would be marked as "both". For example:

> test
   id  foo                time     index
1   1 <NA> 2018-11-19 00:00:48 na_starts
2   1 <NA> 2018-11-19 00:10:51          
3   1 <NA> 2018-11-19 00:21:15          
4   1 <NA> 2018-11-19 00:31:02   na_ends
5   1    x 2018-11-19 00:40:59          
6   1    x 2018-11-19 00:50:49          
7   1    x 2018-11-19 01:01:15          
8   1 <NA> 2018-11-19 01:11:07 na_starts
9   1 <NA> 2018-11-19 01:20:49          
10  2 <NA> 2018-11-19 01:30:50 na_starts
11  2 <NA> 2018-11-19 01:40:43   na_ends
12  2    x 2018-11-19 01:50:46          
13  2    x 2018-11-19 02:01:02          
14  2    x 2018-11-19 02:10:44          
15  2 <NA> 2018-11-19 02:20:51 na_starts
16  2 <NA> 2018-11-19 02:31:06          
17  2 <NA> 2018-11-19 02:40:42          
18  2 <NA> 2018-11-19 02:50:45          
19  3 <NA> 2018-11-19 03:01:00          
20  3 <NA> 2018-11-19 03:10:42 na_starts
21  3 <NA> 2018-11-19 03:21:10          
22  3 <NA> 2018-11-19 03:31:10   na_ends
23  3    x 2018-11-19 03:40:44          
24  3 <NA> 2018-11-19 03:50:46      both
25  3    x 2018-11-19 04:00:46   

How would one achieve this with rle or a similar function in R?

 dput(test)
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3), foo = c(NA, NA, NA, NA, 
"x", "x", "x", NA, NA, NA, NA, "x", "x", "x", NA, NA, NA, NA, 
NA, NA, NA, NA, "x", NA, "x"), time = structure(c(1542585648, 
1542586251, 1542586875, 1542587462, 1542588059, 1542588649, 1542589275, 
1542589867, 1542590449, 1542591050, 1542591643, 1542592246, 1542592862, 
1542593444, 1542594051, 1542594666, 1542595242, 1542595845, 1542596460, 
1542597042, 1542597670, 1542598270, 1542598844, 1542599446, 1542600046
), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, 
-25L), class = "data.frame")
r2evans
  • 141,215
  • 6
  • 77
  • 149
iskandarblue
  • 7,208
  • 15
  • 60
  • 130
  • This is interesting, what have you tried so far? – jay.sf Dec 21 '18 at 21:33
  • This problem is not specific to `data.table` or `dplyr`, so I removed the tags. If I'm wrong, perhaps (1) your data should be a `tbl_df` or `data.table`, and (2) you should show code you've tried so far relying on either of these packages. – r2evans Dec 21 '18 at 21:50
  • *"mark where each sequence begins by `id` and `time`"*: by time? Time is completely unique, by that then all `NA`s should be its own sequence. Perhaps you mean *"ordered by `time`, grouped by `id`"*? – r2evans Dec 21 '18 at 21:53
  • @r2evans that is correct – iskandarblue Dec 21 '18 at 22:03
  • Curious, why exactly do you need these arbitrary strings in your data? Functionally, I can't see any advantage to this alternate column. Is it purely for aesthetic reasons or something else? – r2evans Dec 21 '18 at 22:05

3 Answers3

2

Maybe this will work? I'm not entirely sure what relationship time has to the problem other than I think you wanted it sorted by id and time.

library("tidyverse")                                                                                                                                            -25L), class = "data.frame")
test = test %>% 
  arrange(id, time) %>% 
  mutate(miss = is.na(foo))

# This will make the index column for a single run
mark_ends = function(n, miss){
  if(!miss){
    rep("", times = n)
  }
  else{
    if(n == 1){"both"}
    else(c("na_starts", rep("", times = (n-2)), "na_ends"))}
}

# This will use mark_ends across a single ID
mark_index = function(id){
   runs = test$miss[test$id == id] %>% 
     rle
  result = Map(f = mark_ends, n = runs$lengths, miss = runs$values) %>% 
    reduce(.f = c)
  result[length(result)] = ""
  result
}

# use the function on each id, combine, and put it in test
test$index = unique(test$id) %>% 
  map(mark_index) %>% 
  reduce(.f = c)
svenhalvorson
  • 1,090
  • 8
  • 21
1

Using tidyverse and data.table you can do:

df %>%
 rowid_to_column() %>%
 group_by(id, temp = rleid(foo)) %>%
 mutate(temp2 = seq_along(temp),
        index = ifelse(is.na(foo) & temp2 == min(temp2) & temp2 == max(temp2), paste0("both"), 
                       ifelse(is.na(foo) & temp2 == min(temp2), paste0("na_starts"), 
                              ifelse(is.na(foo) & temp2 == max(temp2), paste0("na_ends"), NA)))) %>%
 group_by(id) %>%
 mutate(index = ifelse(rowid == max(rowid[is.na(foo) & max(temp) & max(temp2)]) & 
                         is.na(lag(foo)), NA, index)) %>%
 select(-temp, -temp2, -rowid)

      id foo   time                index    
   <dbl> <chr> <dttm>              <chr>    
 1    1. <NA>  2018-11-19 00:00:48 na_starts
 2    1. <NA>  2018-11-19 00:10:51 <NA>     
 3    1. <NA>  2018-11-19 00:21:15 <NA>     
 4    1. <NA>  2018-11-19 00:31:02 na_ends  
 5    1. x     2018-11-19 00:40:59 <NA>     
 6    1. x     2018-11-19 00:50:49 <NA>     
 7    1. x     2018-11-19 01:01:15 <NA>     
 8    1. <NA>  2018-11-19 01:11:07 na_starts
 9    1. <NA>  2018-11-19 01:20:49 <NA>     
10    2. <NA>  2018-11-19 01:30:50 na_starts
11    2. <NA>  2018-11-19 01:40:43 na_ends  
12    2. x     2018-11-19 01:50:46 <NA>     
13    2. x     2018-11-19 02:01:02 <NA>     
14    2. x     2018-11-19 02:10:44 <NA>     
15    2. <NA>  2018-11-19 02:20:51 na_starts
16    2. <NA>  2018-11-19 02:31:06 <NA>     
17    2. <NA>  2018-11-19 02:40:42 <NA>     
18    2. <NA>  2018-11-19 02:50:45 <NA>     
19    3. <NA>  2018-11-19 03:01:00 na_starts
20    3. <NA>  2018-11-19 03:10:42 <NA>     
21    3. <NA>  2018-11-19 03:21:10 <NA>     
22    3. <NA>  2018-11-19 03:31:10 na_ends  
23    3. x     2018-11-19 03:40:44 <NA>     
24    3. <NA>  2018-11-19 03:50:46 both     
25    3. x     2018-11-19 04:00:46 <NA> 

First, it is creating a unique row ID. Second, it is grouping by "id" and the run length of "foo". Third, it is sequencing around the run length of "foo". Forth, it is creating the "index" variable using the given conditions. Then, it is grouping by "id" and assigns NA to the last row of a missing "foo" sequence per id. Finally, it removes the redundant variables.

tmfmnk
  • 38,881
  • 4
  • 47
  • 67
1

A possible solution using :

library(data.table)
setDT(test)

ind <- test[, .(ri = unique(.I[c(1,.N)][all(is.na(foo))]))
            , by = .(id, rl = rleid(is.na(foo)))
            ][, index := list("both",c("na_starts","na_ends"))[[1 + (.N > 1)]]
              , by = .(id, rl)][]

test[ind$ri, index := ind$index
     ][test[, .I[.N], by = id]$V1, index := NA][]

which gives:

> test
    id  foo                time     index
 1:  1 <NA> 2018-11-19 00:00:48 na_starts
 2:  1 <NA> 2018-11-19 00:10:51      <NA>
 3:  1 <NA> 2018-11-19 00:21:15      <NA>
 4:  1 <NA> 2018-11-19 00:31:02   na_ends
 5:  1    x 2018-11-19 00:40:59      <NA>
 6:  1    x 2018-11-19 00:50:49      <NA>
 7:  1    x 2018-11-19 01:01:15      <NA>
 8:  1 <NA> 2018-11-19 01:11:07 na_starts
 9:  1 <NA> 2018-11-19 01:20:49      <NA>
10:  2 <NA> 2018-11-19 01:30:50 na_starts
11:  2 <NA> 2018-11-19 01:40:43   na_ends
12:  2    x 2018-11-19 01:50:46      <NA>
13:  2    x 2018-11-19 02:01:02      <NA>
14:  2    x 2018-11-19 02:10:44      <NA>
15:  2 <NA> 2018-11-19 02:20:51 na_starts
16:  2 <NA> 2018-11-19 02:31:06      <NA>
17:  2 <NA> 2018-11-19 02:40:42      <NA>
18:  2 <NA> 2018-11-19 02:50:45      <NA>
19:  3 <NA> 2018-11-19 03:01:00 na_starts
20:  3 <NA> 2018-11-19 03:10:42      <NA>
21:  3 <NA> 2018-11-19 03:21:10      <NA>
22:  3 <NA> 2018-11-19 03:31:10   na_ends
23:  3    x 2018-11-19 03:40:44      <NA>
24:  3 <NA> 2018-11-19 03:50:46      both
25:  3    x 2018-11-19 04:00:46      <NA>
Jaap
  • 81,064
  • 34
  • 182
  • 193