I have a data frame in which the column foo
contains running sequences of NA values. For example:
> test
id foo time
1 1 <NA> 2018-11-19 00:00:48
2 1 <NA> 2018-11-19 00:10:51
3 1 <NA> 2018-11-19 00:21:15
4 1 <NA> 2018-11-19 00:31:02
5 1 x 2018-11-19 00:40:59
6 1 x 2018-11-19 00:50:49
7 1 x 2018-11-19 01:01:15
8 1 <NA> 2018-11-19 01:11:07
9 1 <NA> 2018-11-19 01:20:49
10 2 <NA> 2018-11-19 01:30:50
11 2 <NA> 2018-11-19 01:40:43
12 2 x 2018-11-19 01:50:46
13 2 x 2018-11-19 02:01:02
14 2 x 2018-11-19 02:10:44
15 2 <NA> 2018-11-19 02:20:51
16 2 <NA> 2018-11-19 02:31:06
17 2 <NA> 2018-11-19 02:40:42
18 2 <NA> 2018-11-19 02:50:45
19 3 <NA> 2018-11-19 03:01:00
20 3 <NA> 2018-11-19 03:10:42
21 3 <NA> 2018-11-19 03:21:10
22 3 <NA> 2018-11-19 03:31:10
23 3 x 2018-11-19 03:40:44
24 3 <NA> 2018-11-19 03:50:46
25 3 <NA> 2018-11-19 04:00:46
My objective is to mark where each sequence begins by id
and time
for example - the above dataset would have an extra column called index
which marks where the starts and ends of these NA values are. However, the last NA in the id
series should be ignored, and a single NA value would be marked as "both". For example:
> test
id foo time index
1 1 <NA> 2018-11-19 00:00:48 na_starts
2 1 <NA> 2018-11-19 00:10:51
3 1 <NA> 2018-11-19 00:21:15
4 1 <NA> 2018-11-19 00:31:02 na_ends
5 1 x 2018-11-19 00:40:59
6 1 x 2018-11-19 00:50:49
7 1 x 2018-11-19 01:01:15
8 1 <NA> 2018-11-19 01:11:07 na_starts
9 1 <NA> 2018-11-19 01:20:49
10 2 <NA> 2018-11-19 01:30:50 na_starts
11 2 <NA> 2018-11-19 01:40:43 na_ends
12 2 x 2018-11-19 01:50:46
13 2 x 2018-11-19 02:01:02
14 2 x 2018-11-19 02:10:44
15 2 <NA> 2018-11-19 02:20:51 na_starts
16 2 <NA> 2018-11-19 02:31:06
17 2 <NA> 2018-11-19 02:40:42
18 2 <NA> 2018-11-19 02:50:45
19 3 <NA> 2018-11-19 03:01:00
20 3 <NA> 2018-11-19 03:10:42 na_starts
21 3 <NA> 2018-11-19 03:21:10
22 3 <NA> 2018-11-19 03:31:10 na_ends
23 3 x 2018-11-19 03:40:44
24 3 <NA> 2018-11-19 03:50:46 both
25 3 x 2018-11-19 04:00:46
How would one achieve this with rle
or a similar function in R?
dput(test)
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3), foo = c(NA, NA, NA, NA,
"x", "x", "x", NA, NA, NA, NA, "x", "x", "x", NA, NA, NA, NA,
NA, NA, NA, NA, "x", NA, "x"), time = structure(c(1542585648,
1542586251, 1542586875, 1542587462, 1542588059, 1542588649, 1542589275,
1542589867, 1542590449, 1542591050, 1542591643, 1542592246, 1542592862,
1542593444, 1542594051, 1542594666, 1542595242, 1542595845, 1542596460,
1542597042, 1542597670, 1542598270, 1542598844, 1542599446, 1542600046
), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
-25L), class = "data.frame")