R - Subset dataframe based on a repeated sequence

Question

I am trying to subset a data frame based on specific sequence occurring in column v3. A sample of a dataframe:

v1 <- c(1:20)
v2 <- c(1,1,0,0,1,0,1,1,1,0,1,1,0,0,0,1,1,0,0,0)
v3 <- c(4,4,2,3,2,3,2,4,4,2,3,2,3,3,3,4,4,2,3,3)
my_df <- data.frame(v1,v2,v3)         # creating a dataframe

sample output for my_df

   v1 v2 v3
1   1  1  4
2   2  1  4
3   3  0  2
4   4  0  3
5   5  1  2
6   6  0  3
7   7  1  2
8   8  1  4
9   9  1  4
10 10  0  2
11 11  1  3
12 12  1  2
13 13  0  3
14 14  0  3
15 15  0  3
16 16  1  4
17 17  1  4
18 18  0  2
19 19  0  3
20 20  0  3

The output I am trying to achieve should look like this

1   1  1  4
2   2  1  4
3   3  0  2
8   8  1  4
9   9  1  4
10 10  0  2
16 16  1  4
17 17  1  4
18 18  0  2

So I want to subset my df according to sequence of 4 4 2 in column v3. What I tried so far is:

my_df[which(c(diff(v3))==-2),]

but this only extracts the middle four of the sequence 4 4 2 like

v1 v2 v3
 2  2  1  4
 9  9  1  4
17 17  1  4

Another option I tried:

m = match(v3, c(4,4,2))
> m
 [1]  1  1  3 NA  3 NA  3  1  1  3 NA  3 NA NA NA  1  1  3 NA NA  
> my_df[!is.na(m),]
   v1 v2 v3
1   1  1  4
2   2  1  4
3   3  0  2
5   5  1  2
7   7  1  2
8   8  1  4
9   9  1  4
10 10  0  2
12 12  1  2
16 16  1  4
17 17  1  4
18 18  0  2

This output gives me all 4 and 2 but not the sequence 4 4 2 that I want. Any help would be appreciated. I already achieved this in matlab with for and if loop but I am just wondering how I can solve this in R in a loopless way.

See [here](http://stackoverflow.com/questions/33027611/how-to-index-a-vector-sequence-within-a-vector-sequence) a similar question on finding a sequence in a numeric vector. All answers there output a vector of indices of where a match occurs (here, `c(1, 8, 16)`) which you can use to subset your `my_df`. I.e. assuming `i = c(1, 8, 16)`, then `c(outer(0:2, i, "+"))` gives all indices — alexis_laz, Aug 29 '16 at 17:19

akrun · Accepted Answer · 2016-08-29T17:16:13.080

2

We can do this with data.table. Convert the 'data.frame' to 'data.table' (setDT(my_df)). Using shift from data.table, we get the next element with type = "lead". As shift takes a vector of n, we specify n = 0:2, so that we get three columns with the n = 0 corresponds to the original 'v3' column and others 1st and 2nd next values. Then, paste the elements rowwise (do.call(paste0, ...), check whether it is equal to 442, get the index of TRUE values (which), use rep to replicate the index and add with 0:2 so that we get the index of the three rows for each index. This can be used to subset the original dataset rows.

library(data.table)
setDT(my_df)[my_df[, rep(which(do.call(paste0, shift(v3, 0:2,
                 type= "lead")) == 442), each = 3) + 0:2]]
#   v1 v2 v3
#1:  1  1  4
#2:  2  1  4
#3:  3  0  2
#4:  8  1  4
#5:  9  1  4
#6: 10  0  2
#7: 16  1  4
#8: 17  1  4
#9: 18  0  2

data

my_df <- structure(list(v1 = 1:20, v2 = c(1L, 1L, 0L, 0L, 1L, 0L, 1L, 
1L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), v3 = c(4L, 
4L, 2L, 3L, 2L, 3L, 2L, 4L, 4L, 2L, 3L, 2L, 3L, 3L, 3L, 4L, 4L, 
2L, 3L, 3L)), .Names = c("v1", "v2", "v3"), class = "data.frame", 
row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20"))

edited Aug 29 '16 at 17:16

answered Aug 29 '16 at 17:10

akrun

874,273
37
540
662

One question, applying your technique on the output sequence now I am trying to count the 110 and 111 from v2. `which(do.call(paste0, shift(seq$v2, 0:2, type= "lead")) == 110) [1] 7` so I know that there is only one 110 sequence, but it doesn't work same for the 111 `which(do.call(paste0, shift(seq$v2, 0:2, type= "lead")) == 111) [1] 1 2 3 4 5 6` it gives all the indices. Updated v2 `v2 <- c(1,1,1,0,1,0,1,1,1,1,1,1,0,0,0,1,1,0,0,0)` – Amgh Sep 02 '16 at 16:06
@Amgh Your example doesn't have 111. – akrun Sep 02 '16 at 16:10
Yes, I updated v2 for further analysis `v1 <- c(1:20) > v2 <- c(1,1,1,0,1,0,1,1,1,1,1,1,0,0,0,1,1,0,0,0) > v3 <- c(4,4,2,3,2,3,2,4,4,2,3,2,3,3,3,4,4,2,3,3) > my_df <- data.frame(v1,v2,v3) # creating a datafram` – Amgh Sep 02 '16 at 16:54
@Amgh I get `setDT(my_df)[, which(do.call(paste0, shift(v3, 0:2, type= "lead")) == 111)] #integer(0)` – akrun Sep 02 '16 at 17:02
we get the output of this updated dataframe like `seq <- setDT(my_df)[my_df[, rep(which(do.call(paste0, shift(v3, 0:2, type= "lead")) == 442), each = 3) + 0:2]]` and now I did apply `which(do.call(paste0, shift(seq$v2, 0:2, type= "lead")) == 111)` output is ´[1] 1 2 3 4 5 6´ but if I do `which(do.call(paste0, shift(seq$v2, 0:2, type= "lead")) == 110)` then output is `[1] 7` – Amgh Sep 03 '16 at 10:46
@Amgh Can you update your post with a new example as the example and the output is not matching. – akrun Sep 03 '16 at 12:34

lmo · Answer 2 · 2016-08-29T17:59:46.750

As long as v3 does not have any missing values and the values of v3 are single characters, you can also use gregexpr to accomplish this as follows

# get the row indices where the pattern 442 starts c(1 , 8, 16)
rowstarts <- unlist(gregexpr("442", paste(my_df$v3, collapse="")))

# extract rows from the data fram
dfNew <- my_df[sort(c(outer(rowstarts, (0:2), "+"))), ]

which returns

dfNew
   v1 v2 v3
1   1  1  4
2   2  1  4
3   3  0  2
8   8  1  4
9   9  1  4
10 10  0  2
16 16  1  4
17 17  1  4
18 18  0  2

paste with the collapse argument turns the vector v3 into a single character string. grexpr then finds the starting position in this string for any "442" subexpression.

The final step subsets the data.frame using the outer function suggested by @alexis-laz's in the comments above.

R - Subset dataframe based on a repeated sequence

2 Answers2

data