How to identify sequences in R

Question

This question is an extension of R - identify consecutive sequences

I have a data frame from which I need to keep only those trials in which the ROI column contains _aCORRECT1 immediately followed by _CORRECT1. It doesn't matter how many times _aCORRECT1 and _CORRECT1 occur; each can be repeated.

In the example below, I can keep ntrial 78 and 201, because _aCORRECT1 is followed by _CORRECT1. However, I need to remove ntrial 10 and 400: in trial 10, _aCORRECT1 is not followed by _CORRECT1, and in trial 400, _CORRECT1 is not preceded by _aCORRECT1.

Many thanks!

subject ROI                 ntrial 
sbj05   ff                  78     
sbj05   as                  78     
sbj05   fgfsd               78     
sbj05   sgf                 78     
sbj05   jh                  78     
sbj05   sgsgsfg             78     
sbj05   fgsfg               78     
sbj05   sgf_aCORRECT1       78     
sbj05   dfs_CORRECT1        78     
sbj05   ffg                 78     
sbj05   sdfdsf              78     
sbj05   sl                  78     
sbj05   wgrt                78     
sbj05   qswefrd             201    
sbj05   ssdg                201    
sbj05   sdgfdsg             201    
sbj05   sgsgd               201    
sbj05   sgsdg               201    
sbj05   dd_aCORRECT1        201    
sbj05   dd_aCORRECT1        201    
sbj05   ffds_CORRECT1       201    
sbj05   ffds_CORRECT1       201    
sbj05   ffds_CORRECT1       201    
sbj05   hy                  201    
sbj05   gfg                 201    
sbj05   nbc                 201    
sbj05   cvbvn               10     
sbj05   kpj                 10     
sbj05   nbvnb               10     
sbj05   mnm                 10     
sbj05   dghsfh_aCORRECT1    10     
sbj05   gdh                 10   
sbj05   fgjj                10     
sbj05   gnjdg               10     
sbj05   gf                  10     
sbj05   qw                  400    
sbj05   vfs                 400    
sbj05   zx                  400    
sbj05   zvzv                400    
sbj05   zvzv_CORRECT1       400    
sbj05   zvzd_CORRECT1       400    
sbj05   zvv                 400    
sbj05   cv                  400    
sbj05   v                   400    
sbj05   mngy                400

score 1 · Answer 1 · answered Apr 20 '17 at 17:20

Using dplyr, df1 is a data frame telling you which values of ntrial should be kept. This is done by setting logical indicators for _aCORRECT and _CORRECT and checking, within each ntrial group, whether a row flagged _aCORRECT is immediately followed by a row flagged _CORRECT. df2 is then the version of df containing only the valid ntrials.

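library(dplyr)

# Flag each row, then check per ntrial whether an aCORRECT row is directly followed by a _CORRECT row
df1 <- df %>% mutate(aCOR=grepl("aCORRECT",ROI),COR=grepl("_CORRECT",ROI)) %>%
              group_by(ntrial) %>% summarise(keep=any(aCOR & lead(COR)))

# Keep only the rows of df whose ntrial has keep=TRUE
df2 <- df[df$ntrial %in% df1$ntrial[df1$keep],]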


df1
# A tibble: 4 × 2
  ntrial  keep
   <int> <lgl>
1     10 FALSE
2     78  TRUE
3    201  TRUE
4    400 FALSE

df2
   subject           ROI ntrial
1    sbj05            ff     78
2    sbj05            as     78
3    sbj05         fgfsd     78
4    sbj05           sgf     78
5    sbj05            jh     78
6    sbj05       sgsgsfg     78
7    sbj05         fgsfg     78
8    sbj05 sgf_aCORRECT1     78
9    sbj05  dfs_CORRECT1     78
10   sbj05           ffg     78
11   sbj05        sdfdsf     78
12   sbj05            sl     78
13   sbj05          wgrt     78
14   sbj05       qswefrd    201
15   sbj05          ssdg    201
16   sbj05       sdgfdsg    201
17   sbj05         sgsgd    201
...

For some reason this code doesn't detect when, for example, *_CORRECT* is not preceded by *_aCORRECT*. Any idea why? — dede, Apr 21 '17 at 10:26
It works on the data you supplied. For `ntrial=400` it correctly detects that `_CORRECT` is not preceded by `_aCORRECT` and concludes that this value of `ntrial` should be excluded (i.e. `keep=FALSE` in `df1`). Is that not the behaviour you are looking for? Or do you have other data that it doesn't work on? — Andrew Gustar, Apr 21 '17 at 10:56
I have data with more subjects and it doesn't work - not sure why. I also tried to `group_by(subject,ntrial)` but it doesn't help. — dede, Apr 24 '17 at 15:30
Can you give an example of some data that it doesn't work on? — Andrew Gustar, Apr 24 '17 at 16:04
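
A possible explanation for the multi-subject problem raised in the comments, offered as an assumption since no failing data was posted: if trial numbers repeat across subjects, the final subsetting step matches on ntrial alone, so a trial that is valid for one subject is kept for every subject, even when the summary itself is grouped by subject and ntrial. A minimal sketch that does the check and the filtering inside the same groups (the two-subject data frame `df_multi` is made up for illustration):

```r
library(dplyr)

# Hypothetical data: both subjects have a trial numbered 78,
# but only sbj05 has _aCORRECT1 directly followed by _CORRECT1.
df_multi <- data.frame(
  subject = rep(c("sbj05", "sbj06"), each = 4),
  ROI     = c("ff", "sgf_aCORRECT1", "dfs_CORRECT1", "ffg",
              "qw", "zvzv_CORRECT1", "zvv", "cv"),
  ntrial  = 78,
  stringsAsFactors = FALSE
)

kept <- df_multi %>%
  group_by(subject, ntrial) %>%
  # lead() is evaluated within each subject/trial group;
  # default = FALSE avoids an NA on the last row of a group
  filter(any(grepl("aCORRECT", ROI) &
             lead(grepl("_CORRECT", ROI), default = FALSE))) %>%
  ungroup()

unique(kept$subject)
```

Because `filter` runs on the grouped data, there is no second lookup by ntrial, and trial 78 of sbj06 is dropped while trial 78 of sbj05 is kept.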

eipi10 · Answer 2 · 2017-04-20T18:35:41.067

We can extract the relevant portions of the two target strings in ROI and then filter to select only those values of ntrial where the two target strings occur consecutively.

library(dplyr)
library(stringr)

df %>% group_by(subject, ntrial) %>%
  filter(grepl("_aCORR_CORR", paste(str_extract(ROI, "_a?CORR"), collapse="")))

   subject           ROI ntrial
1    sbj05            ff     78
2    sbj05            as     78
3    sbj05         fgfsd     78
4    sbj05           sgf     78
5    sbj05            jh     78
6    sbj05       sgsgsfg     78
7    sbj05         fgsfg     78
8    sbj05 sgf_aCORRECT1     78
9    sbj05  dfs_CORRECT1     78
10   sbj05           ffg     78
11   sbj05        sdfdsf     78
12   sbj05            sl     78
13   sbj05          wgrt     78
14   sbj05       qswefrd    201
15   sbj05          ssdg    201
16   sbj05       sdgfdsg    201
17   sbj05         sgsgd    201
18   sbj05         sgsdg    201
19   sbj05  dd_aCORRECT1    201
20   sbj05  dd_aCORRECT1    201
21   sbj05 ffds_CORRECT1    201
22   sbj05 ffds_CORRECT1    201
23   sbj05 ffds_CORRECT1    201
24   sbj05            hy    201
25   sbj05           gfg    201
26   sbj05           nbc    201
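
To see why the filter works, it can help to look at the string that `paste()` builds for each trial. The sketch below uses a made-up, shortened data frame `df_small` (not the full data from the question). Rows without a match give `NA` from `str_extract`, which `paste` turns into the text "NA", so unmatched rows sit between the markers and keep a non-adjacent pair from matching.

```r
library(dplyr)
library(stringr)

# Hypothetical data: trial 78 has an adjacent pair,
# trial 10 has the two markers separated by another row.
df_small <- data.frame(
  subject = "sbj05",
  ROI     = c("ff", "sgf_aCORRECT1", "dfs_CORRECT1",
              "dghsfh_aCORRECT1", "gdh", "gf_CORRECT1"),
  ntrial  = c(78, 78, 78, 10, 10, 10),
  stringsAsFactors = FALSE
)

collapsed <- df_small %>%
  group_by(subject, ntrial) %>%
  # One collapsed marker string per subject/trial
  summarise(marks = paste(str_extract(ROI, "_a?CORR"), collapse = ""),
            .groups = "drop") %>%
  mutate(keep = grepl("_aCORR_CORR", marks))

collapsed
```

Here `marks` is "NA_aCORR_CORR" for trial 78 and "_aCORRNA_CORR" for trial 10, so only trial 78 passes the `grepl` test.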

Here's a data.table version that also uses base R gsub instead of str_extract:

library(data.table)

setDT(df)[, .SD[grepl("_aCORR_CORR", paste(gsub(".*(_a?CORR).*","\\1", ROI),collapse=""))], by=.(subject,ntrial)]
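
One difference from the stringr version is worth noting: when the pattern does not match, `gsub` returns the input string unchanged rather than `NA`, so the unmatched ROI text itself ends up between the markers. A small base R sketch with made-up values:

```r
roi <- c("ff", "sgf_aCORRECT1", "dfs_CORRECT1", "ffg")  # hypothetical values

# Matching elements are reduced to the captured marker;
# non-matching elements pass through as-is
marks <- gsub(".*(_a?CORR).*", "\\1", roi)
marks

# The collapsed string still contains the adjacent pair
has_pair <- grepl("_aCORR_CORR", paste(marks, collapse = ""))
has_pair
```

The separation between non-adjacent markers therefore relies on the unmatched ROI values being non-empty strings, which holds for the data shown in the question.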
