4

I have a big file with 107635 rows, and 3 columns: subject, regions of interest (ROIs), and number of the trial. The ROIs can be A, B, C, D, E, F. What I want to do is to keep only those trials where in the column ROI I have a consecutive sequence of B, C, D, the first time that B appears. It doesn't matter how many times B, C and D occur.

In the example below, I can keep ntrial 78 and 201, because the first time that B appeared was followed by C and D. However, I need to remove the ntrial 10 and 400. In the trial 10 B, C and D are not consecutive. In the trial 400 the first time that B appears, B is not followed by C and D.

For the output, I just need a column with a value of 1 for the trials to keep, in each row, and a value of 0 for the rows corresponding to the trials to remove.

Any suggestion on how to create a code that can automatise the procedure, without visually inspect each trial?

Many thanks!

subject ROI ntrial output
sbj05   A   78     1
sbj05   A   78     1
sbj05   A   78     1
sbj05   A   78     1
sbj05   A   78     1
sbj05   A   78     1
sbj05   B   78     1
sbj05   B   78     1
sbj05   C   78     1
sbj05   D   78     1
sbj05   E   78     1
sbj05   E   78     1
sbj05   E   78     1
sbj05   A   201    1
sbj05   A   201    1
sbj05   A   201    1
sbj05   A   201    1
sbj05   A   201    1
sbj05   B   201    1
sbj05   C   201    1
sbj05   D   201    1
sbj05   E   201    1
sbj05   E   201    1
sbj05   E   201    1
sbj05   F   201    1
sbj05   F   201    1
sbj05   A   10     0
sbj05   A   10     0
sbj05   A   10     0
sbj05   A   10     0
sbj05   B   10     0
sbj05   A   10     0
sbj05   C   10     0
sbj05   D   10     0
sbj05   E   10     0
sbj05   E   10     0
sbj05   A   400    0
sbj05   A   400    0
sbj05   A   400    0
sbj05   B   400    0
sbj05   A   400    0
sbj05   B   400    0
sbj05   C   400    0
sbj05   C   400    0
sbj05   C   400    0
sbj05   D   400    0
sbj05   E   400    0
sbj05   E   400    0
sbj05   D   400    0
dede
  • 1,129
  • 5
  • 15
  • 35
  • 2
    In the ntrial 78, it is `B B C D` so it is not the first B followed by C D – akrun Jul 15 '15 at 19:27
  • 1
    Please provide the desired output as your description is a bit confusing. – David Arenburg Jul 15 '15 at 19:51
  • @akrun what I mean by first B is the first time that B appears in the ROI column. Then, it doesn't matter to me if B is followed by many B before going to C and D. The important thing is that the only consecutive letters that can appear are "B", "C", and "D". Hope it clarifies.. – dede Jul 15 '15 at 19:52
  • 1
    @DavidArenburg, I included the column of the output I would like to obtain. – dede Jul 15 '15 at 19:55
  • Here is another one with base R: `df[as.logical(with(df, ave(ROI, list(ntrial), FUN = function(x) grepl("BCD", paste(x, collapse = ""))))),]` – Vlo Jul 15 '15 at 20:05
  • 1
    @Vlo That is not considering that `BCD` occurs after the first time `B` appears – akrun Jul 15 '15 at 20:08
  • @akrun Didn't notice that condition. `df[as.logical(with(df, ave(ROI, list(ntrial), FUN = function(x) grepl("BCD", paste(x[(which(x == "B")[1]):length(x)], collapse = ""))))),]` – Vlo Jul 15 '15 at 20:15
  • You should make an example when `B C D` happens when the `B` isn't appearing first. – David Arenburg Jul 15 '15 at 20:17
  • @Vlo But what if the `BCD` occurs somewhere in the middle For example. `df$ROI[45:47] <- c('B', 'C', 'D')` – akrun Jul 15 '15 at 20:17
  • @Vlo You have to check the results as I didn't get the expected result `unique(df[as.logical(with(df, ave(ROI, list(ntrial), FUN = function(x) grepl("BCD", paste(x[(which(x == "B")[1]):length(x)], collapse = ""))))),]$ntrial)#[1] 78 201 400` – akrun Jul 15 '15 at 21:14
  • If one of the answers worked for you, it would be appreciated if you accept the answer. This will give future readers a clue about the value of the solution. See also this help page: [What should I do when someone answers my question?](http://stackoverflow.com/help/someone-answers) – Jaap Oct 14 '15 at 11:14

3 Answers3

4

Here's an attempt using data.table and stringi

First, I'm defining some helper function that will help me detect first accurances of B per group and validate that they are followed by the correct sequence

Myfunc <- function(x) {
               which(x == "B")[1L] == 
               stri_locate_first_regex(paste(x, collapse = ""), 'B*CD')[, 1L]
              } 

Then, the implementation is straight forward

library(data.table)
library(stringi)
setDT(df)[, if(Myfunc(ROI)) .SD, by = .(subject, ntrial)]
#     subject ntrial ROI
#  1:   sbj05     78   A
#  2:   sbj05     78   A
#  3:   sbj05     78   A
#  4:   sbj05     78   A
#  5:   sbj05     78   A
#  6:   sbj05     78   A
#  7:   sbj05     78   B
#  8:   sbj05     78   B
#  9:   sbj05     78   C
# 10:   sbj05     78   D
# 11:   sbj05     78   E
# 12:   sbj05     78   E
# 13:   sbj05     78   E
# 14:   sbj05    201   A
# 15:   sbj05    201   A
# 16:   sbj05    201   A
# 17:   sbj05    201   A
# 18:   sbj05    201   A
# 19:   sbj05    201   B
# 20:   sbj05    201   C
# 21:   sbj05    201   D
# 22:   sbj05    201   E
# 23:   sbj05    201   E
# 24:   sbj05    201   E
# 25:   sbj05    201   F
# 26:   sbj05    201   F

Or, if you just want an additional column you could do

setDT(df)[, output := +Myfunc(ROI), by = .(subject, ntrial)]
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • How would it be possible to use your solution when instead of having "B", "C" and "D" we have strings of letters like "_target1", "_target2", "_target3", in which the first part of the string differs for each observation but the second part of the string is the same (e.g., abcdsgf_target1, hjgjkfkgkr_target1, jjjjdsds_target2, ljkg_target3, hascj_target3)? – dede Apr 25 '17 at 13:19
  • I don't remember what was the original question already but in order to extract these strings you could simply use something like `sub(".*_", "", ROI)` (untested) and then adjust the code accordingly. – David Arenburg Apr 25 '17 at 13:23
1

Here's another one:

idx <- sapply(split(df, df$ntrial), function(x) { 
  B <- with(rle(x$ROI == "B"),  sum(lengths[seq(which.max(values))]))
  all(x$ROI[B:(B+2)] == c("B", "C", "D"))
})
subset(df, ntrial %in% names(which(idx)))
lukeA
  • 53,097
  • 5
  • 97
  • 100
0

A base R way with match and rle:

df$ output <- +as.logical(ave(as.character(df$ROI), df$ntrial, FUN=function(x) {rle(x[match("B",x):length(x)])$values[2] == "C"}))
#     subject ROI ntrial output
# 1    sbj05   A     78      1
# 2    sbj05   A     78      1
# 3    sbj05   A     78      1
# 4    sbj05   A     78      1
# 5    sbj05   A     78      1
# 6    sbj05   A     78      1
# 7    sbj05   B     78      1
# 8    sbj05   B     78      1
# 9    sbj05   C     78      1
# 10   sbj05   D     78      1
# 11   sbj05   E     78      1
# 12   sbj05   E     78      1
# 13   sbj05   E     78      1
# 14   sbj05   A    201      1
# 15   sbj05   A    201      1
# 16   sbj05   A    201      1
# 17   sbj05   A    201      1
# 18   sbj05   A    201      1
# 19   sbj05   B    201      1
# 20   sbj05   C    201      1
# 21   sbj05   D    201      1
# 22   sbj05   E    201      1
# 23   sbj05   E    201      1
# 24   sbj05   E    201      1
# 25   sbj05   F    201      1
# 26   sbj05   F    201      1
# 27   sbj05   A     10      0
# 28   sbj05   A     10      0
# 29   sbj05   A     10      0
# 30   sbj05   A     10      0
# 31   sbj05   B     10      0
# 32   sbj05   A     10      0
# 33   sbj05   C     10      0
# 34   sbj05   D     10      0
# 35   sbj05   E     10      0
# 36   sbj05   E     10      0
# 37   sbj05   A    400      0
# 38   sbj05   A    400      0
# 39   sbj05   A    400      0
# 40   sbj05   B    400      0
# 41   sbj05   A    400      0
# 42   sbj05   B    400      0
# 43   sbj05   C    400      0
# 44   sbj05   C    400      0
# 45   sbj05   C    400      0
# 46   sbj05   D    400      0
# 47   sbj05   E    400      0
# 48   sbj05   E    400      0
# 49   sbj05   D    400      0

the columns not being aligned were bugging me

Community
  • 1
  • 1
Pierre L
  • 28,203
  • 6
  • 47
  • 69