R - remove rows from data frame that do not match (exactly) elements of list

Question

Imagine a data frame...

df <- rbind("A*YOU 1.000 0.780", "A*YOUR 1.000 0.780", "B*USE 0.800 0.678", "B*USER 0.700 1.000")
df <- as.data.frame(df)
df

... which prints...

> df
                  V1
1  A*YOU 1.000 0.780
2 A*YOUR 1.000 0.780
3  B*USE 0.800 0.678
4 B*USER 0.700 1.000

... from which I would like to remove every row that does not contain an exact match of one of the elements of a list (called tenables here), tenables <- c("A*YOU", "B*USE"), so that the outcome becomes:

> df
                  V1
1  A*YOU 1.000 0.780
2  B*USE 0.800 0.678

Any ideas on how to solve this? Many thanks in advance.

score 1 · Answer 1 · answered Dec 02 '22 at 15:17

> df[gsub("\\s*\\d+\\.*", "", df$V1) %in% tenables, ,drop=FALSE]
                 V1
1 A*YOU 1.000 0.780
3 B*USE 0.800 0.678
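
To see why this works, it can help to look at the intermediate vector that gsub() produces before %in% is applied. The sketch below rebuilds the question's df (stringsAsFactors = FALSE is an assumption, so that V1 is character on any R version) and also shows a hypothetical alternative, key, that keeps only the text before the first whitespace instead of deleting digits and dots. Neither key nor kept is part of the answer above.

```r
## rebuild the example data from the question
df <- data.frame(
  V1 = c("A*YOU 1.000 0.780", "A*YOUR 1.000 0.780",
         "B*USE 0.800 0.678", "B*USER 0.700 1.000"),
  stringsAsFactors = FALSE
)
tenables <- c("A*YOU", "B*USE")

## what gsub() leaves behind: each run of optional whitespace, digits
## and trailing dots is deleted, so only the leading label remains
stripped <- gsub("\\s*\\d+\\.*", "", df$V1)
stripped
# [1] "A*YOU"  "A*YOUR" "B*USE"  "B*USER"

## alternative (an assumption, not from the answer): drop everything
## from the first whitespace onwards
key <- sub("\\s.*$", "", df$V1)

## exact comparison of the labels against tenables
kept <- df[key %in% tenables, , drop = FALSE]
kept
```

The two variants agree on this data. They would differ if a label itself contained digits: gsub() would turn a label such as "A1*YOU" into "A*YOU", whereas the sub() variant would leave it untouched.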

Jilber Urbina

score 1 · Accepted Answer · answered Dec 02 '22 at 15:24

Since you have regex specials in tenables (* means "0 or more of the previous character/class/group"), the strings cannot be used as patterns as they are. Using fixed=TRUE in the grep call would treat them literally, but then we could neither add a word boundary nor combine the strings with |. As such, we need to find those specials and backslash-escape them. From there, we'll add \\b (word-boundary) to differentiate between YOU and YOUR, where adding a space or any other character may be over-constraining.

## clean up tenables to be regex-friendly and precise
gsub("([].*+(){}[])", "\\\\\\1", tenables)
# [1] "A\\*YOU" "B\\*USE"

## combine into a single pattern for simple use in grep
paste0("\\b(", paste(gsub("([].*+(){}[])", "\\\\\\1", tenables), collapse = "|"), ")\\b")
# [1] "\\b(A\\*YOU|B\\*USE)\\b"

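## subset your frame, keeping the rows that match
## (as first posted, this used !grepl(...), which returned rows 2 and 4 instead)
subset(df, grepl(paste0("\\b(", paste(gsub("([].*+(){}[])", "\\\\\\1", tenables), collapse = "|"), ")\\b"), V1))
#                  V1
# 1 A*YOU 1.000 0.780
# 3 B*USE 0.800 0.678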

Regex explanation:

\\b(A\\*YOU|B\\*USE)\\b
^^^                 ^^^  "word boundary", a position where a word character
                         (A-Z, a-z, 0-9, or _) meets a non-word character
                         or the begin/end of the string
   ^               ^     parens "group" the alternatives so that both \\b
                         apply to either string, not only the nearest one
    ^^^^^^^              literal "A", "*", "Y", "O", "U" (same with other string)
           ^             the "|" means "OR", so either the "A*" or the "B*" strings
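
As a rough sketch, the escape-and-join steps can be wrapped in a small helper so that the pattern is built once and reused. The function name tenable_pattern and the vector rows are hypothetical, introduced here only for illustration; the regex pieces are the ones used above.

```r
## hypothetical helper (not from the answer): escape the regex specials,
## join the alternatives with "|", and wrap the group in word boundaries
tenable_pattern <- function(x) {
  escaped <- gsub("([].*+(){}[])", "\\\\\\1", x)
  paste0("\\b(", paste(escaped, collapse = "|"), ")\\b")
}

tenables <- c("A*YOU", "B*USE")
rows <- c("A*YOU 1.000 0.780", "A*YOUR 1.000 0.780",
          "B*USE 0.800 0.678", "B*USER 0.700 1.000")

pat <- tenable_pattern(tenables)

## TRUE where a row contains one of the tenables as a whole "word"
hits <- grepl(pat, rows)
hits
# [1]  TRUE FALSE  TRUE FALSE
```

Note that the character class only escapes the specials it lists. If tenables could contain others, such as ?, ^, $, | or a backslash, the class would need to be extended accordingly.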

With a more complex example, although I am incapable of explaining why, this is the only one of the three proposed solutions that worked. Since I wanted to retain all the lines that were matched by one of the elements of `tenables` (not the opposite as proposed here), I had to remove the `!` before `grepl` though. Like this, it works perfectly, many thanks! — CNiessen, Dec 05 '22 at 14:44
Oops, not sure why I reversed the logic, glad it worked for you. — r2evans, Dec 05 '22 at 14:58

score 0 · Answer 3 · answered Dec 02 '22 at 15:53

One approach uses sapply over the strsplit result of the df column, looking only at the first space-separated entry of each row (e.g. A*YOU in A*YOU 1.000 0.780).

df[sapply(strsplit(df$V1, " "), function(x) 
  any(grepl(x[1], tenables))), , drop = FALSE]
                 V1
1 A*YOU 1.000 0.780
3 B*USE 0.800 0.678
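
One thing to be aware of: grepl(x[1], tenables) uses the first field of each row as a regular expression, so it tests for a pattern match inside the tenables rather than for equality. It selects the right rows for the example data, but a minimal sketch of a strictly exact comparison could use %in% instead. The name first_field is hypothetical, and the data frame is rebuilt with stringsAsFactors = FALSE as an assumption so that strsplit() receives character input.

```r
## rebuild the example data from the question
df <- data.frame(
  V1 = c("A*YOU 1.000 0.780", "A*YOUR 1.000 0.780",
         "B*USE 0.800 0.678", "B*USER 0.700 1.000"),
  stringsAsFactors = FALSE
)
tenables <- c("A*YOU", "B*USE")

## first space-separated field of every row
first_field <- vapply(strsplit(df$V1, " ", fixed = TRUE),
                      function(x) x[1], character(1))

## exact, non-regex comparison against tenables
res <- df[first_field %in% tenables, , drop = FALSE]
res

## why exactness matters: as a regex, "YOU" is found inside "A*YOU",
## while %in% only accepts complete, identical strings
regex_hit <- any(grepl("YOU", tenables))  # TRUE
exact_hit <- "YOU" %in% tenables          # FALSE
```

With the regex version, a row whose first field was just YOU would therefore be kept, while the %in% version would drop it.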
