You cannot use the %in%
operator in this way in R. It is used to determine whether an element of a vector is in another vector, not like in
in Python which can be used to match a substring: Look at this:
"species1" %in% "genus1|species1|strain1" # FALSE
"species1" %in% c("genus1", "species1", "strain1") # TRUE
You can, however, use grepl
for this (the l
is for logical, i.e. it returns TRUE
or FALSE
).
grepl("species1", "genus1|species1|strain1") # TRUE
There's an additional complication here in that you cannot use grepl
with a vector, as it will only compare the first value:
grepl(file1$V1, "genus1|species1|strain1")
[1] TRUE
Warning message:
In grepl(file1$V1, "genus1|species1|strain1") :
argument 'pattern' has length > 1 and only the first element will be used
The above simply tells you that the first element of file1$V1
is in "genus1|species1|strain1"
.
Furthermore, you want to compare each element in file1$V1
to an entire vector of strings, rather than just one string. That's OK but you will get a vector the same length as the second vector as an output:
grepl("species1", file2$V1)
[1] TRUE TRUE TRUE FALSE FALSE
We can just see if any()
of those are a match. As you've tagged your question with tidyverse
, here's a dplyr
solution:
library(dplyr)
file1 |>
rowwise() |> # This makes sure you only pass one element at a time to `grepl`
mutate(
in_v2 = any(grepl(V1, file2$V1))
) |>
filter(!in_v2)
# A tibble: 1 x 2
# Rowwise:
# V1 in_v2
# <chr> <lgl>
# 1 species14341 FALSE