Suppose table_a consists of a column of approximately 1,500 unique Chinese characters, and table_b consists of a column of approximately 50,000 unique Chinese character combinations (multi-character phrases, sentences, etc. of differing lengths). I would like to filter table_b and return only the rows whose character combinations contain only characters from the character column in table_a. Ideally, this code should also ignore any alphanumeric characters and punctuation.
Is there a way to do this easily in R, preferably in base R or with functions from the tidyverse (dplyr, stringr, etc.)? I thought about using the stringr package and regular expressions, but I'm not familiar with how those work with Chinese characters.
To slightly simplify the problem, consider the following example:
vec_a <- c("你","好","吗","不")
vec_b <- c("你好","你好吗?","我很好","我不好")
From these two vectors, I'd like to return vec_c, which is c("你好","你好吗?").
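To illustrate the kind of logic I'm after, here is a rough sketch of how I imagine it might look with stringr (I haven't tested it, and I'm only guessing that \p{Han} is the right way to pick out the Chinese characters):

library(stringr)

vec_a <- c("你", "好", "吗", "不")
vec_b <- c("你好", "你好吗?", "我很好", "我不好")

# Extract only the Han (Chinese) characters from each string, ignoring
# letters, digits, and punctuation, then keep a string only if every
# extracted character appears in the allowed set.
han_chars <- str_extract_all(vec_b, "\\p{Han}")
keep <- vapply(han_chars, function(cc) all(cc %in% vec_a), logical(1))
vec_c <- vec_b[keep]
# vec_c should be c("你好", "你好吗?")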
I'm thinking that whatever logic/function is used for this can then be used within dplyr's filter() function to achieve my goal, roughly as sketched below.
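For example, if that check were wrapped in a helper (here called all_chars_allowed(), a name I'm making up, with phrase and character as placeholder column names), I imagine the dplyr step would look something like:

library(dplyr)
library(stringr)

# Same idea as the sketch above, wrapped so it can be used inside filter().
all_chars_allowed <- function(x, allowed) {
  vapply(
    str_extract_all(x, "\\p{Han}"),
    function(cc) all(cc %in% allowed),
    logical(1)
  )
}

# `phrase` and `character` are placeholder column names for table_b and table_a.
filtered_b <- table_b %>%
  filter(all_chars_allowed(phrase, table_a$character))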
Thanks for your help.