Filtering criteria for rows that only contain certain Chinese characters

Question

Suppose table_a consists of a column of approximately 1,500 unique Chinese characters and table_b consists of a column of approximately 50,000 unique Chinese character combinations (multi-character phrases, sentences, etc. of differing lengths).

I would like to be able to filter through table_b and return only the rows in which the character combinations only contain characters from the character column in table_a. Ideally, this code should also ignore any alphanumeric characters and punctuation.

Is there a way to easily do this in R, preferably in base R or with functions within the tidyverse (dplyr, stringr, etc.)? I thought about using the stringr package and regular expressions, but I'm not familiar with how that works with Chinese characters.

To slightly simplify the problem, consider the following example:

    vec_a <- c("你", "好", "吗", "不")
    vec_b <- c("你好", "你好吗？", "我很好", "我不好")

From these two vectors, I'd like to return `vec_c`, which is `c("你好","你好吗？")`.

I expect that whatever logic or function solves this can be used inside dplyr's `filter` function to achieve my goal.

Thanks for your help.

Have a look [here](https://stackoverflow.com/questions/9576384/use-regular-expression-to-match-any-chinese-character-in-utf-8-encoding) — prosoitos, Oct 21 '19 at 04:15
According to that post, you can use regexp with Chinese characters. So you should be able to do it with `stringr` — prosoitos, Oct 21 '19 at 04:16
In base regex, `grep(paste0("^[", paste(vec_a, collapse = ""), "]+$"), vec_b, value = TRUE)`. With stringr, `str_subset(vec_b, paste0("^[", paste(vec_a, collapse = ""), "]+$"))` — alistaire, Oct 21 '19 at 04:16
Both methods by @alistaire work fine. To get the result you want, you may need to add "？" to `vec_a`, or find another way to deal with `？` — Zhiqiang Wang, Oct 21 '19 at 05:18

score 0 · Answer 1 · answered Jul 28 '23 at 05:38

Expanding on alistaire's and Zhiqiang Wang's comments, what you want is the following:

    pattern <- paste0("^[", paste(vec_a, collapse = ""), "[:punct:]]+$")

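Then `grep(pattern, vec_b, value = TRUE)` or `str_subset(vec_b, pattern)` will work.

`[:punct:]` adds a match for any punctuation character (see here for more information). With stringr this class is Unicode-aware, so it also covers the full-width `？` in the example; in base R the behaviour of character classes can depend on the session's locale.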
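
As a quick check, the pattern can be applied to the question's example vectors and then used inside `filter`. This is only a minimal sketch, assuming a UTF-8 session with stringr and dplyr installed. The tibble `table_b` and its column name `phrase` are stand-ins, since the question does not give the real column names. The extra `A-Za-z0-9` range is my assumption about how to cover the "ignore alphanumeric characters" requirement; `[:alnum:]` is avoided because stringr's regex engine treats Chinese characters as letters, so it would let every character through.

```r
library(stringr)
library(dplyr)

vec_a <- c("你", "好", "吗", "不")
vec_b <- c("你好", "你好吗？", "我很好", "我不好")

# One character class: the allowed Chinese characters, any punctuation,
# and ASCII letters/digits (the last part is an assumption, see above).
pattern <- paste0("^[", paste(vec_a, collapse = ""), "[:punct:]A-Za-z0-9]+$")

# Vector version: keeps "你好" and "你好吗？"
vec_c <- str_subset(vec_b, pattern)

# Data-frame version; `table_b` and `phrase` are hypothetical names
table_b <- tibble(phrase = vec_b)
filtered <- filter(table_b, str_detect(phrase, pattern))
```

Note that this pattern also accepts strings made up only of punctuation or alphanumerics. If the allowed list ever contains characters that are special inside a bracket expression (such as `]`, `^`, `-` or `\`), they would need escaping before being pasted into the pattern.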
