Suppose table_a consists of a column of approximately 1,500 unique Chinese characters and table_b consists of a column of approximately 50,000 unique Chinese character combinations (multi-character phrases, sentences, etc. of differing lengths).

I would like to filter table_b and return only the rows whose character combinations contain only characters from the character column in table_a. Ideally, this code should also ignore any alphanumeric characters and punctuation.

Is there a way to easily do this in R, preferably in base R or with functions within the tidyverse (dplyr, stringr, etc.)? I thought about using the stringr package and regular expressions, but I'm not familiar with how that works with Chinese characters.

To slightly simplify the problem, consider the following example:

vec_a <- c("你","好","吗","不")
vec_b <- c("你好","你好吗?","我很好","我不好")

From these two vectors, I'd like to return vec_c, which is c("你好","你好吗?").

I expect that whatever logic or function does this could then be used within dplyr's filter function to achieve my goal.

Thanks for your help.

Mark
clau
  • Have a look [here](https://stackoverflow.com/questions/9576384/use-regular-expression-to-match-any-chinese-character-in-utf-8-encoding) – prosoitos Oct 21 '19 at 04:15
  • According to that post, you can use regexp with Chinese characters. So you should be able to do it with `stringr` – prosoitos Oct 21 '19 at 04:16
  • In base regex, `grep(paste0("^[", paste(vec_a, collapse = ""), "]+$"), vec_b, value = TRUE)`. With stringr, `str_subset(vec_b, paste0("^[", paste(vec_a, collapse = ""), "]+$"))` – alistaire Oct 21 '19 at 04:16
  • Both methods by @alistaire work fine. To get what you want, you may need to add "?" to `vec_a`, or find an alternative way to deal with `?` – Zhiqiang Wang Oct 21 '19 at 05:18

1 Answer

Expanding on alistaire's and Zhiqiang Wang's comments, what you want is the following:

pattern = paste0("^[", paste(vec_a, collapse = ""), "[:punct:]]+$")

Then `grep(pattern, vec_b, value = TRUE)` or `str_subset(vec_b, pattern)` will work.
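
For reference, here is a minimal base R sketch that runs this pattern against the example vectors from the question. It assumes a UTF-8 session; results on other encodings may differ.

```r
# Example vectors from the question
vec_a <- c("你", "好", "吗", "不")
vec_b <- c("你好", "你好吗?", "我很好", "我不好")

# One or more characters, each of which is either in vec_a or punctuation
pattern <- paste0("^[", paste(vec_a, collapse = ""), "[:punct:]]+$")

# Keep only the elements of vec_b that match the whole pattern
vec_c <- grep(pattern, vec_b, value = TRUE)
print(vec_c)
```

The `^` and `+$` anchors matter here: without them, any string containing at least one allowed character would match.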

`[:punct:]` adds a match for any punctuation character (see here for more information).
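
The question also asks to ignore alphanumeric characters, which the pattern above does not cover. One possible extension, sketched below, adds ASCII letters, digits, and spaces to the allowed set and applies it to a data frame. The data and the column names `character` and `phrase` are hypothetical stand-ins for table_a and table_b, and allowing spaces is an assumption. The range `A-Za-z0-9` is spelled out because `[:alnum:]` can also match Chinese characters in some regex engines, which would defeat the filter.

```r
# Hypothetical stand-ins for table_a and table_b; column names are assumptions
table_a <- data.frame(character = c("你", "好", "吗", "不"),
                      stringsAsFactors = FALSE)
table_b <- data.frame(phrase = c("你好", "你好吗?", "我很好", "你好 OK 123"),
                      stringsAsFactors = FALSE)

# Allowed set: characters from table_a, punctuation, ASCII letters/digits, and spaces
pattern <- paste0("^[", paste(table_a$character, collapse = ""),
                  "[:punct:]A-Za-z0-9 ]+$")

# Keep only rows whose phrase consists entirely of allowed characters
filtered <- table_b[grepl(pattern, table_b$phrase), , drop = FALSE]
print(filtered$phrase)
```

With dplyr loaded, the same condition should work as `filter(table_b, grepl(pattern, phrase))`, or with `str_detect(phrase, pattern)` from stringr.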

Mark