
I am using the following code to convert a data frame to a tidy data frame:

replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
tidy_tweets <- tweets %>%
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
  filter(!word %in% custom_stop_words2$word,
         str_detect(word, "[a-zäöüß]"))

However, this produces a tidy data frame in which the German characters ü, ä, ö, and ß are stripped from the newly created word column; for example, "wählen" is split into two words, "w" and "hlen", and the special character is lost.

I am trying to get a tidy data frame of German words to do text analysis and term frequencies.

Could someone point me in the right direction for how to approach this problem?

mundos
    Replace all `A-Za-z` with `[:alpha:]`. Well, `A-Za-z\\d` can be replaced with `[:alnum:]`. Note: this does not always work, so please check on your end. – Wiktor Stribiżew Jul 25 '17 at 13:25
    if you tokenize with the `cleanNLP` package you can use `init_tokenizers(locale = "German")` – s.brunel Jul 25 '17 at 13:52
  • you could also change your `[A-Za-z]` to a hex range (see http://www.asciitable.com/) and specify all the characters you want to keep, along the lines of `[^\x20-\x7E]`, adding `\x{germanstuff}` for the other ones – sniperd Jul 25 '17 at 14:41
  • Thanks @WiktorStribiżew, `[:alnum:]` has nearly done the trick! I still have some stray letters that appear as words, but that does not seem related to this problem. – mundos Jul 26 '17 at 07:42

1 Answer


You need to replace all A-Za-z\\d in your bracket expressions with [:alnum:].

In ICU regular expressions (the engine behind stringr, which tidytext uses), the POSIX character class [:alnum:] matches Unicode letters and digits, not just the ASCII range.

replace_reg <- "https://t.co/[[:alnum:]]+|http://[[:alnum:]]+|&amp;|&lt;|&gt;|RT|https"
unnest_reg <- "([^[:alnum:]_#@']|'(?![[:alnum:]_#@]))"

If you are using these patterns with stringr functions, you may also consider using [\\p{L}\\p{N}] instead, as in

unnest_reg <- "([^\\p{L}\\p{N}_#@']|'(?![\\p{L}\\p{N}_#@]))"

where \p{L} matches any Unicode letter and \p{N} matches any Unicode number (which includes the digits of any script).
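The difference is easy to see on the word from the question. A minimal sketch, assuming only the stringr package is loaded; the single word "wählen" stands in for the tweet text:

```r
library(stringr)

word <- "wählen"

# ASCII-only class: "ä" is not in A-Za-z, so it acts as a separator
# and the word is torn apart, losing the umlaut
str_split(word, "[^A-Za-z]")[[1]]
#> "w"    "hlen"

# Unicode letter class: "ä" counts as a letter, so nothing matches
# the separator pattern and the word survives intact
str_split(word, "[^\\p{L}]")[[1]]
#> "wählen"
```

The same substitution inside unnest_reg is what keeps unnest_tokens() from splitting German words at their umlauts.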

Wiktor Stribiżew