I am using the following code to convert a data frame of tweets into a tidy, one-token-per-row data frame:
library(dplyr)
library(stringr)
library(tidytext)

replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

tidy_tweets <- tweets %>%
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
  filter(!word %in% custom_stop_words2$word,
         str_detect(word, "[a-zäöüß]"))
However, this produces a tidy data frame in which the German characters ü, ä, ö, and ß are stripped from the newly created word column. For example, "wählen" becomes two separate tokens, "w" and "hlen", with the special character removed.
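As a minimal reproduction of the symptom (this is my diagnosis, not confirmed): the tokenizing pattern itself seems to treat non-ASCII letters as separators, since "ä" is not matched by the literal class A-Za-z:

```r
library(stringr)

unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

# "ä" falls into the negated class [^A-Za-z_\d#@'], so it is consumed
# as a token boundary and the word is split around it
str_split("wählen", unnest_reg)
# [[1]]
# [1] "w"    "hlen"
```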
I am trying to build a tidy data frame of German words so I can do text analysis and compute term frequencies.
Could someone point me in the right direction for how to approach this problem?