
While playing around with lemmatizing, stopword removal, stemming, etc. for German text, I ran into problems with the tokens_replace() function in the quanteda package. I found a solution (see code) that seems to work, although I do not understand why. Could someone explain:

Why does tokens_replace() from the quanteda package only work correctly for German special characters when the dictionary is converted with stri_trans_general(), and not when it is converted with Encoding(), although both seem to change the encoding in the same way?

Here is a piece of reproducible code:

First, I created a dictionary of the words I would like to replace (V2) and their replacements (V1). Using tokens_replace() with this data works, but the special characters in the replacement words come out garbled, which I do not want:

dict <- as.data.frame(rbind(c("Bürger", "Bürgerin"), c("Bürger", "Bürgerinnen"), 
                            c("Menschenrecht", "Menschenrechte"), c("Menschenrecht", "Menschenrechts"),
                            c("Straße", "Straßen")))
dict$V1 <- as.character(dict$V1)
dict$V2 <- as.character(dict$V2)

library(quanteda)
tokens <- tokens(c("Bürger", "Bürgerinnen", "Menschenrechte", "Menschenrechts", "Straße", "Straßen"))
tokens <- tokens_replace(x = tokens, pattern = dict$V2, replacement = dict$V1, valuetype = "fixed")
tokens

tokens from 6 documents:
text1 : [1] "Bürger"; text2 : [1] "B\xfcrger"; text3 : [1] "Menschenrecht"; text4 : [1] "Menschenrecht"; text5 : [1] "Straße"; text6 : [1] "Stra\xdfe"

Using the same dictionary and the same tokens object, but first setting Encoding() <- "UTF-8", I get the following results. Note that I use the stringi package here only to show that the declared encoding of the dictionary entries changed from latin1/ASCII to UTF-8/ASCII.

library(stringi)
stri_enc_mark(dict$V1)
#[1] "latin1" "latin1" "ASCII"  "ASCII"  "latin1"
Encoding(dict$V1) <- "UTF-8"
Encoding(dict$V1)
#[1] "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"  
stri_enc_mark(dict$V1)
#[1] "UTF-8" "UTF-8" "ASCII" "ASCII" "UTF-8"

stri_enc_mark(dict$V2)
#[1] "latin1" "latin1" "ASCII"  "ASCII"  "latin1"
Encoding(dict$V2) <- "UTF-8"
Encoding(dict$V2)
#[1] "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"  
stri_enc_mark(dict$V2)
#[1] "UTF-8" "UTF-8" "ASCII" "ASCII" "UTF-8"

tokens <- tokens(c("Bürger", "Bürgerinnen", "Menschenrechte", "Menschenrechts", "Straße", "Straßen"))
tokens <- tokens_replace(x = tokens, pattern = dict$V2, replacement = dict$V1, valuetype = "fixed")
tokens

"tokens from 6 documents: text1 : [1] "Bürger"; text2 : [1] "Bürgerinnen"; text3 : [1] "Menschenrecht"; text4 : [1] "Menschenrecht"; text5 : [1] "Straße"; text6 : [1] "Straßen"" - So basically tokens_replace() did not replace anything.

Again, I use the original dictionary as created above and do the same transformation, but this time with stri_trans_general() from the stringi package. Now it does exactly what I want, which I do not understand, because the encoding appears to have been changed in exactly the same way (from latin1/ASCII to UTF-8/ASCII).

dict <- as.data.frame(rbind(c("Bürger", "Bürgerin"), c("Bürger", "Bürgerinnen"), 
                            c("Menschenrecht", "Menschenrechte"), c("Menschenrecht", "Menschenrechts"),
                            c("Straße", "Straßen")))
dict$V1 <- as.character(dict$V1)
dict$V2 <- as.character(dict$V2)
tokens <- tokens(c("Bürger", "Bürgerinnen", "Menschenrechte", "Menschenrechts", "Straße", "Straßen"))

stri_enc_mark(dict$V1)
dict$V1 <- stri_trans_general(dict$V1, "ASCII-Latin")
Encoding(dict$V1)
#[1] "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"  
stri_enc_mark(dict$V1)
#[1] "UTF-8" "UTF-8" "ASCII" "ASCII" "UTF-8"

stri_enc_mark(dict$V2)
dict$V2 <- stri_trans_general(dict$V2, "ASCII-Latin")
Encoding(dict$V2)
#[1] "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"  
stri_enc_mark(dict$V2)
#[1] "UTF-8" "UTF-8" "ASCII" "ASCII" "UTF-8"

tokens <- tokens_replace(x = tokens, pattern = dict$V2, replacement = dict$V1, valuetype = "fixed")
tokens

I would appreciate any comments on this. My guess is that it has to do with how Encoding() handles UTF-8 versus how stringi handles UTF-8, but I would love to get more details. Am I missing a crucial point here?
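
If my guess is right, the difference would look something like this at the byte level (just a sketch using stri_enc_toutf8() as a stand-in; as far as I understand, stringi functions convert their input to UTF-8 and return UTF-8):

library(stringi)

## "Straße" built from latin1 bytes (0xdf is "ß" in latin1)
x <- rawToChar(as.raw(c(0x53, 0x74, 0x72, 0x61, 0xdf, 0x65)))
Encoding(x) <- "latin1"
charToRaw(x)
#[1] 53 74 72 61 df 65

## stri_enc_toutf8() rewrites the bytes to real UTF-8 (c3 9f is "ß" in UTF-8),
## unlike Encoding(), which only changes the declared encoding
z <- stri_enc_toutf8(x)
charToRaw(z)
#[1] 53 74 72 61 c3 9f 65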

  • Thank you for your efforts to make my question more accessible! It looks great. – LeaK Feb 20 '19 at 13:32
    Hi I am a developer of the package. You are right that in some cases Unicode normalization `stri_trans_nfc()` is needed. You faced the problem because `tokens_replace()` does not do it internally in some cases. I will change that in the Github version. By the way, the difference between `Encoding()` and `stri_trans_*()` is that the former only changes encoding flags while the latter changes Unicode itself. – Kohei Watanabe Feb 20 '19 at 20:28
  • Thanks. That explains it. Very helpful to know what exactly I am doing. – LeaK Feb 22 '19 at 06:44
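
A minimal sketch of the normalization point from the comment above: the composed and decomposed forms of "ü" display identically but are different code point sequences, so exact matching with valuetype = "fixed" fails unless both sides are normalized, e.g. with stri_trans_nfc():

library(stringi)

composed   <- "\u00fc"   # "ü" as a single code point (U+00FC)
decomposed <- "u\u0308"  # "u" followed by a combining diaeresis (U+0308)

composed == decomposed
#[1] FALSE

stri_trans_nfc(decomposed) == composed
#[1] TRUE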

0 Answers