While experimenting with lemmatization, stopword removal, stemming, etc. on German text, I ran into problems with the tokens_replace() function from the quanteda package. I found a solution (see code below) that seems to work, although I do not understand why. So, could someone explain:
Why does tokens_replace() from the quanteda package only work correctly for German special characters when I convert the strings with stri_trans_general(), and not when I use Encoding(), even though both seem to change the encoding in the same way?
Here is a piece of reproducible code:
First, I created a dictionary with the words I would like to replace (V2) and their replacements (V1). Using tokens_replace() with this data works, but the special characters in the replaced words come out garbled (e.g. "B\xfcrger"), which I do not want:
# V1 holds the replacement (lemma), V2 the word form to be replaced
dict <- as.data.frame(rbind(c("Bürger", "Bürgerin"), c("Bürger", "Bürgerinnen"),
                            c("Menschenrecht", "Menschenrechte"), c("Menschenrecht", "Menschenrechts"),
                            c("Straße", "Straßen")))
# rbind()/as.data.frame() can create factor columns, so coerce to character
dict$V1 <- as.character(dict$V1)
dict$V2 <- as.character(dict$V2)
library(quanteda)
tokens <- tokens(c("Bürger", "Bürgerinnen", "Menschenrechte", "Menschenrechts", "Straße", "Straßen"))
tokens <- tokens_replace(x = tokens, pattern = dict$V2, replacement = dict$V1, valuetype = "fixed")
tokens
tokens from 6 documents.
text1 : [1] "Bürger"
text2 : [1] "B\xfcrger"
text3 : [1] "Menschenrecht"
text4 : [1] "Menschenrecht"
text5 : [1] "Straße"
text6 : [1] "Stra\xdfe"
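To see what tokens_replace() actually inserted, one can inspect the raw bytes of the affected token (just a diagnostic sketch; charToRaw() shows the underlying bytes, and as.list() turns a tokens object back into a plain list):

charToRaw(as.list(tokens)$text2[1])
# if this shows latin1 bytes (e.g. fc for "ü") rather than the UTF-8
# pair c3 bc, the latin1-marked strings from dict$V1 were carried over verbatim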
Using the same dictionary and the same tokens object, but setting Encoding(dict$V1) <- "UTF-8" (and likewise for V2), I get the results below. Note that I use the stringi package only to show that the encoding marks of the dictionary entries changed from latin1/ASCII to UTF-8/ASCII.
library(stringi)
stri_enc_mark(dict$V1)
#[1] "latin1" "latin1" "ASCII" "ASCII" "latin1"
Encoding(dict$V1) <- "UTF-8"  # set the declared encoding to UTF-8
Encoding(dict$V1)
#[1] "UTF-8" "UTF-8" "unknown" "unknown" "UTF-8"
stri_enc_mark(dict$V1)
#[1] "UTF-8" "UTF-8" "ASCII" "ASCII" "UTF-8"
stri_enc_mark(dict$V2)
#[1] "latin1" "latin1" "ASCII" "ASCII" "latin1"
Encoding(dict$V2) <- "UTF-8"
Encoding(dict$V2)
#[1] "UTF-8" "UTF-8" "unknown" "unknown" "UTF-8"
stri_enc_mark(dict$V2)
#[1] "UTF-8" "UTF-8" "ASCII" "ASCII" "UTF-8"
tokens <- tokens(c("Bürger", "Bürgerinnen", "Menschenrechte", "Menschenrechts", "Straße", "Straßen"))
tokens <- tokens_replace(x = tokens, pattern = dict$V2, replacement = dict$V1, valuetype = "fixed")
tokens
"tokens from 6 documents: text1 : [1] "Bürger"; text2 : [1] "Bürgerinnen"; text3 : [1] "Menschenrecht"; text4 : [1] "Menschenrecht"; text5 : [1] "Straße"; text6 : [1] "Straßen"" - So basically tokens_replace() did not replace anything.
Again, I start from the original dictionary as created above and do the same transformation, but this time with stri_trans_general() from the stringi package. Now it does exactly what I want, which I do not understand, because the encoding has been changed in exactly the same way (from latin1/ASCII to UTF-8/ASCII).
# recreate the original dictionary from above
dict <- as.data.frame(rbind(c("Bürger", "Bürgerin"), c("Bürger", "Bürgerinnen"),
                            c("Menschenrecht", "Menschenrechte"), c("Menschenrecht", "Menschenrechts"),
                            c("Straße", "Straßen")))
dict$V1 <- as.character(dict$V1)
dict$V2 <- as.character(dict$V2)
tokens <- tokens(c("Bürger", "Bürgerinnen", "Menschenrechte", "Menschenrechts", "Straße", "Straßen"))
stri_enc_mark(dict$V1)
dict$V1 <- stri_trans_general(dict$V1, "ASCII-Latin")  # stringi returns its results in UTF-8
Encoding(dict$V1)
#[1] "UTF-8" "UTF-8" "unknown" "unknown" "UTF-8"
stri_enc_mark(dict$V1)
#[1] "UTF-8" "UTF-8" "ASCII" "ASCII" "UTF-8"
stri_enc_mark(dict$V2)
dict$V2 <- stri_trans_general(dict$V2, "ASCII-Latin")
Encoding(dict$V2)
#[1] "UTF-8" "UTF-8" "unknown" "unknown" "UTF-8"
stri_enc_mark(dict$V2)
#[1] "UTF-8" "UTF-8" "ASCII" "ASCII" "UTF-8"
tokens <- tokens_replace(x = tokens, pattern = dict$V2, replacement = dict$V1, valuetype = "fixed")
tokens
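As a quick check that something different happened this time (a sketch), the bytes themselves can be inspected after the transformation:

charToRaw(dict$V2[5])
# "Straßen": the ß should now be the two-byte UTF-8 sequence c3 9f
# instead of the single latin1 byte df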
I would appreciate any comments on this. My guess is that it comes down to how Encoding() handles UTF-8 versus how stringi handles UTF-8, but I would love to get more details. Am I missing a crucial point here?
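In case it helps, here is a minimal side-by-side check, stripped of the quanteda part (a sketch, assuming the string literal is marked latin1 in the session, as in the outputs above):

x <- "Bürger"                             # marked latin1 in my session
y <- x; Encoding(y) <- "UTF-8"            # changes only the declared encoding
z <- stri_trans_general(x, "ASCII-Latin") # transliterates and returns UTF-8
charToRaw(x)  # latin1 bytes, e.g. fc for "ü"
charToRaw(y)  # expected: the same bytes as x
charToRaw(z)  # expected: "ü" as the UTF-8 pair c3 bc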