R: Spanish characters in a term frequency matrix with tm and quanteda

Question

I am trying to learn how to do some text analysis with Twitter data. I am running into an issue when creating a term frequency matrix. I create the corpus out of Spanish text (with special characters) with no issues.

However, when I create the term frequency matrix (with either the quanteda or the tm library), the Spanish characters do not display as expected (instead of seeing canción, I see canciÃ³n).

Any suggestions on how I can get the Term Frequency Matrix to store the text with the correct characters?

Thank you for any help.

As a note: I prefer using the quanteda library, since ultimately I will be creating a wordcloud, and I think I better understand this library's approach. I am also using a Windows machine.

I have tried Encoding(tw2) <- "UTF-8" with no luck.

library(dplyr)
library(tm)
library(quanteda)

#' Creating a character with special Spanish characters:
tw2 <- "RT @None: Enmascarados, si masduro chingán a tarek. Si quieres ahora, la aguantas canción  . https://t."


# Cleaning the tweet: removing special punctuation, numbers, http links, extra spaces:
clean_tw2 <- tolower(tw2)
clean_tw2 = gsub("&amp", "", clean_tw2)
clean_tw2 = gsub("(rt|via)((?:\\b\\W*@\\w+)+)", "", clean_tw2)
clean_tw2 = gsub("@\\w+", "", clean_tw2)
clean_tw2 = gsub("[[:punct:]]", "", clean_tw2)
clean_tw2 = gsub("http\\w+", "", clean_tw2)
clean_tw2 = gsub("[ \t]{2,}", "", clean_tw2)
clean_tw2 = gsub("^\\s+|\\s+$", "", clean_tw2) 

# creates a vector with common stopwords, and other words which I want removed.
myStopwords <- c(stopwords("spanish"),"tarek","vez","ser","ahora")
clean_tw2 <- (removeWords(clean_tw2,myStopwords))

# If we print clean_tw2 we see that all the characters are displayed as expected.
clean_tw2

#'Create Corpus Using quanteda library
corp_quan<-corpus(clean_tw2)
# The corpus created via quanteda, displays the characters as expected.
corp_quan$documents$texts

#' Create Corpus using the tm library
corp_td<-Corpus(VectorSource(clean_tw2))
#' Common Spanish words were already removed from the text above.
#' If we inspect corp_td, we see that the characters and words are displayed correctly
inspect(corp_td)

# Create the DFM with quanteda library.
tdm_quan<-dfm(corp_quan)
# Here we see that the Spanish characters are displayed incorrectly, for example: canción = canciÃ³n
tdm_quan

# Create the TDM with the tm library
tdm_td<-TermDocumentMatrix(corp_td)

# Here we see that the Spanish characters are displayed incorrectly (e.g. canción = canciÃ), and "si" is missing.
tdm_td$dimnames$Terms

score 1 · Answer 1 · answered Apr 26 '18 at 12:41

Let me guess...are you using Windows? On macOS it works fine:

clean_tw2
## [1] "enmascarados si masduro chingán   si quieres   aguantas canción"
Encoding(clean_tw2)
## [1] "UTF-8"
dfm(clean_tw2)
## Document-feature matrix of: 1 document, 7 features (0% sparse).
## 1 x 7 sparse Matrix of class "dfm"
##        features
## docs    enmascarados si masduro chingán quieres aguantas canción
##   text1            1  2       1       1       1        1       1

My system information:

sessionInfo()
# R version 3.4.4 (2018-03-15)
# Platform: x86_64-apple-darwin15.6.0 (64-bit)
# Running under: macOS High Sierra 10.13.4
# 
# Matrix products: default
# BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
# 
# locale:
# [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] tm_0.7-3       NLP_0.1-11     dplyr_0.7.4    quanteda_1.1.6
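
The locale line above is the relevant difference: this session runs in a UTF-8 locale. As a rough sketch (base R only, not part of the original answer), the following reports whether the current session's native encoding is UTF-8, which may help when comparing a Windows machine with a macOS one. The variable names `utf8_native` and `ctype` are made up for illustration.

```r
# Query the localization settings of the running R session.
info <- l10n_info()

# TRUE when the native encoding is UTF-8 (typical on macOS and Linux).
utf8_native <- isTRUE(info[["UTF-8"]])

# The character-type locale, e.g. "en_GB.UTF-8" on the macOS session above.
ctype <- Sys.getlocale("LC_CTYPE")

cat("LC_CTYPE locale:", ctype, "\n")
cat("Native encoding is UTF-8:", utf8_native, "\n")
```

If the second line prints FALSE, strings whose encoding mark is "unknown" are interpreted in the native (non-UTF-8) encoding, which is consistent with the garbled output described in the question.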

@Ken Benoit, might this be an issue with the to / from C++ travel? — phiver, Apr 26 '18 at 13:49
Check `Encoding(clean_tw)` before you send that to `corpus()`. — Ken Benoit, Apr 26 '18 at 13:50
On my machine that is `UTF-8`. Also `Encoding(corp_quan$documents$texts)` is `UTF-8`. But if I look at `Encoding(tdm_quan@Dimnames$features)` the result is unknown. — phiver, Apr 26 '18 at 13:54

score 1 · Accepted Answer · answered Apr 26 '18 at 13:47

It looks like quanteda (and tm) loses the encoding when creating the DFM on the Windows platform. In this tidytext question, the same problem occurred when unnesting tokens; that now works fine, and quanteda's tokens() also works fine. If you enforce UTF-8 or latin1 encoding on the @Dimnames$features of the dfm, you get the correct results.

....
previous code
.....
tdm_quan<-dfm(corp_quan)
# Here we see that the Spanish characters are displayed incorrectly, for example: canción = canciÃ³n
tdm_quan
Document-feature matrix of: 1 document, 8 features (0% sparse).
1 x 8 sparse Matrix of class "dfm"
       features
docs    enmascarados si masduro chingÃ¡n quieres aguantas canciÃ³n t
  text1            1  2       1        1       1        1        1 1

If you do the following:

Encoding(tdm_quan@Dimnames$features) <- "UTF-8"
tdm_quan
Document-feature matrix of: 1 document, 8 features (0% sparse).
1 x 8 sparse Matrix of class "dfm"
       features
docs    enmascarados si masduro chingán quieres aguantas canción t
  text1            1  2       1       1       1        1       1 1
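
To illustrate why re-declaring the encoding is enough, here is a minimal base R sketch that does not need quanteda or tm. It assumes the feature names still hold the correct UTF-8 bytes and only lack the encoding mark, which is what the "unknown" result reported in the comments suggests. The helper `mark_utf8()` is a hypothetical name, not a function from either package.

```r
# "canción" written with an escape so this script is ASCII-only;
# strings built from \u escapes are marked as UTF-8 by R.
word <- "canci\u00f3n"
Encoding(word)

# Simulate the loss of the mark: the bytes stay the same,
# but R no longer knows they are UTF-8.
unmarked <- word
Encoding(unmarked) <- "unknown"
Encoding(unmarked)
identical(charToRaw(word), charToRaw(unmarked))

# Hypothetical helper: re-declare the encoding without converting any bytes.
mark_utf8 <- function(x) {
  Encoding(x) <- "UTF-8"
  x
}

fixed <- mark_utf8(unmarked)
Encoding(fixed)
```

With the objects from the question, the same idea would read `tdm_quan@Dimnames$features <- mark_utf8(tdm_quan@Dimnames$features)`. For the tm matrix, `tdm_td$dimnames$Terms <- mark_utf8(tdm_td$dimnames$Terms)` might work in the same way, but that variant is an assumption and was not tested in this thread. Note that `Encoding<-` only changes the declared mark; if the bytes themselves were already converted to something else, `iconv()` would be needed instead.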

Thank you phiver. Forcing the encoding worked like a charm. Once I get my hands on an OSx I will try the original code and see if the original code works. thanks again! — Beep, Apr 26 '18 at 17:38
