I am trying to learn how to do some text analysis with Twitter data. I am running into an issue when creating a Term Frequency Matrix. I can create the Corpus out of Spanish text (with special characters) with no issues.
However, when I create the Term Frequency Matrix (with either the quanteda or the tm library), the Spanish characters do not display as expected (instead of seeing canción, I see canciÃ³n).
Any suggestions on how I can get the Term Frequency Matrix to store the text with the correct characters?
Thank you for any help.
As a note: I prefer using the quanteda library, since ultimately I will be creating a wordcloud, and I think I better understand this library's approach. I am also using a Windows machine.
I have tried Encoding(tw2) <- "UTF-8" with no luck.
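For context, garbled output of this kind typically looks like UTF-8 bytes being re-read as Latin-1 (or Windows-1252), where the two bytes of "ó" show up as two separate characters. The following is a minimal base-R sketch of that mechanism, not a confirmed diagnosis of what quanteda or tm do internally; the variable names `x`, `mangled`, and `repaired` are made up for illustration.

```r
# "canción", written with a \u escape so the example does not depend on
# the encoding of the script file itself.
x <- "canci\u00f3n"
Encoding(x)    # "UTF-8"
charToRaw(x)   # 63 61 6e 63 69 c3 b3 6e  (the accented "o" takes two bytes: c3 b3)

# Pretend those bytes are Latin-1 and convert them "again" to UTF-8.
# Each of the two bytes becomes its own character (U+00C3 and U+00B3).
mangled <- iconv(x, from = "latin1", to = "UTF-8")
mangled

# Undo the damage: convert back to single bytes, then declare them as UTF-8.
repaired <- iconv(mangled, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"
repaired == x  # TRUE
```

If the terms in the matrix match `mangled` rather than `x`, that would suggest the text is being decoded twice somewhere between the corpus and the matrix.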
library(dplyr)
library(tm)
library(quanteda)
#' Creating a character string with special Spanish characters:
tw2 <- "RT @None: Enmascarados, si masduro chingán a tarek. Si quieres ahora, la aguantas canción . https://t."
# Cleaning the tweet: removing special punctuation, numbers, http links, extra spaces:
clean_tw2 <- tolower(tw2)
clean_tw2 = gsub("&", "", clean_tw2)
clean_tw2 = gsub("(rt|via)((?:\\b\\W*@\\w+)+)", "", clean_tw2)
clean_tw2 = gsub("@\\w+", "", clean_tw2)
clean_tw2 = gsub("[[:punct:]]", "", clean_tw2)
clean_tw2 = gsub("http\\w+", "", clean_tw2)
clean_tw2 = gsub("[ \t]{2,}", " ", clean_tw2)  # collapse runs of spaces/tabs to one space so words are not glued together
clean_tw2 = gsub("^\\s+|\\s+$", "", clean_tw2)
# creates a vector with common stopwords, and other words which I want removed.
myStopwords <- c(stopwords("spanish"),"tarek","vez","ser","ahora")
clean_tw2 <- (removeWords(clean_tw2,myStopwords))
# If we print clean_tw2 we see that all the characters are displayed as expected.
clean_tw2
#'Create Corpus Using quanteda library
corp_quan<-corpus(clean_tw2)
# The corpus created via quanteda, displays the characters as expected.
corp_quan$documents$texts
#' Create Corpus using the tm library
corp_td<-Corpus(VectorSource(clean_tw2))
#' Common Spanish stopwords were already removed above with removeWords().
#' If we inspect corp_td, we see that the characters and words are displayed correctly.
inspect(corp_td)
# Create the DFM with quanteda library.
tdm_quan<-dfm(corp_quan)
# Here we see that the Spanish characters are displayed incorrectly, for example: canción = canciÃ³n
tdm_quan
# Create the TDM with the tm library
tdm_td<-TermDocumentMatrix(corp_td)
# Here we see that the Spanish characters are displayed incorrectly (e.g. canción = canciÃ), and "si" is missing.
tdm_td$dimnames$Terms
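
To narrow down at which step the characters change, it may help to compare the declared encoding and the raw bytes of the text at each stage. The sketch below uses only base R; `describe_encoding` is a hypothetical helper written for this post, not a function from tm or quanteda.

```r
# Hypothetical helper (not part of tm or quanteda): summarise how R sees
# each string in a character vector.
describe_encoding <- function(x) {
  data.frame(
    text       = x,
    declared   = Encoding(x),   # "unknown", "latin1", "UTF-8" or "bytes"
    valid_utf8 = validUTF8(x),  # TRUE if the raw bytes form valid UTF-8
    bytes      = vapply(x, function(s) paste(charToRaw(s), collapse = " "),
                        character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Stand-alone demonstration with a correctly encoded string and a
# doubly encoded (mangled) one. The "bytes" column shows the difference:
# c3 b3 for the correct string versus c3 83 c2 b3 for the mangled one.
describe_encoding(c("canci\u00f3n", "canci\u00c3\u00b3n"))
```

In the session above, one could call `describe_encoding(clean_tw2)` and `describe_encoding(tdm_td$dimnames$Terms)` and check whether the byte sequences for the same word differ between the cleaned text and the matrix terms.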