
I performed LDA in Linux and didn't get characters like "ø" in topic 2. However, when run in Windows, they show. Does anyone know how to deal with this? I used packages quanteda and topicmodels.

> terms(LDAModel1,5)
     Topic 1  Topic 2
[1,] "car"    "ø"
[2,] "build"  "ù"
[3,] "work"   "network"
[4,] "drive"  "ces"
[5,] "musk"   "new"

Edit:

Data: https://www.dropbox.com/s/tdr9yok7tp0pylz/technology201501.csv

The code is something like this:

library(quanteda)
library(topicmodels)

myCorpus <- corpus(textfile("technology201501.csv", textField = "title"))
myDfm <- dfm(myCorpus, ignoredFeatures = stopwords("english"), stem = TRUE,
             removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE)
myDfm <- removeFeatures(myDfm, c("reddit", "redditors", "redditor", "nsfw", "hey", "vs", "versus", "ur", "they'r", "u'll", "u.", "u", "r", "can", "anyone", "will", "amp", "http", "just"))
sparsityThreshold <- round(ndoc(myDfm) * (1 - 0.9999))
myDfm2 <- trim(myDfm, minDoc = sparsityThreshold)
LDAModel1 <- LDA(quantedaformat2dtm(myDfm2), 25, 'Gibbs', list(iter=4000,seed = 123))
user1569341
  • I guess different locales. –  Jan 13 '16 at 03:19
    You didn't really provide enough data to make the problem reproducible. I would guess the problem is with file encoding. Windows assumes files are in a "latin-1" encoding. Your linux OS may assume UTF-8 encoding. It is important that you know what encoding was used in your data files and to properly read the data in with the correct encoding. You don't show any of your import steps so it's hard to know what you may have done. – MrFlick Jan 13 '16 at 03:26
  • I tried different encodings like this https://support.rstudio.com/hc/en-us/articles/200532197-Character-Encoding, but it did not work. – user1569341 Jan 13 '16 at 04:50
  • Note there is a recently added `sparsity =` argument to `trim()` that mimics the tm usage, if that is how you want to think of sparsity. – Ken Benoit Jan 13 '16 at 12:22

1 Answer


It's an encoding issue, coupled with the different default locales R uses on Windows and Linux. (Try: `Sys.getlocale()`) Windows uses .1252 by default (aka "cp1252", "WINDOWS-1252") while Linux and OS X use UTF-8. My guess is that technology201501.csv is encoded as UTF-8 and is being converted to 1252 when you read it into R on Windows; the conversion mangles the words containing extended characters (those outside the 7-bit "ASCII" range), leaving the stray characters as apparent tokens (but without a reproducible example, it's impossible for me to tell). On Linux, by contrast, no conversion takes place, so the words containing "ø" etc. are preserved. The mangling happens because the conversion fails to map the UTF-8 byte sequences for these Unicode code points to their positions in the 8-bit WINDOWS-1252 encoding, even though characters such as "ø" do exist in that encoding.
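You can see the mismatch directly in R. This sketch decodes the two UTF-8 bytes for "ø" (0xC3 0xB8) under both interpretations:

```r
# The two bytes that encode "ø" in UTF-8
utf8_bytes <- as.raw(c(0xc3, 0xb8))

# Decoded as UTF-8 (the Linux default): the single character "ø"
iconv(list(utf8_bytes), from = "UTF-8", to = "UTF-8")

# Decoded as WINDOWS-1252 (what a Windows locale may assume): the two
# characters "Ã¸" -- classic mojibake
iconv(list(utf8_bytes), from = "WINDOWS-1252", to = "UTF-8")
```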

To read the file with the correct encoding, it should work if you alter your call to:

myCorpus <- corpus(textfile("technology201501.csv", textField = "title", fileEncoding = "UTF-8"))

as the last argument is passed straight to read.csv() by textfile(). (This is only true in the newest version however, 0.9.2.)
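If your installed version does not accept `fileEncoding`, a fallback sketch (assuming the "title" column named in the question) is to read the CSV yourself with an explicit encoding and build the corpus from the resulting character vector:

```r
library(quanteda)

# Read the CSV with an explicit encoding so no locale-dependent
# conversion happens on the way in
dat <- read.csv("technology201501.csv", fileEncoding = "UTF-8",
                stringsAsFactors = FALSE)

# Build the corpus directly from the character vector of titles
myCorpus <- corpus(dat$title)
```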

You can verify the encoding of your .csv file by running `file technology201501.csv` at the command line. The `file` utility is included with nearly every Linux distro and OS X, and is also installed with RTools on Windows.
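For example (the exact wording of the output varies by platform and by the file's contents):

```shell
# Report the detected encoding; a UTF-8 file typically prints
# something like "UTF-8 Unicode text"
file technology201501.csv

# MIME-style output on Linux, e.g. "text/plain; charset=utf-8"
# (on OS X the flag is -I instead of -i)
file -i technology201501.csv
```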

Ken Benoit
  • I got this warning message. The result is still the same. Warning message: In textfile("technology201501.csv", textField = "title", fileEncoding = "UTF-8") : Argument fileEncoding not used – user1569341 Jan 13 '16 at 17:24
  • What does `packageVersion("quanteda")` return? – Ken Benoit Jan 13 '16 at 17:30
  • it returns ‘0.9.2.0’ – user1569341 Jan 13 '16 at 17:50
  • I suggest you `devtools::install_github("kbenoit/quanteda")` but also file an issue on the GitHub page, I will work with you from there. Need to know the source file encoding however and ideally to reproduce the problem. – Ken Benoit Jan 13 '16 at 18:46