I am using the tm and Snowball packages in R for text mining. I initially ran my code on a laptop running Windows 7 with 8 GB of memory. Later I tried the same code on a Linux (Ubuntu) machine with 64 GB of memory. Both machines are 64-bit, and I am using the 64-bit version of R on both. However, the Windows machine has R 3.0.0, whereas the Linux machine has R 2.14.

Some of the commands are extremely slow on Linux compared to Windows.

Corpus Command

On Windows

    d <- data.frame(chatTranscripts$chatConcat)
    ds <- DataframeSource(d)
    t1 <- Sys.time()
    dsc<-Corpus(ds)
    print(Sys.time() - t1)
    Time difference of 46.86169 secs

This took only 47 secs on the Windows machine.

On Linux

    t1 <- Sys.time()
    dsc<-Corpus(ds)
    print(Sys.time() - t1)
    Time difference of 3.674376 mins

This took around 220 secs on the Linux machine.

Snowball Stemming

On Windows

    t1 <- Sys.time()
    dsc <- tm_map(dsc,stemDocument)
    print(Sys.time() - t1)
    Time difference of 12.05321 secs

This took only 12 secs on the Windows machine.

On Linux

    t1 <- Sys.time()
    dsc <- tm_map(dsc,stemDocument)
    print(Sys.time() - t1)
    Time difference of 4.832964 mins

This took around 290 secs on the Linux machine.

Is there a way to speed up these commands on the Linux machine? Could the difference in R versions account for such a big gap? Thank you.
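
Because the two machines differ in R version as well as operating system, it may help to record the environment next to each timing and to report every timing in the same unit. The following is a minimal sketch using base R only; `time_step` is a hypothetical helper name, not part of tm, and the loop is just a stand-in workload.

```r
# Hypothetical helper: time an expression and report elapsed seconds.
# Sys.time() differences switch between secs and mins automatically,
# whereas system.time() always reports seconds.
time_step <- function(label, expr) {
  elapsed <- system.time(expr)[["elapsed"]]
  cat(sprintf("%s: %.2f secs\n", label, elapsed))
  invisible(elapsed)
}

# Record the environment alongside the timings.
cat(R.version.string, "\n")
cat("Platform:", R.version$platform, "\n")

# Stand-in workload; replace with e.g. dsc <- Corpus(ds).
elapsed <- time_step("toy workload", for (i in 1:1e5) sqrt(i))
```

Running the same script on both machines should make it easier to tell whether the gap follows the R version or the platform.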

Ravi

  • It's possible that the R version could make a difference. There were some really big performance improvements in how R handles data frames in v2.15.1, a result of work by Tim Hesterberg. See http://blog.revolutionanalytics.com/2012/06/r-2151-dataframe-package.html – Andrie Feb 12 '14 at 10:40

1 Answer


Corpus() on VectorSource() seems to be faster than Corpus() on DataframeSource().

You can try:

    d <- chatTranscripts$chatConcat
    ds <- VectorSource(d)
    dsc <- Corpus(ds)
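
To check whether this helps on a given machine, the construction can be timed directly. This is only a sketch, assuming the tm package is installed; `docs` is made-up stand-in data for `chatTranscripts$chatConcat`, and the `as.character()` call is a precaution in case the column was read in as a factor.

```r
library(tm)

# Made-up stand-in for chatTranscripts$chatConcat.
docs <- rep(c("the customer was asking about billing",
              "agent resolved the connection issue"), times = 500)

# Precaution: make sure the input is a character vector, not a factor.
docs <- as.character(docs)

# Build the corpus straight from the vector and time it in seconds.
timing <- system.time(dsc <- Corpus(VectorSource(docs)))
print(timing[["elapsed"]])

# Number of documents in the resulting corpus.
print(length(dsc))
```

Timing the `DataframeSource()` version on the same data would show how much of the difference comes from the source type rather than from the machine.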
tweray
Abir