
I am trying to do LDA topic analysis using SparkR, but I am not sure what format the input data should be in.

I have a cleaned text file (I am working with the 20 Newsgroups dataset) that I created in R. I saved it as CSV and then read it with read.df to get a SparkDataFrame:

df <- read.df("text.example.csv", "csv", header=FALSE, inferSchema = "true")
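
Since I read the CSV with header = FALSE, the column gets an auto-generated name, while the spark.lda documentation says it looks for a column named "features" by default (either a character column of raw documents or a libSVM-style vector column). So I assume the text column has to be renamed first; a minimal sketch of what I mean (the "_c0" name is my assumption about what read.df generates for a headerless CSV):

df <- withColumnRenamed(df, "_c0", "features")   # rename the text column to the name spark.lda expects by default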

However, when I run spark.lda:

model <- spark.lda(df, k = 10, maxIter = 500, optimizer="online")

I get an error:

16/12/30 18:51:37 ERROR RBackendHandler: fit on org.apache.spark.ml.r.LDAWrapper failed
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

(followed by many more lines), which I guess is caused by the format of the input.
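
For reference, this is the smallest self-contained example of the input format I believe spark.lda expects (the toy documents are made up and only stand in for my cleaned 20 Newsgroups text):

library(SparkR)
sparkR.session()

# hypothetical toy documents in a character column named "features"
docs <- data.frame(features = c("space shuttle launch orbit nasa",
                                "hockey game season player team",
                                "graphics image file format viewer"),
                   stringsAsFactors = FALSE)
df <- createDataFrame(docs)

model <- spark.lda(df, k = 3, maxIter = 20, optimizer = "online")
summary(model)   # inspect the fitted topics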

Does anyone know how to successfully run LDA in SparkR?

  • Here is some example data: https://www.dropbox.com/s/9b29qe14otcjsx9/example.zip?dl=0 – Andres Dec 30 '16 at 18:07
  • I have not used spark.lda, but I wonder what your "text.example.csv" dataframe looks like. Usually for LDA you want to pass it a term-document matrix, not strings of text. Looking at `spark.lda`, I do not see any mention of tokenization, which makes me think you need a TDM. – emilliman5 Dec 30 '16 at 18:11
  • I think you are right, @emilliman5. However, I am still getting errors when trying to convert to a SparkDataFrame... Will post my code... – Andres Dec 30 '16 at 18:23

0 Answers