
I have code in R which needs to be scaled to handle big data. I am using Spark for this, and the package that seemed most convenient was sparklyr. However, I am unable to create a TermDocumentMatrix from a Spark dataframe. Any help would be great.

input_key is the dataframe having the following schema.

ID  Keywords
 1   A,B,C
 2   D,L,K
 3   P,O,L

My code in R was the following.

library(tm)  # provides Corpus, VectorSource and TermDocumentMatrix

mycorpus <- input_key

corpus <- Corpus(VectorSource(mycorpus$Keywords))

path_matrix <- TermDocumentMatrix(corpus)

1 Answer

Such a direct attempt won't work. sparklyr tables are just views of underlying JVM objects and are not compatible with generic R packages such as tm.

While there is some capability to invoke arbitrary R code through sparklyr::spark_apply, both the input and the output have to be data frames, so it is unlikely to translate to your particular use case.
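To illustrate the constraint, here is a minimal sketch of spark_apply, assuming a local Spark installation and that input_key is a plain R data frame. The function runs per partition and must return a data frame, which is why a sparse TermDocumentMatrix cannot come back through it:

```r
library(sparklyr)

# Assumes a local Spark install; master/table names are illustrative
sc <- spark_connect(master = "local")
input_tbl <- copy_to(sc, input_key, "input_key")

# spark_apply executes an R closure on each partition, but the input
# it receives and the value it returns must both be data frames.
result <- spark_apply(input_tbl, function(df) {
  # Derived column is fine; a matrix object would not be
  df$n_keywords <- lengths(strsplit(df$Keywords, ","))
  df
})
```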

If you are committed to using Spark / sparklyr, you should instead consider rewriting your pipeline using the built-in ML transformers, or third-party Spark packages such as the Spark CoreNLP interface or John Snow Labs Spark NLP.
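As a sketch of the ML-transformer route (connection details are illustrative), Spark's CountVectorizer is the Spark-native analogue of a document-term matrix, producing a sparse term-count vector per document:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
input_tbl <- copy_to(sc, input_key, "input_key")

# Split the comma-separated Keywords column into tokens, then count
# term occurrences per document with Spark's CountVectorizer.
tdm_tbl <- input_tbl %>%
  ft_regex_tokenizer(input_col = "Keywords",
                     output_col = "tokens",
                     pattern = ",") %>%
  ft_count_vectorizer(input_col = "tokens",
                      output_col = "term_counts")
```

The resulting term_counts column stays distributed on the cluster, unlike a tm matrix, which would have to be collected into local R memory.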