
I am copying and pasting the exact Spark MLlib LDA example from here: http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda

I am trying the Scala sample code, but I get the following errors when I try to save and load the LDA model:

  1. on the line before the last line: value save is not a member of org.apache.spark.mllib.clustering.DistributedLDAModel
  2. on the last line: not found: value DistributedLDAModel

Here is the code. I am using SBT to create my Scala project skeleton and to load the libraries, then I import the project into Eclipse (Mars) for editing. I am using spark-core 1.5.0, spark-mllib 1.3.1, and Scala 2.11.7.
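
For reference, my build.sbt declares roughly the following (a sketch from memory; the exact file may differ slightly):

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.5.0",
  "org.apache.spark" %% "spark-mllib" % "1.3.1"
)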

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
import org.apache.spark.mllib.linalg.Vectors

object sample {
    def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("sample_SBT").setMaster("local[2]")
        val sc = new SparkContext(conf)
        // Load and parse the data
        val data = sc.textFile("data/mllib/sample_lda_data.txt")
        val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
        // Index documents with unique IDs
        val corpus = parsedData.zipWithIndex.map(_.swap).cache()

        // Cluster the documents into three topics using LDA
        val ldaModel = new LDA().setK(3).run(corpus)

        // Output topics. Each is a distribution over words (matching word count vectors)
        println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize + " words):")
        val topics = ldaModel.topicsMatrix
        for (topic <- Range(0, 3)) {
            print("Topic " + topic + ":")
            for (word <- Range(0, ldaModel.vocabSize)) { print(" " + topics(word, topic)); }
            println()
        }

        // Save and load model.
        ldaModel.save(sc, "myLDAModel")
        val sameModel = DistributedLDAModel.load(sc, "myLDAModel")
    }
}
Rami
  • Why are you intermixing spark-mllib version 1.3.1 with spark-core version 1.5.0? – Martin Senne Sep 17 '15 at 10:46
  • Because I am really a beginner :) I should read more about it, but I didn't know how to check the latest versions of both libraries, or that I have to use the same version number for both... sorry :) – Rami Sep 17 '15 at 11:00

2 Answers


First, the code compiles fine. Here is the setup I used:

./build.sbt

name := "SO_20150917"

version := "1.0"

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.apache.spark"     %% "spark-core"    % "1.5.0",
  "org.apache.spark"     %% "spark-mllib"   % "1.5.0"
)

./src/main/scala/somefun/

package somefun

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
import org.apache.spark.mllib.linalg.Vectors

object Example {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("sample_SBT").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Load and parse the data
    val data = sc.textFile("data/mllib/sample_lda_data.txt")
    val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
    // Index documents with unique IDs
    val corpus = parsedData.zipWithIndex.map(_.swap).cache()

    // Cluster the documents into three topics using LDA
    val ldaModel = new LDA().setK(3).run(corpus)

    // Output topics. Each is a distribution over words (matching word count vectors)
    println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize + " words):")
    val topics = ldaModel.topicsMatrix
    for (topic <- Range(0, 3)) {
      print("Topic " + topic + ":")
      for (word <- Range(0, ldaModel.vocabSize)) { print(" " + topics(word, topic)); }
      println()
    }

    // Save and load model.
    ldaModel.save(sc, "myLDAModel")
    val sameModel = DistributedLDAModel.load(sc, "myLDAModel")
  }
}

Running it via sbt run (of course) complains about the missing "data/mllib/sample_lda_data.txt":

[error] (run-main-0) org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/martin/IdeaProjects/SO_20150917/data/mllib/sample_lda_data.txt

@Rami: So please check your setup, in particular that spark-core and spark-mllib use the same Spark version; as far as I can tell, save and load for the MLlib LDA models only exist as of Spark 1.5.0, so spark-mllib 1.3.1 cannot provide them. The code itself is fine from my point of view.
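
If you want to run it without copying the file into the project, one option is to read it from an installed Spark distribution, which ships the file under data/mllib/ (a sketch, assuming the SPARK_HOME environment variable points at a Spark 1.5.0 distribution):

// read the sample data from the Spark distribution instead of the project directory
val sparkHome = sys.env.getOrElse("SPARK_HOME", sys.error("SPARK_HOME is not set"))
val data = sc.textFile(s"$sparkHome/data/mllib/sample_lda_data.txt")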

Martin Senne
  • Brilliant @Martin... actually, is there any way in SBT to ask for the latest versions of the libraries, to avoid messing up the version numbers? – Rami Sep 17 '15 at 10:53
  • 1. Check out http://www.scala-sbt.org/0.13/tutorial/Library-Dependencies.html#Ivy+revisions 2. See my additional answer :) – Martin Senne Sep 17 '15 at 10:57
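
For illustration, an Ivy dynamic revision in the sbt dependency declarations would look roughly like this (a sketch; pinning exact, matching versions is usually the safer choice for Spark):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.5.+",
  "org.apache.spark" %% "spark-mllib" % "1.5.+"
)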

As to @Rami's question:

Maybe this helps:

val sparkVersion = "1.5.0"

libraryDependencies ++= Seq(
  "org.apache.spark"     %% "spark-core"    % sparkVersion,
  "org.apache.spark"     %% "spark-mllib"   % sparkVersion
)
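
With the version defined once, both artifacts are guaranteed to stay in sync, and upgrading Spark later only means changing a single line.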
Martin Senne
  • Could you update your earlier, now-accepted answer with this and delete this one afterwards? – Jacek Laskowski Sep 17 '15 at 11:51
  • @Jacek: Not to argue, but: 1. I totally agree from a context point of view and I support clean answers! 2. From a reputation point of view, I disagree as it costs 10 rep. Something for meta? Or you upvote my answer after merge!? – Martin Senne Sep 17 '15 at 12:01