
I'm trying to implement LDA using Spark and got this error. I'm totally new to Spark, so any help is appreciated.

[root@sandbox ~]# spark-submit ./lda.py
Traceback (most recent call last):
  File "/root/./lda.py", line 3, in <module>
    from pyspark.mllib.clustering import LDA, LDAModel
ImportError: cannot import name LDA

Here is the code:

from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
import numpy
sc = SparkContext(appName="PythonLDA")
data = sc.textFile("/tutorial/input/askreddit20150801.txt")
parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))
# Index documents with unique IDs
corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

# Cluster the documents into three topics using LDA
ldaModel = LDA.train(corpus, k=3)

# Output topics. Each is a distribution over words (matching word count vectors)
print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):")
topics = ldaModel.topicsMatrix()
for topic in range(3):
    print("Topic " + str(topic) + ":")
    for word in range(0, ldaModel.vocabSize()):
        print(" " + str(topics[word][topic]))

# Save and load model
ldaModel.save(sc, "myModelPath")
sameModel = LDAModel.load(sc, "myModelPath")
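(For reference, a plain-Python sketch of what the `zipWithIndex` + `map` steps above produce, no Spark required: each parsed word-count vector is paired with a unique document id, giving `[id, vector]` pairs. The sample vectors are made up for illustration.)

```python
# Plain-Python sketch of the document-indexing step: each parsed
# word-count vector gets a unique id, yielding [id, vector] pairs.
parsed = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]  # stand-in for parsedData
corpus = [[i, vec] for i, vec in enumerate(parsed)]
print(corpus)  # [[0, [1.0, 2.0, 0.0]], [1, [0.0, 1.0, 3.0]]]
```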

I also tried to install pyspark.mllib.clustering with pip:

[root@sandbox ~]# pip install spark.mllib.clustering
Collecting spark.mllib.clustering
/usr/lib/python2.6/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
  Could not find a version that satisfies the requirement spark.mllib.clustering (from versions: )
No matching distribution found for spark.mllib.clustering

1 Answer

The PySpark wrapper for LDA was introduced in Spark 1.5.0. Assuming your installation isn't corrupted, you are probably running Spark <= 1.4.x.
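At runtime you can check the version with `sc.version` on your `SparkContext`. As a minimal sketch of the requirement (the helper name and the version-string parsing are my own, not part of any Spark API):

```python
# Hedged sketch: the Python LDA wrapper appeared in Spark 1.5.0,
# so any earlier version raises the ImportError from the question.
def has_python_lda(spark_version):
    """Return True if this Spark version ships pyspark.mllib.clustering.LDA."""
    major, minor = (int(p) for p in spark_version.split(".")[:2])
    return (major, minor) >= (1, 5)

print(has_python_lda("1.3.1"))  # False -- the sandbox in the question
print(has_python_lda("1.5.0"))  # True
```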

  • So I should update my Spark to 1.5.0? How do I do that? I'm running it on an EC2 by following these instructions: https://d396qusza40orc.cloudfront.net/cloudapplications/tutorials/week1/W1_AWS%20Tutorial.pdf and https://d396qusza40orc.cloudfront.net/cloudapplications/tutorials/week2/W2_SparkMR%20Tutorial.pdf – user1569341 Nov 04 '15 at 23:28
  • If you want to use LDA with PySpark, yes. But to be honest the Python interface is rather limited; I wouldn't bother with it for now. – zero323 Nov 04 '15 at 23:47
  • Hmm ok thanks for the information. I should use Java then? Can LDA work in Spark 1.3.1 for Java? – user1569341 Nov 05 '15 at 02:09
  • Scala (or Java) if you want to use 1.3, but honestly if all you want is to learn/test, I would recommend 1.5. In standalone mode Spark requires very little configuration to work on small or medium datasets. – zero323 Nov 05 '15 at 02:19
  • Actually I need to analyze a quite large data set, which is why I had to switch over to Spark. I guess with Spark 1.3 I'm going to switch to Java. – user1569341 Nov 05 '15 at 02:28