Apache Spark TFIDF using Python

Question

The Spark documentation says to use the HashingTF feature, but I'm unsure what input its transform function expects: http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf

I tried running the tutorial code:

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF

sc = SparkContext()

# Load documents (one per line).
documents = sc.textFile("...").map(lambda line: line.split(" "))

hashingTF = HashingTF()
tf = hashingTF.transform(documents)

but I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/salloumm/spark-1.6.0-bin-hadoop2.6/python/pyspark/ml/pipeline.py", line 114, in transform
    return self._transform(dataset)
  File "/Users/salloumm/spark-1.6.0-bin-hadoop2.6/python/pyspark/ml/wrapper.py", line 148, in _transform
    return DataFrame(self._java_obj.transform(dataset._jdf), dataset.sql_ctx)
AttributeError: 'list' object has no attribute '_jdf'

I tried the first Python example at http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf, using a simple text file as input. - user2388191, Apr 03 '16 at 04:20

zero323 · Answer 1 · 2016-04-03T13:49:50.180

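Based on the error you've shown, it is clear that you are not running the tutorial code or the code included in the question.

This error is the result of importing HashingTF from pyspark.ml.feature instead of pyspark.mllib.feature. The traceback goes through pyspark/ml/pipeline.py and pyspark/ml/wrapper.py, and the pyspark.ml transformers expect a DataFrame (hence the lookup of _jdf), whereas pyspark.mllib.feature.HashingTF.transform takes an RDD of term lists or a single list of terms. Clean your environment and make sure you use the correct imports.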
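To make the expected input concrete, here is a minimal pure-Python sketch of the hashing trick that a term-frequency hasher applies to a single document. It is an illustration only, not Spark's implementation: the names hashing_tf and NUM_FEATURES, and the use of CRC32 as the hash function, are assumptions chosen so the result is reproducible. The point is that each document is a list of terms, which is exactly what line.split(" ") produces in the question.

```python
import zlib
from collections import Counter

NUM_FEATURES = 1 << 20  # illustrative number of hash buckets (assumption)


def hashing_tf(terms, num_features=NUM_FEATURES):
    """Map one document (an iterable of string terms) to {bucket_index: count}.

    Hypothetical helper for illustration; Spark's own hashing may differ.
    """
    counts = Counter()
    for term in terms:
        # CRC32 is a stand-in hash so the bucket index is stable across runs.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(counts)


# One document = one list of terms, as produced by line.split(" ").
document = "spark makes tf idf with spark".split(" ")
tf = hashing_tf(document)
```

With the mllib import, the same shape of input (a list of terms, or an RDD of such lists) is what transform operates on; passing that list to the pyspark.ml class instead is what triggers the _jdf error.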

edited Apr 03 '16 at 13:49

answered Apr 03 '16 at 05:45