
I am trying to run k-means using Spark MLlib, but I am getting an index out of range error.

I've split my very small sample input file, and the output looks like this (the split step itself is sketched just after):

['hello', 'world', 'this', 'is', 'earth']
['what', 'are', 'you', 'trying', 'to', 'do']
['trying', 'to', 'learn', 'something']
['I', 'am', 'new', 'at', 'this', 'thing']
['what', 'about', 'you']
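
For context, the split was done along these lines (the file name "sample.txt" is just a placeholder, and sc is the usual SparkContext):

# Load the raw text file and tokenize each line on whitespace
documents = sc.textFile("sample.txt").map(lambda line: line.split(" "))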

Now I'm using the TF-IDF code provided by Spark to get a sparse representation (sketched after the output below). The output is:

(1048576,[50570,432125,629096,921177,928731],[1.09861228867,1.09861228867,0.69314718056,1.09861228867,1.09861228867])
(1048576,[110522,521365,697409,725041,749730,962395],[0.69314718056,1.09861228867,1.09861228867,0.69314718056,0.69314718056,0.69314718056])
(1048576,[4471,725041,850325,962395],[1.09861228867,0.69314718056,1.09861228867,0.69314718056])
(1048576,[36748,36757,84721,167368,629096,704697],[1.09861228867,1.09861228867,1.09861228867,1.09861228867,0.69314718056,1.09861228867])
(1048576,[110522,220898,749730],[0.69314718056,1.09861228867,0.69314718056])
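
For reference, these sparse vectors come from the standard MLlib TF-IDF recipe, roughly like this (the default HashingTF feature space of 2^20 = 1048576 buckets matches the first element of each vector above):

from pyspark.mllib.feature import HashingTF, IDF

# Hash each token list into a 1048576-dimensional feature space (the default)
hashingTF = HashingTF()
tf = hashingTF.transform(documents)
tf.cache()

# Fit an IDF model on the term frequencies and rescale them
idf = IDF().fit(tf)
tfidf_vectors = idf.transform(tf)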

Now I am running the k-means algorithm from Spark MLlib:

from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel

# Cluster the TF-IDF vectors into 2 clusters
clusters = KMeans.train(tfidf_vectors, 2, maxIterations=10)

def error(point):
    # Euclidean distance from a point to its nearest cluster center
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = tfidf_vectors.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

# Save and reload the model
clusters.save(sc, "myModelPath")
sameModel = KMeansModel.load(sc, "myModelPath")

But I am getting an index out of range error at the WSSSE step. What am I doing wrong?

Nicky
  • What does your output for `clusters` look like? – Rohan Aletty Oct 07 '15 at 05:26
  • I am not sure how to look at the output. I have a bunch of files in the myModelPath folder, which was created after running the program. If you can tell me which file, I can get back to you. And clusters is not iterable, so I couldn't print it. – Nicky Oct 08 '15 at 03:46

1 Answer


I encountered a similar problem today, and it looks like a bug. TF-IDF creates SparseVectors like this:

>>> from pyspark.mllib.linalg import Vectors
>>> sv = Vectors.sparse(5, {1: 3})

and accessing a value at an index larger than the index of the last non-zero entry raises an exception:

>>> sv[0]
0.0
>>> sv[1]
3.0
>>> sv[2]
Traceback (most recent call last):
...
IndexError: index out of bounds
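
Converting to a dense representation first sidesteps the bug, since indexing a NumPy array works for every position:

>>> sv.toArray()[2]
0.0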

A quick, although not very efficient, workaround is to convert the SparseVector to a NumPy array:

def error(point):
    center = clusters.centers[clusters.predict(point)]
    # toArray() produces a dense NumPy vector, so the subtraction
    # and iteration cover every dimension without sparse indexing
    return sqrt(sum([x**2 for x in (point.toArray() - center)]))
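
As a side note, if your Spark version is 1.4 or later, KMeansModel.computeCost can calculate the cost directly (note that it returns the sum of squared distances, whereas the snippet above sums plain Euclidean distances):

# Sum of squared distances of points to their nearest center
WSSSE = clusters.computeCost(tfidf_vectors)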
zero323