
I am a newbie with both Spark and Python. I now have multiple vectors (Type A) in hand and am trying to compute their dot products with another single vector (Type B). To speed things up, I'd like to implement this with Python 3.4 on a Spark cluster, so that the dot product of each Type A vector with the Type B vector is computed on a different node. I have the code below:

import numpy as np
from pyspark import SparkContext

sc = SparkContext()

# Type A vectors
a = [[1, 2, 3], [4, 5, 6]]
# Type B vector
b = [7, 8, 9]

result = np.dot(sc.parallelize(a).collect(), b)

The code above does produce the correct answer, but my question is: does coding it this way fulfil my original expectation of distributing the work? If not, can anyone show me the correct approach?

Thanks in advance!

Buddhainside

1 Answer


Sorry dude, Spark is not doing anything in parallel there. NumPy is doing the dot product in the driver. The driver just sent your a matrix out onto the cluster with parallelize and then brought it straight back with collect, so all of the actual computation still happens locally.
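For your specific case, here is a minimal sketch of a genuinely distributed version: broadcast b to the executors once, then map np.dot over the distributed rows, so each dot product runs on the cluster and only the scalar results come back to the driver. (The variable names simply mirror those in your question.)

import numpy as np
from pyspark import SparkContext

sc = SparkContext()

a = [[1, 2, 3], [4, 5, 6]]   # Type A vectors
b = [7, 8, 9]                # Type B vector

# Ship b to every executor once instead of capturing it in each task
b_bc = sc.broadcast(np.array(b))

# Each row's dot product is computed on a worker; only the scalars
# are collected back to the driver
result = sc.parallelize(a).map(lambda v: float(np.dot(v, b_bc.value))).collect()
# result == [50.0, 122.0]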

For anything larger, what you want is a distributed matrix multiplication from MLlib. Maybe start looking here: simple matrix multiplication in Spark, or here: Spark MLib Matrix Multiplication.
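If you go the MLlib route, one possible sketch is to treat a as a distributed RowMatrix and b as a local 3x1 column matrix. Note this assumes RowMatrix.multiply is available in your Python API, which is only the case in newer Spark releases, so check your version first:

from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import RowMatrix

# a as a distributed matrix, one row per Type A vector
mat = RowMatrix(sc.parallelize(a))

# b as a local 3x1 column matrix (values are given column-major)
b_col = Matrices.dense(3, 1, [7, 8, 9])

# Distributed matrix-vector product; each resulting row holds one dot product
product = mat.multiply(b_col)
print(product.rows.collect())   # [DenseVector([50.0]), DenseVector([122.0])]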

Alister Lee