I'm sure there is a quick fix here, but I'm having issues creating a UDF for basic vector operations on a PySpark DataFrame.
I have:
1. A DenseVector of 300 dimensions
2. A PySpark DataFrame with a column of 500K DenseVectors, each of 300 dimensions
In short, I want to find which row of the DataFrame (2) has the highest cosine similarity to the single DenseVector in question (1). After normalizing all of the vectors, I can achieve this by taking the dot product of the vector in question with each row and then returning the max.
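For reference, the math I'm relying on (a standalone NumPy sketch with made-up data, nothing Spark-specific): for unit-norm vectors, the dot product equals the cosine similarity, so the row with the largest dot product is the most similar one.

import numpy as np

# Made-up data: one 300-dim query vector and 5 candidate rows
query = np.random.rand(300)
query /= np.linalg.norm(query)                       # normalize the query
rows = np.random.rand(5, 300)
rows /= np.linalg.norm(rows, axis=1, keepdims=True)  # normalize each row

cos_sims = rows.dot(query)   # dot product per row == cosine similarity
best = np.argmax(cos_sims)   # index of the most similar row
print best, cos_sims[best]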
My code:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from pyspark.ml.linalg import DenseVector

value = df_other.select('vec_norm').collect()[0][0]  # pulling the query vector from another DF

def dot_product(vec):
    dot_value = value.dot(DenseVector(vec[3]))
    return dot_value

dot_product_udf = udf(dot_product, FloatType())
df_dot = df.withColumn('cos_dis', dot_product_udf(df['vec_norm']))
print df_dot.rdd.max(key=lambda x: x["cos_dis"])[0]
Error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
...
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/ml/linalg/__init__.py", line 402, in __len__
return len(self.array)
TypeError: len() of unsized object
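To narrow this down, here's what I think is happening (a local repro outside Spark, assuming the UDF receives the DenseVector itself, so vec[3] is a single scalar rather than a column):

from pyspark.ml.linalg import DenseVector
import numpy as np

v = DenseVector(np.arange(4.0))  # stand-in for one row's vector
scalar = v[3]                    # indexing a DenseVector returns a numpy.float64 scalar
broken = DenseVector(scalar)     # the scalar gets wrapped in a 0-d array
print len(broken)                # TypeError: len() of unsized object

If that's right, DenseVector(vec[3]) in my UDF builds an unsized vector, and value.dot() trips over it when it checks the lengths.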
If I try to calculate using numpy, I have a similar issue:
...
def dot_product(vec):
    # dot_value = value.dot(DenseVector(vec[3]))
    dot_value = sum(value * DenseVector(vec[3]))
    return dot_value
dot_product_udf = udf(dot_product, FloatType())
...
Error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 72.0 failed 1 times, most recent failure: Lost task 2.0 in stage 72.0 (TID 632, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
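My guess at this second failure: DenseVector arithmetic returns NumPy scalars, and Spark's pickler apparently refuses NumPy types for a UDF declared as FloatType(). If both hunches are right, something like this untested sketch (using the vector column directly and casting the result to a builtin float) is what I'm after:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def dot_product(vec):
    # assuming 'vec' is the DenseVector from the 'vec_norm' column;
    # value.dot(vec) returns numpy.float64, so cast to a plain Python
    # float before handing it back to Spark
    return float(value.dot(vec))

dot_product_udf = udf(dot_product, FloatType())
df_dot = df.withColumn('cos_dis', dot_product_udf(df['vec_norm']))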
I've used the following questions for troubleshooting so far, but can't resolve the issue (I'm guessing it's a vector type issue):
- Issue with UDF on a column of Vectors in PySpark DataFrame
- Spark Error:expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
- Spark __getnewargs__ error
- Error with "len() of unsized object"
Any help/advice super appreciated!
EDIT
Sample data:
> print type(value), len(value), value
<class 'pyspark.ml.linalg.DenseVector'> 300 [0.0667470050056,0.0439160518808...]
> df_value = df.select('vec_norm').collect()[0][0]
> print len(df_value), type(df_value)
300 <class 'pyspark.ml.linalg.DenseVector'>
> df.select('vec_norm').show(truncate=100)
+----------------------------------------------------------------------------------------------------+
| vec_norm|
+----------------------------------------------------------------------------------------------------+
|[-0.033044380089015266,0.09674768943906177,0.08259697668541087,0.04247286602516604,0.037005449248...|
|[-0.06890003507034705,0.06019625255379143,0.04288672222032615,-2.714061064477613E-4,0.02868655951...|