I'm sure there is a quick fix here, but I'm having issues creating a UDF for basic vector operations on a PySpark DataFrame.
I have:
1. A DenseVector of 300 dimensions
2. A PySpark DataFrame with a column of 500K DenseVectors, each of 300 dimensions
In short, I want to find which row of the DataFrame (2) has the highest cosine similarity to the single DenseVector in question (1). After normalizing all of the vectors, I can achieve this by taking the dot product of the vector in question with each row and then returning the max.
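For reference, the math I'm relying on (a standalone NumPy sketch with made-up data, nothing Spark-specific): for unit-norm vectors, the dot product equals the cosine similarity, so the row with the largest dot product is the most similar one.

import numpy as np

# Made-up data: one 300-dim query vector and 5 candidate rows
query = np.random.rand(300)
query /= np.linalg.norm(query)                       # normalize the query
rows = np.random.rand(5, 300)
rows /= np.linalg.norm(rows, axis=1, keepdims=True)  # normalize each row

cos_sims = rows.dot(query)   # dot product per row == cosine similarity
best = np.argmax(cos_sims)   # index of the most similar row
print best, cos_sims[best]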
My code:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from pyspark.ml.linalg import DenseVector

value = df_other.select('vec_norm').collect()[0][0]  # pulling the query vector from another DF

def dot_product(vec):
    dot_value = value.dot(DenseVector(vec[3]))
    return dot_value

dot_product_udf = udf(dot_product, FloatType())
df_dot = df.withColumn('cos_dis', dot_product_udf(df['vec_norm']))
print df_dot.rdd.max(key=lambda x: x["cos_dis"])[0]
Error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
...
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/ml/linalg/__init__.py", line 402, in __len__
return len(self.array)
TypeError: len() of unsized object
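To narrow this down, here's what I think is happening (a local repro outside Spark, assuming the UDF receives the DenseVector itself, so vec[3] is a single scalar rather than a column):

from pyspark.ml.linalg import DenseVector
import numpy as np

v = DenseVector(np.arange(4.0))  # stand-in for one row's vector
scalar = v[3]                    # indexing a DenseVector returns a numpy.float64 scalar
broken = DenseVector(scalar)     # the scalar gets wrapped in a 0-d array
print len(broken)                # TypeError: len() of unsized object

If that's right, DenseVector(vec[3]) in my UDF builds an unsized vector, and value.dot() trips over it when it checks the lengths.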
If I try to calculate using numpy, I have a similar issue:
...
def dot_product(vec):
    # dot_value = value.dot(DenseVector(vec[3]))
    dot_value = sum(value * DenseVector(vec[3]))
    return dot_value
dot_product_udf = udf(dot_product, FloatType())
...
Error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 72.0 failed 1 times, most recent failure: Lost task 2.0 in stage 72.0 (TID 632, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
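My guess at this second failure: DenseVector arithmetic returns NumPy scalars, and Spark's pickler apparently refuses NumPy types for a UDF declared as FloatType(). If both hunches are right, something like this untested sketch (using the vector column directly and casting the result to a builtin float) is what I'm after:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def dot_product(vec):
    # assuming 'vec' is the DenseVector from the 'vec_norm' column;
    # value.dot(vec) returns numpy.float64, so cast to a plain Python
    # float before handing it back to Spark
    return float(value.dot(vec))

dot_product_udf = udf(dot_product, FloatType())
df_dot = df.withColumn('cos_dis', dot_product_udf(df['vec_norm']))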
I've used the following questions for troubleshooting so far, but can't resolve the issue (I'm guessing it's a vector type issue):
- Issue with UDF on a column of Vectors in PySpark DataFrame
- Spark Error:expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
- Spark __getnewargs__ error
- Error with "len() of unsized object"
Any help/advice super appreciated!
EDIT
Sample data:
> print type(value), len(value), value
<class 'pyspark.ml.linalg.DenseVector'> 300 [0.0667470050056,0.0439160518808...]
> df_value = df.select('vec_norm').collect()[0][0]
> print len(df_value), type(df_value)
300 <class 'pyspark.ml.linalg.DenseVector'>
> df.select('vec_norm').show(truncate=100)
+----------------------------------------------------------------------------------------------------+
| vec_norm|
+----------------------------------------------------------------------------------------------------+
|[-0.033044380089015266,0.09674768943906177,0.08259697668541087,0.04247286602516604,0.037005449248...|
|[-0.06890003507034705,0.06019625255379143,0.04288672222032615,-2.714061064477613E-4,0.02868655951...|