I have two DataFrames. For simplicity, assume each has only one entry:

+--------------------+                                                          
|        entry       |    
+--------------------+
|[0.34, 0.56, 0.87]  |
+--------------------+

+--------------------+                                                          
|        entry       |    
+--------------------+
|[0.12, 0.82, 0.98]  |
+--------------------+

How can I compute the Euclidean distance between the entries of these two DataFrames? Right now I have the following code:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from scipy.spatial import distance

inference = udf(lambda x, y: float(distance.euclidean(x, y)), DoubleType())

inference_result = inference(a, b)

but I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/udf.py", line 197, in wrapper
    return self(*args)
  File "/usr/lib/spark/python/pyspark/sql/udf.py", line 177, in __call__
    return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
  File "/usr/lib/spark/python/pyspark/sql/column.py", line 68, in _to_seq
    cols = [converter(c) for c in cols]
  File "/usr/lib/spark/python/pyspark/sql/column.py", line 68, in <listcomp>
    cols = [converter(c) for c in cols]
  File "/usr/lib/spark/python/pyspark/sql/column.py", line 56, in _to_java_column
    "function.".format(col, type(col)))
TypeError: Invalid argument, not a string or column: DataFrame[embedding: array<float>] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
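
The TypeError says the UDF was called with whole DataFrames, while a UDF expects Columns (or column names) of a single DataFrame. One possible way around this, as a rough sketch only: bring both arrays into the same DataFrame first, then apply the UDF to the two columns. The names `euclidean`, `pairwise_distance`, `entry_a`, `entry_b` and `distance` are made up for illustration, and the column name `entry` is assumed from the tables above (the traceback suggests the real column may be called `embedding`). A plain-Python distance replaces the scipy call here only to keep the sketch dependency-free.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def pairwise_distance(a, b, col="entry"):
    """Return a DataFrame with one distance per (row of a, row of b) pair.

    Assumption: `a` and `b` each have an array column named `col`.
    """
    # Imported here so that `euclidean` stays usable without a Spark install.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    distance_udf = udf(euclidean, DoubleType())

    # A UDF takes Columns, so first bring both arrays into the same DataFrame.
    # Renaming avoids two columns with the same name after the join.
    pairs = a.withColumnRenamed(col, "entry_a").crossJoin(
        b.withColumnRenamed(col, "entry_b")
    )
    return pairs.withColumn("distance", distance_udf("entry_a", "entry_b"))
```

With the one-row DataFrames above, `pairwise_distance(a, b).show()` should give a single row with a distance of roughly 0.358. A cross join pairs every row of `a` with every row of `b`, so for larger inputs a join on a shared key column is probably a better fit.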
A.M.
