I have a PySpark DataFrame with data shaped like the following (data made up):
Dataframe
I would like to calculate various distance metrics (such as cosine and Euclidean) between the two vectors, vec1 and vec2, for each id in the DataFrame, where element a of vec1 corresponds to element a of vec2, and so on. Each vector's elements are stored in separate columns of the DataFrame. I'd like the result to be an additional column for the given distance metric, as shown in the image ('euclidean_vec1_vec2').
So far I have the following:
import pandas as pd
from scipy.spatial import distance
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
data = {'id': ['a', 'b', 'c', 'd'],
'vec1a': [0.4, 1.1, 2.3, 3.0], 'vec1b': [1.4, 2.3, 3.9, 4.7], 'vec1c': [0.8, 1.3, 2.2, 3.5],
'vec2a': [0.0, 1.5, 2.7, 3.1], 'vec2b': [0.6, 1.3, 2.0, 3.8], 'vec2c': [0.4, 1.4, 2.6, 3.2]}
df = spark.createDataFrame(pd.DataFrame(data))
vec1 = df.select(F.array(*['vec1' + l for l in ['a', 'b', 'c']]))
vec2 = df.select(F.array(*['vec2' + l for l in ['a', 'b', 'c']]))
distance_udf = F.udf(lambda vec1, vec2: float(distance.euclidean(vec1, vec2)), FloatType())
df = df.withColumn('euclidean_vec1_vec2', distance_udf(vec1, vec2))
However, this results in an error: "TypeError: Invalid argument, not a string or column: DataFrame[array(vec1a, vec1b, vec1c): array<double>] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function"
UPDATED: I resolved this by removing the df.select calls and passing the F.array(...) column expressions directly to the UDF.
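For anyone hitting the same error, here is a minimal sketch of that fix. The helper names (euclidean, cosine, add_distance_column) and the suffixes parameter are made up for illustration and are not part of PySpark. The metrics are written in plain Python rather than with scipy so the sketch has no dependency beyond PySpark; scipy's distance.euclidean and distance.cosine should work in their place. It also assumes DoubleType instead of FloatType, to avoid rounding the result to 32 bits.

```python
import math


def euclidean(v1, v2):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(v1, v2)))


def cosine(v1, v2):
    """Cosine distance (1 - cosine similarity); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    return 1.0 - dot / (norm1 * norm2)


def add_distance_column(df, metric, name, suffixes=("a", "b", "c")):
    """Return df with an extra column `name` holding metric(vec1, vec2) per row."""
    # PySpark is imported here so the pure-Python metrics above stay usable without it.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # F.array(...) yields a Column expression, which is what a UDF call expects.
    # Wrapping it in df.select(...) yields a DataFrame and triggers the TypeError.
    vec1 = F.array(*["vec1" + s for s in suffixes])
    vec2 = F.array(*["vec2" + s for s in suffixes])

    metric_udf = F.udf(lambda a, b: float(metric(a, b)), DoubleType())
    return df.withColumn(name, metric_udf(vec1, vec2))
```

With the df from the question, this would be called as df = add_distance_column(df, euclidean, 'euclidean_vec1_vec2'), and again with cosine and a different column name for the cosine distance.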