My table is stored in PySpark on Databricks. The table has two columns, id and text. I am trying to get a dense vector for the text column. I have an ML model that generates a dense representation of the text, which I want to store in a new column called dense_embedding. The model returns a numpy array for the input text; it works like this: model.encode(text_input). I want to use this model to generate the dense representation for every value in the text column.
Here is what I did:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import *
import pandas as pd

# Use pandas_udf to define a Pandas UDF
@pandas_udf('???', PandasUDFType.SCALAR)  # <-- what return type goes here?
def embedding(v):
    # input is text, output should be a dense vector
    return Vectors.dense(model.encode([v]))

small = small.withColumn('dense_embedding', embedding(small.text))
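For reference, a pandas scalar UDF is called once per Arrow batch with a pandas.Series of strings, and must return a pandas.Series of the same length. Below is a minimal sketch of that batch contract in plain pandas, outside Spark; FakeModel is a hypothetical stand-in for my real encoder, and in Spark the function would be decorated with pandas_udf:

```python
import pandas as pd

class FakeModel:
    """Hypothetical stand-in for the real model: encode() takes a list of
    strings and returns one fixed-size list of floats per string."""
    def encode(self, texts):
        return [[float(len(t)), float(t.count(" ") + 1)] for t in texts]

model = FakeModel()

def embedding_batch(texts: pd.Series) -> pd.Series:
    # Spark invokes a scalar pandas UDF once per batch, not once per row,
    # so the whole batch is encoded in one call and a same-length Series
    # of plain Python float lists is returned.
    vecs = model.encode(texts.tolist())
    return pd.Series([list(map(float, v)) for v in vecs])

batch = pd.Series(["hello world", "hi"])
out = embedding_batch(batch)
```

The question below is essentially which Spark SQL type string should describe the return value of such a function.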
I am not sure what data type I should put into the pandas_udf decorator. Also, is it correct to convert the numpy array to a dense vector the way I did?