
My table is stored in PySpark on Databricks. The table has two columns, id and text. I am trying to compute a dense vector for the text column. I have an ML model that generates a dense representation of the text, which I want to store in a new column called dense_embedding. The model returns a NumPy array for an input text; it works like this: model.encode(text_input). I want to use this model to generate the dense representation for every row of the text column.
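For context, the encoder behaves roughly like this (a minimal stub standing in for the real model, which is my assumption; the real model's output dimension will differ):

```python
import numpy as np

# Hypothetical stand-in for the real model (assumption): encode() takes a
# list of strings and returns a 2-D NumPy array, one embedding row per input.
class StubModel:
    def encode(self, texts, dim=4):
        # Deterministic fake embeddings based on text length, for illustration
        return np.array([[len(t)] * dim for t in texts], dtype=np.float32)

model = StubModel()
emb = model.encode(["hello", "world!"])
print(emb.shape)  # (2, 4): one row per input text
```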

Here is what I did:

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import *
import pandas as pd

# Use pandas_udf to define a Pandas UDF
@pandas_udf('???', PandasUDFType.SCALAR)
# Input/output are text and dense vector
def embedding(v):
    return Vectors.dense(model.encode([v]))

small.withColumn('dense_embedding', embedding(small.text))

I am not sure what data type I should put into the pandas_udf decorator. Is it correct to convert the output to a dense vector the way I did?
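For what it's worth, one common pattern (a sketch, not tested on Databricks) is to have the UDF return an array of floats — ArrayType(FloatType()) — and to encode the whole batch at once, since a scalar pandas UDF receives a pandas Series and must return a pandas Series of the same length. The Spark-specific decorator is shown only in comments so the core logic below runs with plain pandas/NumPy; the stub model is my assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real encoder (assumption): one embedding
# row per input string, here of dimension 2.
class StubModel:
    def encode(self, texts):
        return np.array([[float(len(t)), 1.0] for t in texts], dtype=np.float32)

model = StubModel()

# Core batch logic a scalar pandas UDF would wrap. In Spark it would be
# decorated like:
#   @pandas_udf(ArrayType(FloatType()), PandasUDFType.SCALAR)
# Each element of the returned Series is a plain list of floats,
# not a Vectors.dense object.
def embedding(texts: pd.Series) -> pd.Series:
    arr = model.encode(texts.tolist())  # shape: (batch_size, dim)
    return pd.Series([[float(x) for x in row] for row in arr])

out = embedding(pd.Series(["id", "text column"]))
print(out[0])  # [2.0, 1.0]
```

If a pyspark.ml.linalg.DenseVector is really needed downstream (e.g. for MLlib estimators), the return type would instead be VectorUDT, with each element built via Vectors.dense — but the array-of-floats form is simpler when the column is only passed to other tools.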

HHKSHD_HH
  • try using `VectorUDT` (first do `from pyspark.ml.linalg import VectorUDT`) – pault Sep 19 '19 at 21:51
    Possible duplicate of [What Type should the dense vector be, when using UDF function in Pyspark?](https://stackoverflow.com/questions/49623620/what-type-should-the-dense-vector-be-when-using-udf-function-in-pyspark) – pault Sep 19 '19 at 21:52

0 Answers