I'm trying to fine-tune an LLM (BERT, or pretrained embeddings such as GloVe) on a text column for text classification.
I'm using Spark NLP for preprocessing and creating the embeddings, and PySpark (Spark ML) for the machine learning part.
I now have the embeddings (produced with Spark NLP), and I want to convert the output to dense vectors, because the output of the Spark NLP EmbeddingsFinisher annotator appears to be sparse vectors.
I have Scala code for this purpose, which is as follows:
    val resultWithSize = result.selectExpr("explode(finished_sentence_embeddings)")
      .map { row =>
        val vector = row.getAs[org.apache.spark.ml.linalg.DenseVector](0)
        (vector.size, vector)
      }.toDF("size", "vector")
I wrote this code mostly with ChatGPT, as I'm not familiar with Spark UDTs (I didn't include the size, since I don't need it). However, the code I came up with throws various errors, including TypeErrors and ValueErrors. In the code, input_df is the DataFrame that contains the finished_sentence_embeddings column (sparse vectors, the output of the Spark NLP EmbeddingsFinisher annotator). Here is my code:
    from pyspark.sql.functions import col, explode
    from pyspark.ml.linalg import DenseVector

    def to_dense_vector(vector):
        return DenseVector(vector)

    def process_vectors(input_df):
        dense_vectors = input_df.select(explode("finished_sentence_embeddings").alias("vector"))
        for i in range(len(dense_vectors.first().vector)):
            dense_vectors = dense_vectors.withColumn(f"element_{i}", to_dense_vector(col("vector")[i]))
        return dense_vectors
Now, I'd appreciate it if you could correct my code or let me know what's wrong with my translation. (I know I could use a lambda for the first function, to_dense_vector.)
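For what it's worth, here is a minimal sketch of the direction I suspect is correct from my reading of the PySpark docs: wrapping the conversion in a udf that declares VectorUDT as its return type, so the result column is a proper Vector column. The names to_dense and dense_df are just mine; I'm not sure this is the right or idiomatic approach, which is partly why I'm asking:

    from pyspark.sql.functions import explode, udf
    from pyspark.ml.linalg import Vectors, VectorUDT

    # A UDF is needed so Spark knows the output column holds Vector values;
    # calling a plain Python function on a Column (as in my code above) fails.
    to_dense = udf(lambda v: Vectors.dense(v.toArray()), VectorUDT())

    dense_df = (
        input_df
        # one row per sentence embedding
        .select(explode("finished_sentence_embeddings").alias("vector"))
        # convert each (possibly sparse) vector to a DenseVector
        .withColumn("dense_vector", to_dense("vector"))
    )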