I'm trying to fine-tune an LLM (BERT, or pretrained embeddings such as GloVe) on a text column for text classification.
I'm using Spark NLP for preprocessing and creating the embeddings, and PySpark (Spark ML) for the machine learning part.
I now have the embeddings (produced with Spark NLP), and I want to convert the output to dense vectors, because the output of the Spark NLP EmbeddingsFinisher annotator appears to be sparse vectors.
I have Scala code for this purpose, which is as follows:
    val resultWithSize = result.selectExpr("explode(finished_sentence_embeddings)")
      .map { row =>
        val vector = row.getAs[org.apache.spark.ml.linalg.DenseVector](0)
        (vector.size, vector)
      }.toDF("size", "vector")
I wrote this code mostly with ChatGPT, as I'm not familiar with Spark UDTs (I didn't include the size, since I don't need it). However, the code I came up with throws various errors, including TypeErrors and ValueErrors. In the code, input_df is the DataFrame that contains the finished_sentence_embeddings column (sparse vectors, the output of the Spark NLP EmbeddingsFinisher annotator). Here is my code:
    from pyspark.sql.functions import col, explode
    from pyspark.ml.linalg import DenseVector

    def to_dense_vector(vector):
        return DenseVector(vector)

    def process_vectors(input_df):
        dense_vectors = input_df.select(explode("finished_sentence_embeddings").alias("vector"))
        for i in range(len(dense_vectors.first().vector)):
            dense_vectors = dense_vectors.withColumn(f"element_{i}", to_dense_vector(col("vector")[i]))
        return dense_vectors
Now, I'd appreciate it if you could correct my code or let me know what's wrong with my translation. (I know I could use a lambda for the first function, to_dense_vector.)
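For what it's worth, here is a minimal sketch of the direction I suspect is correct from my reading of the PySpark docs: wrapping the conversion in a udf that declares VectorUDT as its return type, so the result column is a proper Vector column. The names to_dense and dense_df are just mine; I'm not sure this is the right or idiomatic approach, which is partly why I'm asking:

    from pyspark.sql.functions import explode, udf
    from pyspark.ml.linalg import Vectors, VectorUDT

    # A UDF is needed so Spark knows the output column holds Vector values;
    # calling a plain Python function on a Column (as in my code above) fails.
    to_dense = udf(lambda v: Vectors.dense(v.toArray()), VectorUDT())

    dense_df = (
        input_df
        # one row per sentence embedding
        .select(explode("finished_sentence_embeddings").alias("vector"))
        # convert each (possibly sparse) vector to a DenseVector
        .withColumn("dense_vector", to_dense("vector"))
    )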