
I'm using PySpark to query Elasticsearch and then generate JSON and pickle files.

My Elasticsearch index sr-data-index has a field called word_embedding which is of type dense_vector. I'm using the elasticsearch-spark connector and am able to query Elasticsearch. Here is my code:

es_reader = spark.read \
    .format("org.elasticsearch.spark.sql") \
    .option("inferSchema", "true") \
    .option("es.nodes", "********") \
    .option("es.port", "9200")

full_sr_data_df = es_reader \
    .option("es.mapping.date.rich", "false") \
    .load("sr-data-index").limit(10000)

However, the dense_vector field is not loaded; all other fields work.

I tried fetching it with a function:

from elasticsearch import Elasticsearch

es = Elasticsearch("********:9200")

def get_dense_vector_values(doc_id):
    query = {"query": {"terms": {"_id": [doc_id]}}}
    response = es.search(index="sr-data-index", body=query)
    dense_vector_value = response["hits"]["hits"][0]["_source"]["word_embedding"]
    return doc_id, dense_vector_value

dense_vector_values_rdd = full_sr_data_df.select(col("Incident_number")).rdd \
    .map(lambda row: get_dense_vector_values(row["Incident_number"]))

But this throws "TypeError: can't pickle _thread.lock objects". I tried the same logic as a UDF and got the same error.
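As far as I understand, the error comes from Spark pickling everything the lambda closes over: the elasticsearch-py client holds network connections that contain thread locks, and pickle refuses to serialize locks. A minimal standalone demonstration of that underlying limitation:

```python
# Spark serializes the closure with pickle; a thread lock (which the
# Elasticsearch client holds internally) cannot be pickled.
import pickle
import threading

lock = threading.Lock()
try:
    pickle.dumps(lock)
except TypeError as err:
    print(type(err).__name__, err)
```

So any RDD map or UDF that references a client created on the driver will hit this, regardless of how the query itself is written.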

How do I load the word_embedding field ?
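For reference, one variant I'm considering is creating the client inside each partition with mapPartitions, so it is never pickled with the closure. A minimal sketch, assuming the elasticsearch-py client and the sr-data-index index above (the localhost URL and the injectable search parameter are my own placeholders):

```python
# Sketch: build the Elasticsearch client per partition instead of
# capturing a driver-side client in the closure.
def vector_lookup(doc_ids, search=None):
    """Yield (doc_id, word_embedding) pairs for an iterable of ids.

    `search` is injectable for testing; by default one real client is
    created per partition (assumed host shown).
    """
    if search is None:
        from elasticsearch import Elasticsearch  # imported on the executor
        es = Elasticsearch("http://localhost:9200")  # placeholder host
        search = lambda q: es.search(index="sr-data-index", body=q)
    for doc_id in doc_ids:
        resp = search({"query": {"terms": {"_id": [doc_id]}}})
        hits = resp["hits"]["hits"]
        if hits:
            yield doc_id, hits[0]["_source"]["word_embedding"]

# On the Spark side (not run here):
# pairs = (full_sr_data_df.select("Incident_number").rdd
#          .mapPartitions(lambda rows: vector_lookup(r["Incident_number"] for r in rows)))
```

This avoids the pickling error, but it still issues one search per document, so I'd prefer a way to make the connector load the field directly.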

Sowjanya R Bhat
