I'm using PySpark to query Elasticsearch and then generate JSON & pickle files. My Elasticsearch index sr-data-index has a field called word_embedding, which is of type dense_vector. I'm using the elasticsearch-spark connector and am able to query Elasticsearch. Here is my code:
es_reader = spark.read \
    .format("org.elasticsearch.spark.sql") \
    .option("inferSchema", "true") \
    .option("es.nodes", "********") \
    .option("es.port", "9200")

full_sr_data_df = es_reader \
    .option("es.mapping.date.rich", "false") \
    .load("sr-data-index").limit(10000)
However, the dense_vector field is not loaded; all other fields come through fine. So I tried fetching the vector values per document with the Elasticsearch Python client:
from elasticsearch import Elasticsearch
from pyspark.sql.functions import col

es = Elasticsearch(...)

def get_dense_vector_values(doc_id):
    query = {"query": {"terms": {"_id": [doc_id]}}}
    response = es.search(index="sr-data-index", body=query)
    dense_vector_value = response["hits"]["hits"][0]["_source"]["word_embedding"]
    return doc_id, dense_vector_value

dense_vector_values_df = full_sr_data_df.select(col("Incident_number")) \
    .rdd.map(lambda row: get_dense_vector_values(row["Incident_number"]))
But this throws TypeError: can't pickle _thread.lock objects. Trying the same logic as a UDF raises the same error.
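For context, here is a minimal reproduction of that error outside Spark. My understanding is that the Elasticsearch client holds connection-pool objects containing thread locks, and Spark pickles the closure of the mapped function (or UDF), so the captured es client drags a lock into the pickle:

```python
import pickle
import threading

# Spark serializes the function's closure with pickle. A captured
# Elasticsearch client contains lock objects, which are not picklable,
# just like a bare threading.Lock:
try:
    pickle.dumps(threading.Lock())
except TypeError as exc:
    print(f"TypeError: {exc}")
```

The exact wording of the message varies by Python version, but it is the same TypeError about a _thread.lock object.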
How do I load the word_embedding field?