How to filter PySpark SQL dataframe read from Elasticsearch by metadata field (by _id for example)?

Asked Jun 05 '19 at 09:17

Active Jun 05 '19 at 09:17

Viewed 141 times

I am reading a PySpark SQL DataFrame from an Elasticsearch index with the read option es.read.metadata=True. I want to filter the data by a condition on a metadata field, but I get an empty result even though matching rows exist. Is it possible to get the actual result?

I do get a result when I call limit on the DataFrame first, even with a very large number, larger than the DataFrame size.

I also get a result when I filter on a field that is not under _metadata.

For example:

df.where(df._metadata._score > 1.0).select(df._metadata._id).show()

The result is empty:

+--------------+
|_metadata[_id]|
+--------------+
+--------------+

But when using limit:

df.limit(1000000).where(df._metadata._score > 1.0).select(df._metadata._id).show()

The result is not empty:

+--------------------+
|      _metadata[_id]|
+--------------------+
|cICqm2gBHl8Vy6RZyu_L|
+--------------------+
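
One thing that may be worth trying for the _id case is to push the condition to Elasticsearch at read time through the connector's es.query option, instead of filtering the DataFrame afterwards. The following is only a minimal sketch under assumptions: the helper name build_es_read_options and the index name "my-index" are hypothetical, and I have not confirmed that this avoids the empty result described above.

```python
import json


def build_es_read_options(doc_ids):
    """Build reader options for the elasticsearch-hadoop Spark SQL source.

    Hypothetical helper: the _id condition is expressed as an Elasticsearch
    `ids` query and passed through `es.query`, so the selection happens on
    the Elasticsearch side rather than in a DataFrame filter.
    """
    query = {"query": {"ids": {"values": list(doc_ids)}}}
    return {
        # Keep the _metadata column (with _id, _score, ...) in the result.
        "es.read.metadata": "true",
        # Query DSL sent to Elasticsearch, serialized as a JSON string.
        "es.query": json.dumps(query),
    }


options = build_es_read_options(["cICqm2gBHl8Vy6RZyu_L"])

# Assumed usage with an existing SparkSession named `spark` and an index
# named "my-index" (both are placeholders, not taken from the question):
# df = (spark.read
#           .format("org.elasticsearch.spark.sql")
#           .options(**options)
#           .load("my-index"))
# df.select(df._metadata._id).show()
```

This only covers selecting by _id; a condition on _score would need a different query body.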

David206

0 Answers