I am trying to write a PySpark DataFrame to an Elasticsearch instance running in Docker, but I cannot connect to it using elasticsearch-hadoop. When I try to save the DataFrame, I get an error that Elasticsearch could not be found. I suspect this is security related, since the instance is secured by default as of Elasticsearch 8.
I set up the single node Elasticsearch instance on Docker by following the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html
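For completeness, this is roughly what I ran to start the container and copy out the CA certificate, following those docs (the exact image tag may differ from what I used):

docker network create elastic
docker run --name es01 --net elastic -p 9200:9200 -it docker.elastic.co/elasticsearch/elasticsearch:8.5.0
docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .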
I confirmed the cluster is healthy with curl --cacert http_ca.crt -u elastic https://localhost:9200/_cluster/health.
I am running PySpark 3.3.1 with the elasticsearch-spark-30_2.12 jar, against Elasticsearch 8.5.
I tried to use basic auth with a username and password, as shown below. My expectation is that I need to specify the correct Elasticsearch security parameters, but I have been unsuccessful so far. I have been reading through the configuration docs to try to figure out what to do: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
df.write \
    .format("es") \
    .option("es.nodes.wan.only", "true") \
    .option("es.net.http.auth.user", "elastic") \
    .option("es.net.http.auth.pass", "...") \
    .option("es.nodes", "localhost") \
    .option("es.port", "9200") \
    .option("es.resource", "spark/test") \
    .save()
When running the above, I get this error in logging output:
22/12/21 16:07:02 ERROR NetworkClient: Node [localhost:9200] failed (org.elasticsearch.hadoop.thirdparty.apache.commons.httpclient.NoHttpResponseException: The server localhost failed to respond); no other nodes left - aborting...
I also get this error from py4j:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
I suspect I may need to use es.net.ssl.keystore instead, but I don't know much about this.
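Based on the SSL section of the configuration docs, here is what I was planning to try next. This is only an untested sketch: the truststore path and password are placeholders, and I am assuming the container's http_ca.crt can first be imported into a JKS truststore with keytool (something like keytool -import -alias es-http-ca -file http_ca.crt -keystore truststore.jks -storepass changeit). From my reading of the docs, es.net.ssl.keystore.* is for client authentication, while the es.net.ssl.truststore.* settings are what tell the connector to trust the server's CA cert, but I am not sure I am reading that right.

# Untested sketch: same write as above, but over HTTPS, trusting the
# CA cert from the Docker container via a local JKS truststore.
# "/path/to/truststore.jks" and "changeit" are placeholders, not real values.
df.write \
    .format("es") \
    .option("es.nodes.wan.only", "true") \
    .option("es.net.http.auth.user", "elastic") \
    .option("es.net.http.auth.pass", "...") \
    .option("es.nodes", "https://localhost") \
    .option("es.port", "9200") \
    .option("es.net.ssl", "true") \
    .option("es.net.ssl.truststore.location", "file:///path/to/truststore.jks") \
    .option("es.net.ssl.truststore.pass", "changeit") \
    .option("es.resource", "spark/test") \
    .save()

Any help is appreciated.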