
My end goal is to insert data from HDFS into Elasticsearch, but the issue I am facing is connectivity.

I am able to connect to my Elasticsearch node using the curl command below:

curl -u username -X GET 'https://xx.xxx.xx.xxx:9200/_cat/indices?v' --insecure
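
For context, --insecure just tells curl to skip TLS certificate verification. A rough Python analogue of the same check (host and credentials are placeholders) would be:

    import requests

    # verify=False is the requests equivalent of curl --insecure:
    # it disables TLS certificate verification.
    resp = requests.get(
        "https://xx.xxx.xx.xxx:9200/_cat/indices?v",
        auth=("username", "password"),
        verify=False,
    )
    print(resp.status_code, resp.text)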

When it comes to connecting with Spark, however, I am unable to do so. My command to insert the data is:

    df.write.mode("append").format('org.elasticsearch.spark.sql') \
        .option("es.net.http.auth.user", "username") \
        .option("es.net.http.auth.pass", "password") \
        .option("es.index.auto.create", "true") \
        .option('es.nodes', 'https://xx.xxx.xx.xxx') \
        .option('es.port', '9200') \
        .save('my-index/my-doctype')

The error I am getting is:

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
....
....
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings) - all nodes failed; tried [[xx.xxx.xx.xxx:9200]]
....
...

So, what would be the PySpark equivalent of curl's --insecure flag?

Thanks

Ayush Goyal

3 Answers


After many attempts and different config options, I found a way to connect to Elasticsearch running on HTTPS insecurely:

    dfToEs.write.mode("append").format('org.elasticsearch.spark.sql') \
        .option("es.net.http.auth.user", username) \
        .option("es.net.http.auth.pass", password) \
        .option("es.net.ssl", "true") \
        .option("es.net.ssl.cert.allow.self.signed", "true") \
        .option("mergeSchema", "true") \
        .option('es.index.auto.create', 'true') \
        .option('es.nodes', 'https://{}'.format(es_ip)) \
        .option('es.port', '9200') \
        .option('es.batch.write.retry.wait', '100s') \
        .save('{index}/_doc'.format(index=index))

Along with

    (es.net.ssl, true)

we also have to allow the self-signed certificate, like below:

    (es.net.ssl.cert.allow.self.signed, true)
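
For a quick sanity check before writing, a minimal read with the same SSL options can be used; this is an untested sketch that assumes the same es_ip, username, password, and index variables as above:

    dfFromEs = spark.read.format('org.elasticsearch.spark.sql') \
        .option("es.net.http.auth.user", username) \
        .option("es.net.http.auth.pass", password) \
        .option("es.net.ssl", "true") \
        .option("es.net.ssl.cert.allow.self.signed", "true") \
        .option('es.nodes', 'https://{}'.format(es_ip)) \
        .option('es.port', '9200') \
        .load('{index}/_doc'.format(index=index))
    dfFromEs.show(5)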
Ayush Goyal

I checked a lot of things and finally I can write to the AWS Elasticsearch Service (ES), but with Scala/Spark.

  1. In a VPC, create security groups that allow access from EMR to ES on port 443 (an inbound rule in ES for the EMR security group, and an inbound rule in EMR on the same port).
  2. Check connectivity from the EMR master node with a telnet command:
    telnet xyz.eu-west-1.es.amazonaws.com 443
  3. Once the above checks out, check at the application level with a curl command:

    curl 'https://xyz.eu-west-1.es.amazonaws.com:443/domainname/_search?pretty=true&q=*'

  4. Then move on to the code. In my case I tested with spark-shell, with the server confs included at startup like this:

     spark-shell --jars elasticsearch-spark-20_2.11-7.1.1.jar --conf spark.es.nodes="xyz.eu-west-1.es.amazonaws.com" --conf spark.es.port=443 --conf spark.es.nodes.wan.only=true --conf spark.es.nodes.discovery="false" --conf spark.es.index.auto.create="true" --conf spark.es.resource="domain/doc" --conf spark.es.scheme="https"
    
  5. Finally, the code to write:
    import java.util.Date
    import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
    import org.apache.spark.sql.functions.lit
    import org.elasticsearch.spark._
    import org.elasticsearch.spark.sql._
    val dateformat =  new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
    val currentdate = dateformat.format(new Date)
    val colorsDF = spark.read.json("multilinecolors.json")
    val mcolors = colorsDF.withColumn("Date",lit(currentdate))
    mcolors.write.mode("append")
      .format("org.elasticsearch.spark.sql")
      .option("es.net.http.auth.user", "")
      .option("es.net.http.auth.pass", "")
      .option("es.net.ssl", "true")
      .option("es.net.ssl.cert.allow.self.signed", "true")
      .option("mergeSchema", "true")
      .option("es.index.auto.create", "true")
      .option("es.nodes", "https://xyz.eu-west-1.es.amazonaws.com")
      .option("es.port", "443")
      .option("es.batch.write.retry.wait", "100")
      .save("domainname/_doc")
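
Since the original question was about PySpark, here is a hedged, untested PySpark sketch of the same flow; the jar path, endpoint, and index name are the same placeholders as above:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import current_timestamp, date_format

    # These confs mirror the spark-shell flags above.
    spark = (
        SparkSession.builder
        .config("spark.jars", "elasticsearch-spark-20_2.11-7.1.1.jar")
        .config("spark.es.nodes", "xyz.eu-west-1.es.amazonaws.com")
        .config("spark.es.port", "443")
        .config("spark.es.nodes.wan.only", "true")
        .config("spark.es.nodes.discovery", "false")
        .getOrCreate()
    )

    colorsDF = spark.read.json("multilinecolors.json")
    # Same timestamp column as the Scala version, built with Spark functions.
    mcolors = colorsDF.withColumn(
        "Date", date_format(current_timestamp(), "yyyy-MM-dd'T'HH:mm:ss"))

    (mcolors.write.mode("append")
        .format("org.elasticsearch.spark.sql")
        .option("es.net.ssl", "true")
        .option("es.index.auto.create", "true")
        .option("es.nodes", "https://xyz.eu-west-1.es.amazonaws.com")
        .option("es.port", "443")
        .save("domainname/_doc"))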
Carlos Gomez

Can you try with the SparkConf settings below?

import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.es.index.auto.create", "true")
  .set("spark.es.nodes", "yourESaddress")
  .set("spark.es.port", "9200")
  .set("spark.es.net.http.auth.user", "")
  .set("spark.es.net.http.auth.pass", "")
  .set("spark.es.resource", indexName)
  .set("spark.es.nodes.wan.only", "true")

If you still face the problem, then set es.net.ssl = true and see.

If you still get the error, try adding the configs below:

'es.resource' = 'ctrl_rater_resumen_lla/hb',
'es.nodes' = 'localhost',
'es.port' = '9200',
'es.index.auto.create' = 'true',
'es.index.read.missing.as.empty' = 'true',
'es.net.ssl' = 'false',
'es.nodes.client.only' = 'false',
'es.nodes.wan.only' = 'true',
'es.net.http.auth.user' = 'xxxxx',
'es.net.http.auth.pass' = 'xxxxx',
'es.nodes.discovery' = 'false'
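
If it helps, here is a sketch of feeding that list of settings to a PySpark writer; df stands for your DataFrame, and the values are copied from the list above:

    # df is your DataFrame to index; settings collected in one dict.
    es_conf = {
        "es.resource": "ctrl_rater_resumen_lla/hb",
        "es.nodes": "localhost",
        "es.port": "9200",
        "es.index.auto.create": "true",
        "es.index.read.missing.as.empty": "true",
        "es.net.ssl": "false",
        "es.nodes.client.only": "false",
        "es.nodes.wan.only": "true",
        "es.net.http.auth.user": "xxxxx",
        "es.net.http.auth.pass": "xxxxx",
        "es.nodes.discovery": "false",
    }

    writer = df.write.mode("append").format("org.elasticsearch.spark.sql")
    for key, value in es_conf.items():
        writer = writer.option(key, value)
    writer.save("ctrl_rater_resumen_lla/hb")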

sathya