I am seeing a low number of writes to Elasticsearch from a Spark (Java) job.

Here is the configuration:

ES cluster: 13.xlarge machines

 4 instances, each with 4 processors.
 Refresh interval set to -1, replicas set to 0, and the other
 basic settings recommended for faster indexing.

Spark :

2 node EMR cluster with

 2 Core instances
  - 8 vCPU, 16 GiB memory, EBS only storage
  - EBS Storage:1000 GiB

1 Master node
  - 1 vCPU, 3.8 GiB memory, 410 GB SSD storage

ES index has 16 shards defined in mapping.

The job runs with the following config:

executor-memory - 8g
spark.executor.instances=2
spark.executor.cores=4

and using

es.batch.size.bytes - 6MB
es.batch.size.entries - 10000
es.batch.write.refresh - false
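For reference, these connector settings are typically applied on the SparkConf before the context is created (this is a sketch assembled from the settings above; the es.nodes value is a placeholder):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: apply the elasticsearch-hadoop bulk settings listed above.
SparkConf conf = new SparkConf().setAppName("SparkES Application");
conf.set("es.nodes", "<ES host>");              // placeholder, not from the post
conf.set("es.batch.size.bytes", "6mb");         // flush bulk request at 6 MB
conf.set("es.batch.size.entries", "10000");     // or at 10,000 documents
conf.set("es.batch.write.refresh", "false");    // do not refresh after each bulk
JavaSparkContext jsc = new JavaSparkContext(conf);
```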

With this configuration, I try to load 1 million documents (each about 1,300 bytes), and the load runs at roughly 500 docs per second per ES node.
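A quick sanity check on the bulk sizing (my arithmetic, not from the post, assuming 6 MB means 6 × 1024 × 1024 bytes and that the connector flushes on whichever limit is hit first): at ~1,300 bytes per document, the byte limit is reached well before the 10,000-entry limit, so each bulk request carries roughly 4,800 documents.

```java
// Sketch: which of the two bulk limits above triggers first?
public class BatchMath {
    public static void main(String[] args) {
        long docBytes = 1300;                  // per-document size from the question
        long bytesLimit = 6L * 1024 * 1024;    // es.batch.size.bytes = 6mb
        long entriesLimit = 10_000;            // es.batch.size.entries

        long docsAtBytesLimit = bytesLimit / docBytes;
        long docsPerBulk = Math.min(docsAtBytesLimit, entriesLimit);
        System.out.println(docsPerBulk);       // prints 4839: the 6 MB cap wins
    }
}
```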

In the Spark log I see for each task:

 -1116 bytes result sent to driver

Spark Code

    JavaRDD<String> javaRDD = jsc.textFile("<S3 Path>");
    JavaEsSpark.saveJsonToEs(javaRDD,"<Index name>");
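One thing worth checking (a hedged sketch, not part of the original code): a small number of input partitions from S3 caps write parallelism no matter how many executors exist, since each partition becomes one task holding one ES connection. Repartitioning before the save spreads the bulk requests across all executor cores; the count of 16 here is an assumption chosen to match the index's 16 shards.

```java
JavaRDD<String> javaRDD = jsc.textFile("<S3 Path>");
// Assumption: 16 partitions to match the 16 index shards; tune to
// spark.executor.instances * spark.executor.cores as well.
JavaRDD<String> repartitioned = javaRDD.repartition(16);
JavaEsSpark.saveJsonToEs(repartitioned, "<Index name>");
```

You can verify the current partition count with javaRDD.getNumPartitions() before deciding whether a repartition is needed.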

Also, when I look at the In-Network graph on the ES cluster, traffic is very low, so EMR is not pushing much data over the network. Is there a way to tell Spark to send the right amount of data to make the writes faster?

OR

Is there any other config I am missing to tweak? 500 docs per second per ES instance seems low. Can someone please point out what I am missing in these settings to improve my ES write performance?

Thanks in advance

1 Answer

You may have an issue here: spark.executor.instances=2

You are limited to two executors, where you could have four based on your cluster configuration. I would change this to 4 or greater. I might also try executor-memory = 1500M, cores = 1, instances = 16. I like to leave a little overhead in memory, which is why I dropped from 2G to 1.5G (but you can't write 1.5G, so it has to be 1500M). If you are connecting from your executors, this will improve performance.
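The suggested sizing can be passed straight to spark-submit; the flag names are standard Spark options, and the jar/class names are placeholders, not from the post:

```shell
spark-submit \
  --conf spark.executor.instances=16 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=1500M \
  --class <MainClass> <application.jar>
```

With 2 core nodes of 8 vCPU each, 16 single-core executors use every vCPU, and 16 × 1.5G fits within the 2 × 16 GiB of memory while leaving room for YARN overhead.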

I would need some code to debug further. I wonder if you are connecting to Elasticsearch only in your driver and not on your worker nodes, meaning you are getting one connection instead of one per executor.

Dan Ciborowski - MSFT
  • Thank you so much, Dan, When you say increase executors to 4, you mean increase the EMR cluster to have 4 instances instead of 2? The way I am connecting to ES is via the below code. SparkConf conf = new SparkConf().setAppName("SparkES Application"); – camelBeginner Oct 18 '17 at 16:40
  • SparkConf conf = new SparkConf().setAppName("SparkES Application"); conf.set("es.nodes",""); conf.set("es.batch.size.bytes","6mb"); conf.set("es.batch.size.entries","10000"); conf.set("es.batch.concurrent.request","4"); conf.set("es.batch.write.refresh","false"); conf.set("spark.kryoserializer.buffer","24"); JavaSparkContext jsc = new JavaSparkContext(conf); JavaRDD javaRDD = jsc.textFile("S3 PATH"); JavaEsSpark.saveJsonToEs(javaRDD,"Index name"); – camelBeginner Oct 18 '17 at 16:49
  • and the last two line above is in a method and called from main () and i send a parameter to use in the method loadSNindex(jsc); – camelBeginner Oct 18 '17 at 18:34
  • Also, i verified the # of connections in one of the es node when i am running with 8 executors and 2 cores , i see 4 established connections for port 9200. – camelBeginner Oct 18 '17 at 19:11
  • @camelBeginner "When you say increase executors to 4, you mean increase the EMR cluster to have 4 instances instead of 2? ", no, I mean set 'spark.executor.instances' to 4 instead of 2. Nothing to do with how many VMs you are using. – Dan Ciborowski - MSFT Oct 19 '17 at 17:58
  • Ok, I tried with 4 and more 'spark.executor.instances' and still seeing the same performance. also tried to see # of 9200 connections at es and i see multiple connections, or is there way i can tell whether i use one es connection via driver or each executor uses its own connection? – camelBeginner Oct 19 '17 at 18:22