I am using a 3-node Spark cluster to read data from a single-node MongoDB and index it into a 3-node Elasticsearch cluster. Each Spark and Elasticsearch node has 16 GB RAM and 4 cores. I am using PySpark.
Following case 1 of this link, I have 2 executors, each with 3 cores and 7520 MB of memory. The Spark driver gets 4000 MB and driver.memoryOverhead is 800 MB.
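For reference, this is roughly how that allocation maps onto Spark properties; a minimal sketch assuming the values above are passed through SparkConf (they could equally be spark-submit flags, and the memoryOverhead key name can differ by Spark version):

from pyspark import SparkConf, SparkContext

# Sketch of the resource allocation described above (values taken from the text;
# the application name is just a placeholder).
conf = (SparkConf()
        .setAppName("mongo-to-es")
        .set("spark.executor.instances", "2")         # 2 executors
        .set("spark.executor.cores", "3")             # 3 cores per executor
        .set("spark.executor.memory", "7520m")        # 7520 MB per executor
        .set("spark.driver.memory", "4000m")          # 4000 MB driver heap
        .set("spark.driver.memoryOverhead", "800m"))  # 800 MB driver overhead
sc = SparkContext(conf=conf)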
A MongoDB collection of up to 400 GB (around 100,000 Spark partitions, i.e. roughly 4 MB per partition) is indexed without problems. Beyond that (tested with collections of 600 GB and more) I run into a Full GC loop and finally OutOfMemoryError: GC overhead limit exceeded.
My spark code is:
import json

# m_ip_port, m_projection, e_ip_port, e_index and e_type are defined elsewhere;
# m_projection is the list of fields to keep, e.g. ['_id', 'key1', 'key2', 'key3']
rdd = sc.mongoRDD('mongodb://user:pass@ip:port/db.coll?authSource=admin_db')

def include_id(doc):
    # replace MongoDB's ObjectId with a plain string and record the source host
    doc["mongo_id"] = str(doc.pop('_id', ''))
    doc["ip"] = m_ip_port
    return doc

def it_projection(doc_it):
    # keep only the projected fields of each document in the partition
    projection_list = m_projection
    for doc in doc_it:
        projected_doc = {i: doc.get(i) for i in projection_list}
        yield projected_doc

json_rdd = rdd.mapPartitions(it_projection) \
              .map(include_id) \
              .map(json.dumps) \
              .map(lambda x: ('key', x))

json_rdd.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={
        "es.nodes": e_ip_port,
        "es.resource": e_index + '/' + e_type,
        "es.input.json": "true"
    }
)
Solutions I tried: I increased executor memory to 11 GB, increased driver memory to 8 GB, and set the number of executors to 5, but none of these had any effect.
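One way to double-check that such overrides actually take effect is to read them back from the running SparkContext; a minimal sketch (the property keys are the standard ones, the output format is arbitrary):

# Read the effective settings back from the driver; "<not set>" means the
# property fell back to its default value.
for key in ("spark.executor.instances",
            "spark.executor.memory",
            "spark.driver.memory"):
    print(key, "=", sc.getConf().get(key, "<not set>"))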
Finally I decided to analyse the driver heap using JHAT. I collected two Java heap dumps, one while the program was healthy and one while it was stuck in the Full GC loop. The top 10 rows of the heap histogram are shown below:
I observed that the class org.apache.spark.scheduler.AccumulableInfo has grown significantly in both count and size.
AccumulableInfo contains information about a task's local updates to an Accumulable. Accumulators are required by the driver to keep count of how many partitions have finished execution, so it seems we cannot simply do without them.
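For context, a user-level accumulator in PySpark looks roughly like the sketch below; the AccumulableInfo instances piling up on the driver describe the per-task updates of Spark's internal accumulators, which serve a similar purpose but are created automatically (the names here are purely illustrative):

# Illustrative only: count finished partitions with an explicit accumulator.
# Spark's internal task-metric accumulators behave analogously, and the driver
# records an AccumulableInfo entry for each task's update.
finished_partitions = sc.accumulator(0)

def mark_done(doc_it):
    for doc in doc_it:
        yield doc
    finished_partitions.add(1)   # executed once the partition is exhausted

_ = rdd.mapPartitions(mark_done).count()
print("partitions finished:", finished_partitions.value)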
So my question is: what am I doing wrong? What can I do to index this much data into ES using Spark? Should I give Spark more resources? I eventually want to use this for MongoDB collections of a few TB.