
My script is written in Python and was working well on DSE 4.8 without a Docker environment. After upgrading to DSE 5.0.4 and running it in a Docker environment, I get the RPC error below. I was previously on DSE's Spark 1.4.1; now I am using Spark 1.6.2.

The host OS is CentOS 7.2, and the Docker OS is the same. We submit the task with Spark, and we tried giving the executors 2G, 4G, 6G, and 8G of memory; all of them produce the same error message.
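
For reference, a minimal sketch of how the executor memory can be set from the Python side (the app name and the `spark.cores.max` value here are illustrative, not our exact settings; only the executor memory values reflect what we actually tried):

    # Sketch of the Spark 1.6 configuration; only spark.executor.memory
    # reflects values we actually tried (2g / 4g / 6g / 8g).
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = (SparkConf()
            .setAppName("user_profile_step1")       # illustrative name
            .set("spark.executor.memory", "4g")     # also tried 2g, 6g, 8g
            .set("spark.cores.max", "24"))          # illustrative value
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)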

The same Python script ran without issues in my previous environment, but after the upgrade it no longer works properly.

The Scala operations run normally in the current environment; only the Python part has the issue. Resetting the hosts hasn't resolved it, and recreating the Docker container didn't help either.

EDIT:

Maybe my map/reduce function is too complicated; the issue might be there, but I'm not sure.
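
The part of the script that fails (per the traceback below) is essentially this write of the map/reduce output to Cassandra; `article_up_save_rdd` and `df_schema` are built earlier by that logic:

    # Simplified shape of the failing step, taken from the traceback below;
    # article_up_save_rdd and df_schema are produced by the map/reduce logic.
    article_up_df = sqlContext.createDataFrame(article_up_save_rdd, df_schema)
    (article_up_df.write
        .format('org.apache.spark.sql.cassandra')
        .options(keyspace='archive', table='articles_up_update')
        .save(mode='append'))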

Specs of the environment: the cluster consists of 6 hosts; each host has a 16-core CPU, 32 GB of memory, and a 500 GB SSD.

Any idea how to fix this issue? Also, what does this error message mean? Many thanks! Let me know if you need more info.

Error log:

Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
WARN  2017-02-26 10:14:08,314 org.apache.spark.scheduler.TaskSetManager: Lost task 47.1 in stage 88.0 (TID 9705, 139.196.190.79): TaskKilled (killed intentionally)
Traceback (most recent call last):
  File "/data/user_profile/User_profile_step1_classify_articles_common_sc_collect.py", line 1116, in <module>
    compute_each_dimension_and_format_user(article_by_top_all_tmp)
  File "/data/user_profile/User_profile_step1_classify_articles_common_sc_collect.py", line 752, in compute_each_dimension_and_format_user
    sqlContext.createDataFrame(article_up_save_rdd, df_schema).write.format('org.apache.spark.sql.cassandra').options(keyspace='archive', table='articles_up_update').save(mode='append')
  File "/opt/dse-5.0.4/resources/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 395, in save
WARN  2017-02-26 10:14:08,336 org.apache.spark.scheduler.TaskSetManager: Lost task 63.1 in stage 88.0 (TID 9704, 139.196.190.79): TaskKilled (killed intentionally)
  File "/opt/dse-5.0.4/resources/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/opt/dse-5.0.4/resources/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
  File "/opt/dse-5.0.4/resources/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o795.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 619 in stage 88.0 failed 4 times, most recent failure: Lost task 619.3 in stage 88.0 (TID 9746, 139.196.107.73): ExecutorLostFailure (executor 59 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$han

Docker command:

docker run -d --net=host -i --privileged \
    -e SEEDS=10.XX.XXx.XX1,10.XX.XXx.XXX \
    -e CLUSTER_NAME="MyCluster" \
    -e LISTEN_ADDRESS=10.XX.XXx.XX \
    -e BROADCAST_RPC_ADDRESS=139.XXX.XXX.XXX \
    -e RPC_ADDRESS=0.0.0.0 \
    -e STOMP_INTERFACE=10.XX.XXx.XX \
    -e HOSTS=139.XX.XXx.XX \
    -v /data/dse/lib/cassandra:/var/lib/cassandra \
    -v /data/dse/lib/spark:/var/lib/spark \
    -v /data/dse/log/cassandra:/var/log/cassandra \
    -v /data/dse/log/spark:/var/log/spark \
    -v /data/agent/log:/opt/datastax-agent/log \
    --name dse_container registry..xxx.com/rechao/dse:5.0.4 -s
  • You updated more than just Datastax. You now use Docker, and the error clearly mentions `exceeding thresholds or network issues`, so what is your host OS and what memory allocation are you giving the executors? – OneCricketeer Feb 27 '17 at 05:42
  • @cricket_007 The host OS is CentOS 7.2 and the Docker OS is the same. We use Spark to submit the task, and we tried giving executors 2G, 4G, 6G and 8G; they all give the same error message. Any idea why? Thanks – peter Feb 27 '17 at 05:49
  • Okay, then it's likely a networking issue. Do the containers expose the appropriate ports? – OneCricketeer Feb 27 '17 at 05:49
  • I run Docker in host mode, so it doesn't need to map ports to the host in the production environment. Is this correct? – peter Feb 27 '17 at 05:52
  • You mean `--net=host`? I don't know. Never tried it, but I have read that it works "unexpectedly" compared to what you might think – OneCricketeer Feb 27 '17 at 06:03
  • Yes, that's what I mean. I added a couple of details to my post above. – peter Feb 27 '17 at 06:05
  • Can you add your related docker commands? – OneCricketeer Feb 27 '17 at 06:06
  • @cricket_007 Just added it to the post. Also, maybe it's related to my map/reduce function; I am not sure. – peter Feb 27 '17 at 06:17
  • How many resources did you allocate for your Spark job (CPU / memory / executors / executor memory)? And how many partitions did you use during your Spark operation? – tauitdnmd Feb 27 '17 at 06:37
  • I would at least try this same code without using Docker. It's not clear why you thought that adding Docker into the mix was a good idea. – OneCricketeer Feb 27 '17 at 06:38
  • @tauitdnmd Around 3-6, but I don't know exactly because it's set to automatic (see the partitioning sketch after these comments). I tried giving executors 2G, 4G, 6G and 8G and they all give the same error message. 16 cores, 32 GB memory, 500 GB SSD. – peter Feb 27 '17 at 08:44
  • @cricket_007 I thought it would make deployment much easier and also simplify server upgrades/migration, which is what I recently did. I don't have the setup without Docker available, so I would need to reinstall everything to try this. – peter Feb 27 '17 at 08:48
  • Are you using the standalone scheduler? I would think if using Docker/containers, then Mesos would be preferred – OneCricketeer Feb 27 '17 at 13:07
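
Regarding the partition count asked about in the comments: it is not set explicitly. A minimal sketch of how it could be inspected and controlled before the Cassandra write (the RDD name matches the traceback; the partition count of 200 is purely illustrative, not a tested value):

    # Hypothetical sketch: check and explicitly set the number of partitions
    # before the write; 200 is an arbitrary illustrative value.
    print(article_up_save_rdd.getNumPartitions())
    article_up_save_rdd = article_up_save_rdd.repartition(200)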

1 Answer


Docker is fine; increasing the host memory to 64 GB fixed this issue.
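
For anyone hitting the same "Remote RPC client disassociated ... containers exceeding thresholds" error: besides adding host memory, it may also help to keep the total memory Spark requests well below what the host can provide. A minimal sketch with purely illustrative values, not part of the original fix:

    # Hypothetical sketch: cap per-executor memory and total cores so the
    # executors cannot exhaust the host/container memory; values are illustrative.
    from pyspark import SparkConf

    conf = (SparkConf()
            .set("spark.executor.memory", "4g")   # per-executor heap
            .set("spark.cores.max", "12"))        # bounds how many executors run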
