
My script is written in Python and was working well on DSE 4.8 without a Docker environment. After upgrading to DSE 5.0.4 and running it in a Docker environment, I get the RPC error below. I was previously on DSE's Spark 1.4.1; now I am using Spark 1.6.2.

The host OS is CentOS 7.2, and the Docker OS is the same. We submit the task with Spark, and we tried giving the executors 2G, 4G, 6G, and 8G of memory; all of them produce the same error message.
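
For reference, a minimal sketch of how the executor memory can be set from the Python side (the app name and the `spark.cores.max` value here are illustrative, not our exact settings; only the executor memory values reflect what we actually tried):

    # Sketch of the Spark 1.6 configuration; only spark.executor.memory
    # reflects values we actually tried (2g / 4g / 6g / 8g).
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = (SparkConf()
            .setAppName("user_profile_step1")       # illustrative name
            .set("spark.executor.memory", "4g")     # also tried 2g, 6g, 8g
            .set("spark.cores.max", "24"))          # illustrative value
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)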

The same Python script ran without issues in my previous environment, but after the upgrade it no longer works properly.

The Scala operations run normally in the current environment; only the Python part has the issue. Resetting the hosts hasn't resolved it, and recreating the Docker container didn't help either.

EDIT:

Maybe my map/reduce function is too complicated; the issue might be there, but I'm not sure.
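
The part of the script that fails (per the traceback below) is essentially this write of the map/reduce output to Cassandra; `article_up_save_rdd` and `df_schema` are built earlier by that logic:

    # Simplified shape of the failing step, taken from the traceback below;
    # article_up_save_rdd and df_schema are produced by the map/reduce logic.
    article_up_df = sqlContext.createDataFrame(article_up_save_rdd, df_schema)
    (article_up_df.write
        .format('org.apache.spark.sql.cassandra')
        .options(keyspace='archive', table='articles_up_update')
        .save(mode='append'))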

Specs of the environment: the cluster consists of 6 hosts; each host has a 16-core CPU, 32 GB of memory, and a 500 GB SSD.

Any idea how to fix this issue? Also, what does this error message mean? Many thanks! Let me know if you need more info.

Error log:

Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
WARN  2017-02-26 10:14:08,314 org.apache.spark.scheduler.TaskSetManager: Lost task 47.1 in stage 88.0 (TID 9705, 139.196.190.79): TaskKilled (killed intentionally)
Traceback (most recent call last):
  File "/data/user_profile/User_profile_step1_classify_articles_common_sc_collect.py", line 1116, in <module>
    compute_each_dimension_and_format_user(article_by_top_all_tmp)
  File "/data/user_profile/User_profile_step1_classify_articles_common_sc_collect.py", line 752, in compute_each_dimension_and_format_user
    sqlContext.createDataFrame(article_up_save_rdd, df_schema).write.format('org.apache.spark.sql.cassandra').options(keyspace='archive', table='articles_up_update').save(mode='append')
  File "/opt/dse-5.0.4/resources/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 395, in save
WARN  2017-02-26 10:14:08,336 org.apache.spark.scheduler.TaskSetManager: Lost task 63.1 in stage 88.0 (TID 9704, 139.196.190.79): TaskKilled (killed intentionally)
  File "/opt/dse-5.0.4/resources/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/opt/dse-5.0.4/resources/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
  File "/opt/dse-5.0.4/resources/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o795.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 619 in stage 88.0 failed 4 times, most recent failure: Lost task 619.3 in stage 88.0 (TID 9746, 139.196.107.73): ExecutorLostFailure (executor 59 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$han

Docker command:

docker run -d --net=host -i --privileged \
    -e SEEDS=10.XX.XXx.XX1,10.XX.XXx.XXX \
    -e CLUSTER_NAME="MyCluster" \
    -e LISTEN_ADDRESS=10.XX.XXx.XX \
    -e BROADCAST_RPC_ADDRESS=139.XXX.XXX.XXX \
    -e RPC_ADDRESS=0.0.0.0 \
    -e STOMP_INTERFACE=10.XX.XXx.XX \
    -e HOSTS=139.XX.XXx.XX \
    -v /data/dse/lib/cassandra:/var/lib/cassandra \
    -v /data/dse/lib/spark:/var/lib/spark \
    -v /data/dse/log/cassandra:/var/log/cassandra \
    -v /data/dse/log/spark:/var/log/spark \
    -v /data/agent/log:/opt/datastax-agent/log \
    --name dse_container registry..xxx.com/rechao/dse:5.0.4 -s
  • You updated more than just Datastax. You now use Docker, and the error clearly mentions `exceeding thresholds or network issues`, so what is your host OS and what memory allocation are you giving the executors? – OneCricketeer Feb 27 '17 at 05:42
  • @cricket_007 The host OS is CentOS 7.2 and the Docker OS is the same. We use Spark to submit the task, and we tried giving executors 2G, 4G, 6G and 8G; they all give the same error message. Any idea why? Thanks – peter Feb 27 '17 at 05:49
  • Okay, then it's likely a networking issue. Do the containers expose the appropriate ports? – OneCricketeer Feb 27 '17 at 05:49
  • I run Docker in host mode, so it doesn't need to map ports to the host in the production environment. Is this correct? – peter Feb 27 '17 at 05:52
  • You mean `--net=host`? I don't know. Never tried it, but I have read that it works "unexpectedly" compared to what you might think – OneCricketeer Feb 27 '17 at 06:03
  • Yes, that's what I mean. I added a couple of details to my post above. – peter Feb 27 '17 at 06:05
  • Can you add your related docker commands? – OneCricketeer Feb 27 '17 at 06:06
  • @cricket_007 Just added it to the post. Also, maybe it's related to my map/reduce function; I am not sure. – peter Feb 27 '17 at 06:17
  • How many resources did you allocate for your Spark job (CPU / memory / executors / executor memory)? And how many partitions did you use during your Spark operation? – tauitdnmd Feb 27 '17 at 06:37
  • I would at least try this same code without using Docker. It's not clear why you thought that adding Docker into the mix was a good idea. – OneCricketeer Feb 27 '17 at 06:38
  • @tauitdnmd Around 3-6, but I don't know exactly because it's set to automatic (see the partitioning sketch after these comments). I tried giving executors 2G, 4G, 6G and 8G and they all give the same error message. 16 cores, 32 GB memory, 500 GB SSD. – peter Feb 27 '17 at 08:44
  • @cricket_007 I thought it would make deployment much easier and also simplify server upgrades/migration, which is what I recently did. I don't have the setup without Docker available, so I would need to reinstall everything to try this. – peter Feb 27 '17 at 08:48
  • Are you using the standalone scheduler? I would think if using Docker/containers, then Mesos would be preferred – OneCricketeer Feb 27 '17 at 13:07
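
Regarding the partition count asked about in the comments: it is not set explicitly. A minimal sketch of how it could be inspected and controlled before the Cassandra write (the RDD name matches the traceback; the partition count of 200 is purely illustrative, not a tested value):

    # Hypothetical sketch: check and explicitly set the number of partitions
    # before the write; 200 is an arbitrary illustrative value.
    print(article_up_save_rdd.getNumPartitions())
    article_up_save_rdd = article_up_save_rdd.repartition(200)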

1 Answer


Docker is fine; increasing the host memory to 64 GB fixed this issue.
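
For anyone hitting the same "Remote RPC client disassociated ... containers exceeding thresholds" error: besides adding host memory, it may also help to keep the total memory Spark requests well below what the host can provide. A minimal sketch with purely illustrative values, not part of the original fix:

    # Hypothetical sketch: cap per-executor memory and total cores so the
    # executors cannot exhaust the host/container memory; values are illustrative.
    from pyspark import SparkConf

    conf = (SparkConf()
            .set("spark.executor.memory", "4g")   # per-executor heap
            .set("spark.cores.max", "12"))        # bounds how many executors run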
