
Amazon EMR, Apache Spark 2.3, Apache Kafka, ~10 million records per day.

Apache Spark is used to process events in 5-minute batches. Once a day, worker nodes die and AWS automatically reprovisions them. Reviewing the log messages, it looks like the nodes run out of space, but they have about 1 TB of storage each.
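The job shape is roughly the following (a simplified sketch rather than the actual code; the broker, topic and group names are placeholders, and it assumes the spark-streaming-kafka-0-10 integration):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object EventBatchJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("event-batch-job")
    // 5-minute micro-batch interval
    val ssc = new StreamingContext(conf, Minutes(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",            // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "event-batch-job",
      "auto.offset.reset" -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("events"), kafkaParams)  // placeholder topic
    )

    // Placeholder processing: count records per 5-minute batch
    stream.map(_.value()).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```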

Has anyone had issues with storage space in cases where it should be more than enough?

I suspect that log aggregation fails to copy the logs to the S3 bucket properly; as far as I can see, that should be done automatically by the Spark process.

What kind of information should I provide to help resolve this issue?

Thank you in advance!

oivoodoo
  • Hi, thanks for an interesting question. Do you have some more context, i.e. why are the nodes dying? Did you look at resource usage to get an idea of whether it's caused by the application (e.g. OOM) or by hardware? Is your stream consistent throughout the day, or does it have more work in certain periods? The failures of a similar type I've seen (Spark Streaming on EMR) were caused by log files that were never rolled over. At some point HDFS had no space left and things started to fail. – Bartosz Konieczny Oct 19 '18 at 06:52
  • Hi @bartosz25. In case you saw issues caused by the log files, did you manage to solve them somehow? – oivoodoo Oct 23 '18 at 13:05
  • @bartosz25 some notes about the issue : https://gist.github.com/oivoodoo/f4272b73b3c732576b8ab23427357155 – oivoodoo Oct 23 '18 at 13:08
  • @bartosz25 https://gist.github.com/oivoodoo/989fe67dee0712c39e7a3162b4ec8a5d – oivoodoo Oct 23 '18 at 18:46
  • @oivoodoo why do you think it is a storage-related problem? Can you share the error from the executors and also from the driver? – Cosmin Oct 24 '18 at 10:29
  • @Cosmin I don't have errors from Spark; the node stopped responding to the master: `Last state change reason: Master was unable to communicate with this instance.` The job in the Hadoop dashboard has FAILED state and a TIMEOUT error on log aggregation. – oivoodoo Oct 24 '18 at 11:36

2 Answers


I had a similar issue with a Structured Streaming app on EMR: disk space was rapidly increasing to the point of stalling/crashing the application.

In my case the fix was to disable the Spark event log by setting:

spark.eventLog.enabled=false

http://queirozf.com/entries/spark-streaming-commong-pitfalls-and-tips-for-long-running-streaming-applications#aws-emr-only-event-logs-under-hdfs-var-log-spark-apps-when-using-a-history-server
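For reference, this can be applied per job with spark-submit --conf spark.eventLog.enabled=false, or cluster-wide on EMR through a configuration classification; the snippet below is just one way to do it:

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.eventLog.enabled": "false"
    }
  }
]
```

Note that with the event log disabled the application no longer shows up in the Spark History Server, so this trades debuggability for disk space.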

bp2010
  • I fixed it by using a custom log4j.properties in /etc/spark/. You can find an example here: https://gist.github.com/oivoodoo/d34b245d02e98592eff6a83cfbc401e3 – oivoodoo Nov 01 '18 at 09:43

I believe I fixed the issue using a custom log4j.properties: on deploy to Amazon EMR I replaced /etc/spark/log4j.properties and then ran spark-submit with my streaming application.

Now it's working well.

https://gist.github.com/oivoodoo/d34b245d02e98592eff6a83cfbc401e3
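The gist has the full file; purely as an illustration of the idea (not the exact gist contents), the key change is to route the container logs through a size-bounded rolling appender so they cannot grow without limit:

```properties
# Illustrative only – see the gist for the actual configuration used.
log4j.rootCategory=INFO, rolling

log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log
log4j.appender.rolling.MaxFileSize=50MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

The ${spark.yarn.app.container.log.dir} reference keeps the files where YARN expects them so log aggregation still works, and MaxFileSize/MaxBackupIndex are what bound the disk usage.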

It could also be helpful for anyone running a streaming application who needs to roll out updates with a graceful stop; a simplified sketch of the pattern follows the links below.

https://gist.github.com/oivoodoo/4c1ef67544b2c5023c249f21813392af

https://gist.github.com/oivoodoo/cb7147a314077e37543fdf3020730814
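The gists have the full details; as a rough sketch of the general pattern only (not the gist code, and the marker-file names here are just an illustration), one common approach is to have the driver poll for an external marker file and stop the StreamingContext gracefully so the current batch completes before the JVM exits:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.streaming.StreamingContext

object GracefulStop {
  // Block until an external "stop" marker file appears, then shut down gracefully
  // so the in-flight micro-batch is allowed to finish. Call this after ssc.start()
  // instead of ssc.awaitTermination().
  def awaitStopMarker(ssc: StreamingContext, markerPath: String, checkIntervalMs: Long = 10000L): Unit = {
    val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
    var markerSeen = false
    while (!markerSeen) {
      // Returns true if the context terminated on its own within the timeout.
      if (ssc.awaitTerminationOrTimeout(checkIntervalMs)) return
      markerSeen = fs.exists(new Path(markerPath)) // hypothetical marker file, e.g. on HDFS or S3
    }
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```

The deploy pipeline can then create (and later remove) the marker file instead of killing the YARN application outright.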

oivoodoo