
I'm using SageMaker connected to an EMR cluster via sparkmagic and Livy. Very frequently I get the following error at session startup, before running any code:

> The code failed because of a fatal error:     Session <ID> unexpectedly
> reached final status 'dead'. See logs: stdout: 
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (mmap) failed to map <bytes num> bytes for committing reserved memory.
> # An error report file with more information is saved as:
> # /tmp/hs_err_pid24900.log

I have tried some solutions found on the web, such as reducing driver and worker memory, but it does not work even when setting a very small memory size (512 MB). The question is: how can I fix this, or how can I debug it, considering I'm not an admin and have no access to the Livy server or the cluster OS? And where (on which host) is the log file mentioned in the error (/tmp/hs_err_pid24900.log)?

Update: I found this happens only when using:

spark.yarn.dist.archives

to distribute a conda environment (packed as a tar.gz) to the driver and workers. The weird thing is that if I remove this setting, it doesn't complain about memory issues even after making driver and executor memory really big, so I know there is sufficient memory; but adding it makes the session crash. Is there any Java property or limit that gets triggered when spark.yarn.dist.archives is used?
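For reference, this is roughly the shape of the configuration cell in question; the bucket, archive name, and alias below are placeholders rather than my real values:

%%configure -f
{
    "conf": {
        "spark.yarn.dist.archives": "s3://some-bucket/envs/conda_env.tar.gz#environment"
    }
}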

Luis Leal

1 Answer


It seems like you are working with a large amount of data, so the fix should be to increase:

  • The number of worker nodes
  • The memory size of each worker node (increase it, not decrease it); see the example configuration cell after this list
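If you cannot change the cluster itself, the driver and executor sizing can at least be set per Livy session from the notebook when the session is created. A minimal sketch (the sizes are illustrative, not a recommendation for your cluster):

%%configure -f
{
    "driverMemory": "4g",
    "executorMemory": "4g",
    "executorCores": 2,
    "numExecutors": 4
}

These keys are passed through to Livy's session-creation request, so they take effect before any code runs.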

For further debugging, you should be able to access the master node through SSH and then retrieve the log file from that folder.

Also, there are a couple of magics you can use for debugging: `%%info` and `%%logs`.

You just need to load sparkmagic to be able to talk to Spark from your Python notebook. Once loaded, the `%manage_spark` line magic and the `%%spark` magic are available:

%load_ext sparkmagic.magics

If you want to look at the Livy logs for this session, simply run a cell like so:

%spark logs
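If the notebook is instead running one of the sparkmagic wrapper kernels (for example the PySpark kernel) rather than a plain IPython kernel, the equivalent cell magics mentioned above can be used directly once a session exists, e.g.:

%%logs

which prints the Livy logs for the current session, and `%%info`, which shows information about the current Livy endpoint and sessions.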

nferreira78
  • Thanks for your answer, my bad, I forgot to mention: this error happens at session creation, not at run time, so at that point no dataset has been pulled, and unfortunately I don't have access to the nodes through SSH. – Luis Leal Jun 30 '22 at 18:05
  • ok well, I feel your pain. Using Sagemaker connecting to an EMR cluster via sparkmagic and livy, that's a lot of things going on with very little control or chance to debug. However, I would ask if you can please then provide a bit more detail on how you setup `spark.yarn.dist.archives`. Are you using the method `.setConf("spark.yarn.dist.archives", archives)`? If yes, how big (in MB) is the list of archives to be extracted into the working directory of each executor? If not, how are you using `spark-submit`? – nferreira78 Jul 01 '22 at 10:31
  • I am using a sparkmagic notebook, so I add it to the %%conf like this: `%%configure -f { "conf":{ "spark.yarn.dist.archives":"" } } ` – Luis Leal Jul 01 '22 at 21:45
  • and how big (in MB) is that file after decompression? Can you do a test with a smaller archive (fewer `conda` dependencies) to verify whether it also fails? If it doesn't fail, I would suggest removing one `conda` dependency at a time until you find the one that is causing the memory leak. Under SageMaker, click View clusters; under `Application user interface` in the `Clusters` tab you can see the Spark server history, which allows you to debug. – nferreira78 Jul 04 '22 at 11:49
  • Also there are a couple of magics you can use: `%%info` and `%%logs` – nferreira78 Jul 04 '22 at 11:55
  • I've recently updated my answer to show how to look at the Livy logs – nferreira78 Jul 04 '22 at 12:01
  • Hi @LuisLeal, this bounty has ended and I wondered if you could provide more feedback on the suggestions and answer edited. Thanks – nferreira78 Jul 06 '22 at 15:54