I'm using SageMaker connected to an EMR cluster via sparkmagic and Livy. Very frequently I get the following error at session startup (before running any code):
> The code failed because of a fatal error: Session <ID> unexpectedly
> reached final status 'dead'. See logs: stdout:
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (mmap) failed to map <bytes num> bytes for committing reserved memory.
> # An error report file with more information is saved as:
> # /tmp/hs_err_pid24900.log
I have tried some solutions found on the web, such as reducing driver and executor memory, but it doesn't work even with a very small memory size (512 MB). My questions are: how can I fix this, or how can I debug it, given that I'm not an admin and have no access to the Livy server or the cluster OS? And on which host is the log file mentioned in the error (/tmp/hs_err_pid24900.log) located?
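For reference, this is roughly how I'm passing the memory settings through sparkmagic's `%%configure` (the values below are illustrative, not my exact configuration):

```
%%configure -f
{
    "driverMemory": "512M",
    "executorMemory": "512M",
    "executorCores": 1,
    "numExecutors": 2
}
```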
Update: I found this happens only when using:
spark.yarn.dist.archives
to distribute a conda environment (as a tar.gz) to the driver and workers. The weird thing is that if I remove this setting, the session starts without any memory complaints even when driver and executor memory are set really high, so I know there is sufficient memory; adding it back makes the session crash. Is there any Java property or limit that gets triggered when spark.yarn.dist.archives is used?
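For context, the setting is applied roughly like this (the S3 path and environment name are placeholders, not my real values):

```
%%configure -f
{
    "conf": {
        "spark.yarn.dist.archives": "s3://my-bucket/my_conda_env.tar.gz#environment",
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./environment/bin/python",
        "spark.executorEnv.PYSPARK_PYTHON": "./environment/bin/python"
    }
}
```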