I have set up a JupyterHub and configured a pyspark kernel for it. When I open a pyspark notebook (under username Jeroen), two processes are added: a Python process and a Java process. The Java process is assigned 12g of virtual memory (see image). When running a test script on a range of 1B numbers, it grows to 22g. Is that something to worry about when we work on this server with multiple users? And if it is, how can I prevent Java from allocating so much memory?
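
A minimal sketch of the kind of test script used (the exact code may have differed; the app name is just an example):

    from pyspark import SparkContext

    sc = SparkContext(appName="memtest")  # starts the Java (driver JVM) process
    rdd = sc.range(10**9)                 # a range of 1B numbers
    print(rdd.sum())                      # forces evaluation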
1 Answer
You don't need to worry about virtual memory usage; resident memory (the RES column) is much more important here.
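
A sketch of how to compare the two for the notebook's Java process, using the third-party psutil package (assumed installed; matching processes by the name "java" is an illustrative heuristic):

    import psutil  # third-party, assumed installed

    for p in psutil.process_iter(["name", "memory_info"]):
        if p.info["name"] == "java":
            mem = p.info["memory_info"]
            # vms is the virtual size (VIRT): reserved address space, mostly harmless.
            # rss is the resident size (RES): actual RAM in use, what really matters.
            print(p.pid, mem.vms, mem.rss)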
You can control the size of the JVM heap with the --driver-memory option passed to Spark (if you use the pyspark kernel on JupyterHub, you can find it in the kernel's environment under the PYSPARK_SUBMIT_ARGS key). This is not exactly the memory limit for your application (there are other memory regions on the JVM), but it is very close.
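
For example, a sketch of setting the heap size through PYSPARK_SUBMIT_ARGS (master URL and memory value are placeholders; the trailing pyspark-shell token is required when pyspark itself reads this variable):

    import os

    # Must be set before the SparkContext (and thus the JVM) is created;
    # the heap size cannot be changed once the driver JVM is running.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--master local[4] --driver-memory 1g pyspark-shell"
    )

    from pyspark import SparkContext

    sc = SparkContext(appName="notebook")           # JVM starts here with a 1g heap
    print(sc.getConf().get("spark.driver.memory"))  # should print '1g'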
So, when you have a multi-user setup, you should teach users to set an appropriate driver memory (the minimum they need for their processing) and to shut down notebooks after they finish work.

Mariusz
- --driver-memory does appear to limit memory used, because setting a low value generates an OutOfMemory error when caching a large chunk of memory. But it does not reduce the virtual memory assigned. – Jeroen Vuurens Oct 22 '17 at 06:31
- Making users responsible for setting an appropriate driver memory is not really a solution. However, I did configure a cull-idle routine that shuts down idle kernels (a configuration sketch appears after these comments). – Jeroen Vuurens Oct 22 '17 at 06:34
- You don't need to worry about virtual memory. It's virtual, so it's free ;-) It is strange that you are getting OOM when caching – do you run pyspark in local mode? If you move to YARN, cached RDDs will be stored on executors and the standard driver memory (1GB AFAIR) will be enough for most usages. – Mariusz Oct 22 '17 at 19:38
- Thanks, yes I've increased the driver-memory to 1GB and have not seen an OOM since. We are running on an HPC node (48 threads, 65G RAM) and not on a cluster. – Jeroen Vuurens Oct 23 '17 at 19:49
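
Regarding the cull-idle routine mentioned in the comments, a configuration sketch for jupyterhub_config.py, assuming the jupyterhub-idle-culler package (a later packaging of the original cull_idle_servers.py script) is installed; the timeout is an example value:

    # jupyterhub_config.py
    c.JupyterHub.services = [
        {
            "name": "idle-culler",
            "command": [
                "python3", "-m", "jupyterhub_idle_culler",
                "--timeout=3600",  # shut down servers idle for over an hour
            ],
        }
    ]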