
We're running a Marathon on top of another Marathon (MoM) in DC/OS.

(screenshot: Marathon memory usage)

The cluster is relatively small, about 40 nodes and 400 running tasks. I was surprised that Marathon doesn't ship with any GC configuration. After a Marathon instance becomes leader, its memory usage grows considerably, especially while handling resource offers.

I noticed that Tomek from Allegro ran into similar problems, but he doesn't mention any specific configuration. Does anyone have a battle-tested configuration?

We're using Marathon 1.5.3.

Related issues:

Tombart
  • Hi, I think GC is done by the containerizer and not marathon. If you're using the Mesos Containerizer, container image GC shipped in 1.5 so you might want to upgrade Mesos if you're using an earlier version. Not sure if container image GC is the same thing needed for Java GC though. – Judith Malnick Apr 24 '18 at 17:03
  • Hi Judith, you're talking about Docker image GC, which is something different. Have a look at [Jörg's talk "No one puts Java in the Container"](https://dcos.io/events/2017/no-one-puts-java-in-the-container/) and then ask yourself: why did we put [Marathon running on Java](https://hub.docker.com/r/mesosphere/marathon/~/dockerfile/) into a container? :) – Tombart Apr 25 '18 at 08:01

1 Answer


The default Java GC configuration is not optimal for running Java applications in a container, as explained in detail by Jörg Schad from Mesosphere. Why Mesosphere doesn't apply the suggested configuration to its own container orchestrator remains a mystery to me.

By default, Java 8 uses ParallelGC (-XX:+UseParallelGC).
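You can check which collector a running JVM actually picked with the standard management beans (a generic sketch, not Marathon-specific; the class name is mine):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class ShowGc {
    public static void main(String[] args) {
        // Lists the collectors the running JVM actually selected.
        // On Java 8 with the ParallelGC default you would typically see
        // "PS Scavenge" (young gen) and "PS MarkSweep" (old gen); with the
        // CMS flags suggested below: "ParNew" and "ConcurrentMarkSweep".
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName());
        }
    }
}
```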

Suggested Java flags:

  • -Xmx1536m Maximum heap size; depends on your cluster size (number of tasks running via Marathon)
  • -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled As response time is critical for a cluster orchestrator, the concurrent collector seems to be the best fit. As the documentation says:

    The Concurrent Mark Sweep (CMS) collector is designed for applications that prefer shorter garbage collection pauses and that can afford to share processor resources with the garbage collector while the application is running.

  • -XX:+UseParNewGC Enables parallel young generation GC; it is turned on automatically with CMS, so this flag just makes it explicit.
  • -XX:ParallelGCThreads=2 By default, the number of GC threads is set to the number of logical processors available on the machine. This makes GC inefficient, especially when the physical machine has 12 (or even more) cores and you're limiting CPUs in Mesos. It should equal the number of CPUs assigned to Marathon.
  • -XX:+UseCMSInitiatingOccupancyOnly Prevents the JVM from using a set of heuristic rules to trigger garbage collection. The heuristics make GC less predictable and usually delay collection until the old generation is almost full. Initiating GC in advance allows it to complete before the old generation fills up, avoiding a Full GC (i.e. a stop-the-world pause).
  • -XX:CMSInitiatingOccupancyFraction=80 Tells the JVM at what old-generation occupancy CMS should be triggered. Basically, it creates a buffer in the heap that can fill with data while CMS is working. 70-80 seems to be a reasonable value: if it's too small, GC will be triggered frequently; if it's too large, GC will be triggered too late.
  • -XX:MaxTenuringThreshold=1 Limits copying of objects into the old generation pool. The default for CMS is 4.
  • -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap The heap will be sized according to the cgroup settings, which is useful if we don't set -Xmx1536m. Both flags are needed on Java 8 (which the mesosphere/marathon Docker image currently uses).
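As a rough back-of-the-envelope check of the occupancy fraction (assuming the Java 8 default NewRatio=2, i.e. the old generation is about 2/3 of the heap; the numbers and class name are illustrative):

```java
public class CmsTriggerMath {
    public static void main(String[] args) {
        // Back-of-the-envelope numbers; assumes the Java 8 default
        // NewRatio=2, i.e. the old generation is ~2/3 of the heap.
        long heapMb = 1536;                    // -Xmx1536m
        long oldGenMb = heapMb * 2 / 3;        // ~1024 MB
        long triggerMb = oldGenMb * 80 / 100;  // -XX:CMSInitiatingOccupancyFraction=80
        System.out.println("CMS kicks in at ~" + triggerMb + " MB of old gen, leaving "
                + (oldGenMb - triggerMb) + " MB of headroom while it runs");
        // → CMS kicks in at ~819 MB of old gen, leaving 205 MB of headroom while it runs
    }
}
```

That headroom is what absorbs new promotions while the concurrent collection is still in progress; if it fills up before CMS finishes, you get a concurrent mode failure and a stop-the-world Full GC.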

Here's the Marathon configuration we're currently using:

 "cpus": 2,
 "mem": 2304,
 "env": {
        "JVM_OPTS": "-Xms512m -Xmx1536m -XX:+PrintGCApplicationStoppedTime -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseParNewGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=80 -XX:MaxGCPauseMillis=200 -XX:MaxTenuringThreshold=1 -XX:SurvivorRatio=90 -XX:TargetSurvivorRatio=9 -XX:ParallelGCThreads=2 "
     },
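To confirm that the options from JVM_OPTS actually reached the Marathon JVM, you can print the flags the process was started with (a generic sketch to run inside the same container; alternatively, jcmd <pid> VM.flags against the Marathon process shows the effective values):

```java
import java.lang.management.ManagementFactory;

public class ShowJvmFlags {
    public static void main(String[] args) {
        // Prints every flag the current JVM was launched with, so you can
        // verify that the options from JVM_OPTS were actually picked up.
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            System.out.println(arg);
        }
    }
}
```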

Here's Marathon's behavior on the same cluster after GC tuning:

(screenshot: Marathon memory usage after GC optimization)

Tombart