We are using Mesos 1.20 + Marathon 1.4.3 to run Spark jobs. I am trying to use an algorithm to forecast each job's resource usage so that we can auto-scale up and down. I can see the dynamic resource usage per framework on the Mesos web page at http://:5050/#/agents/. However, from the endpoints it looks like I can only get the usage per slave, as described in the question linked below:

finding active framework current resource usage in mesos

Is there any way to get a snapshot of the resource usage of each framework through a Mesos endpoint?

I also tried the endpoint below on the Mesos slave, but it does not seem to return CPU/memory information per framework either.

http://agent-ip:5051/metrics/snapshot/slave(1)/monitor/statistics

{
  "slave/executors_terminated": 114751.0,
  "slave/tasks_finished": 63594.0,
  "slave/cpus_total": 8.0,
  "slave/executors_preempted": 0.0,
  "slave/cpus_percent": 1.0125,
  "slave/executors_running": 8.0,
  "slave/gpus_revocable_used": 0.0,
  "slave/invalid_status_updates": 256.0,
  "slave/executors_registering": 0.0,
  "slave/tasks_gone": 0.0,
  "slave/cpus_revocable_percent": 0.0,
  "slave/gpus_total": 0.0,
  "slave/tasks_killed": 50763.0,
  "slave/tasks_starting": 0.0,
  "slave/registered": 1.0,
  "slave/gpus_revocable_total": 0.0,
....
}

Thanks

Martin Peng

1 Answer

To gather this information you need to query each agent's /slave/monitor/statistics/ endpoint, collect the metrics of all executors, and group them by framework ID.

Here is a Diamond Mesos Collector that does this, but it collects data from a single agent only. You can aggregate across agents in your metric visualization tool, e.g. Grafana.
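As a rough illustration of the grouping step, here is a minimal Python sketch. It is not the Diamond collector's code. It assumes the agent returns a JSON list with one entry per executor, each carrying `framework_id`, `executor_id` and a `statistics` object with `cpus_user_time_secs`, `cpus_system_time_secs`, `mem_rss_bytes` and `timestamp`; check these field names against your own agent's output. The function names are made up for this example, and the URL path is the one reported as working in the comments below.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def fetch_executor_stats(agent_host, port=5051, timeout=5):
    """Fetch the per-executor statistics list from one agent.

    The path is an assumption taken from the comment thread; adjust it
    if your agent exposes the endpoint elsewhere.
    """
    url = f"http://{agent_host}:{port}/slave(1)/monitor/statistics"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def cpu_time(stats):
    """Total CPU seconds (user + system) consumed so far by one executor."""
    return stats.get("cpus_user_time_secs", 0.0) + stats.get("cpus_system_time_secs", 0.0)


def group_by_framework(executors):
    """Sum memory and count executors per framework for a single snapshot."""
    totals = defaultdict(lambda: {"mem_rss_bytes": 0, "executors": 0})
    for entry in executors:
        stats = entry.get("statistics", {})
        fw = totals[entry["framework_id"]]
        fw["mem_rss_bytes"] += stats.get("mem_rss_bytes", 0)
        fw["executors"] += 1
    return dict(totals)


def cpu_cores_used(prev, curr):
    """Estimate CPU cores used per framework between two snapshots.

    CPU time is a running counter, so usage is the change in CPU time
    divided by the change in the sample timestamp, summed per framework.
    """
    before = {(e["framework_id"], e["executor_id"]): e["statistics"] for e in prev}
    usage = defaultdict(float)
    for entry in curr:
        old = before.get((entry["framework_id"], entry["executor_id"]))
        if old is None:
            continue  # executor started after the first snapshot
        new = entry["statistics"]
        elapsed = new["timestamp"] - old["timestamp"]
        if elapsed <= 0:
            continue  # no new sample yet
        usage[entry["framework_id"]] += (cpu_time(new) - cpu_time(old)) / elapsed
    return dict(usage)
```

To use it, call `fetch_executor_stats` twice for the same agent a few seconds apart and pass both results to `cpu_cores_used`. Totals for the whole cluster require repeating this for every agent and adding up the per-framework values.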

janisz
  • Thanks @janisz! Should I connect to http://:5050/slave/monitor/staticstics? I connected to http://:5050/metrics/snapshot/slave/statistics and it looks like it still returns the statistics of the master. Is there anything I need to enable as well? – Martin Peng Mar 27 '18 at 16:40
  • My bad, I just realized the agent port is 5051. I was able to get the slave statistics from http://:5051/metrics/snapshot/slave(1)/monitor/statistics, but there is no framework-specific information there. Is there anything I am missing? Thanks! – Martin Peng Mar 27 '18 at 17:07
  • Thanks! Looks like I made another mistake. The correct URL should be http://:5051/slave(1)/monitor/statistics. I also read the code you sent, and it calculates the CPU usage from the CPU times. Are these numbers more accurate than the Mesos web page? The real-time CPU usage shown on the Mesos web page looks very small. – Martin Peng Mar 27 '18 at 17:56
  • It uses exactly the same technique as the Mesos UI, since it was strongly influenced by the UI code, so the numbers should be similar. – janisz Mar 27 '18 at 17:59
  • Thanks @janisz! Will collect the data and get back to you. – Martin Peng Mar 27 '18 at 21:01