
Which HTTP endpoint will help me find all the active frameworks' current resource utilization?

We want this information because we want to dynamically scale the Mesos cluster, and our algorithm needs to know what resources each active framework is using.

Alexander Farber

1 Answer

I think focusing on the frameworks is not really what you want here. What you're after is probably the Mesos slave utilization, which you can request by calling

http://{mesos-master}:5050/master/state-summary

In the JSON response, you'll find a `slaves` property which contains an array of slave objects:

{
    "hostname": "192.168.0.3",
    "cluster": "mesos-hw-cluster",
    "slaves": [{
        "id": "bd9c29d7-8530-4c5b-8c50-5d2f60dffbf6-S2",
        "pid": "slave(1)@192.168.0.1:5051",
        "hostname": "192.168.0.1",
        "registered_time": 1456826950.99075,
        "resources": {
            "cpus": 12.0,
            "disk": 1840852.0,
            "mem": 63304.0,
            "ports": "[31000-32000]"
        },
        "used_resources": {
            "cpus": 5.75,
            "disk": 0.0,
            "mem": 14376.0,
            "ports": "[31000-31000, 31109-31109, 31267-31267, 31699-31699, 31717-31717, 31907-31907, 31979-31980]"
        },
        "offered_resources": {
            "cpus": 0.0,
            "disk": 0.0,
            "mem": 0.0
        },
        "reserved_resources": {},
        "unreserved_resources": {
            "cpus": 12.0,
            "disk": 1840852.0,
            "mem": 63304.0,
            "ports": "[31000-32000]"
        },
        "attributes": {},
        "active": true,
        "version": "0.27.1",
        "TASK_STAGING": 0,
        "TASK_STARTING": 0,
        "TASK_RUNNING": 7,
        "TASK_FINISHED": 18,
        "TASK_KILLED": 27,
        "TASK_FAILED": 3,
        "TASK_LOST": 0,
        "TASK_ERROR": 0,
        "framework_ids": ["bd9c29d7-8530-4c5b-8c50-5d2f60dffbf6-0000", "bd9c29d7-8530-4c5b-8c50-5d2f60dffbf6-0002"]
    },
    ...
    ]
}

You could iterate over all the slave objects and calculate the overall resource usage by summing the `resources` and then subtracting the sum of the `used_resources`.
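A rough sketch of that aggregation (assuming Python with the `requests` library; the master address is just the one from the example above and should be adjusted):

```python
import requests

MASTER = "http://192.168.0.3:5050"  # assumed Mesos master address

# Fetch the state-summary and sum cluster-wide CPU/memory usage
# across all registered slaves.
summary = requests.get(MASTER + "/master/state-summary").json()

total = {"cpus": 0.0, "mem": 0.0}
used = {"cpus": 0.0, "mem": 0.0}

for slave in summary["slaves"]:
    for key in total:
        total[key] += slave["resources"][key]
        used[key] += slave["used_resources"][key]

for key in total:
    free = total[key] - used[key]
    print("%s: total=%.2f used=%.2f free=%.2f" % (key, total[key], used[key], free))
```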


Tobi
  • Thank you, Tobi. We are indeed taking the slave utilization and the overall cluster resource utilization into account. But we often observe that an active job isn't getting any resources for a long time because all the resources are being used by other jobs. In that case we want to find all the active frameworks that were not allocated any resources at all. A simple example: 3 Spark jobs and 2 slaves. 2 jobs took all the resources of the 2 slaves and the 3rd job didn't receive any offers for a long time; the load and CPU utilization of both slaves was also low, so there was no indication there either. – kovit nisar Mar 03 '16 at 09:16
  • Hmmm, I doubt that this is the correct way to tackle this. You should make sure that you limit the Spark executors' resources (see http://spark.apache.org/docs/latest/running-on-mesos.html#configuration, and the sketch after these comments). Otherwise Spark will simply take all the resources available... Have a look at the Mesos API docs; there are endpoints which should give you the info. But first I'd fix the Spark behaviour. – Tobi Mar 03 '16 at 09:51
  • To add: what you actually see when you run `top` on the slave is not necessarily what you see in the slave's `used_resources`. IMHO Spark grabs all of the memory and cores available, which doesn't necessarily mean the OS utilization is 100%. Still, what counts from the Mesos perspective is the resources taken by the Spark tasks... – Tobi Mar 03 '16 at 09:54
  • That makes sense. We can restrict Spark to use some maximum number of cores, but this Mesos cluster is in the cloud, so any authorized user can run tasks on it as well, and we cannot control those Spark properties for them. So we have to support such conditions. Basically this Mesos cluster supports both data-center and cloud workloads in a hybrid setup. Thank you Tobi, I will look into the Mesos HTTP endpoints link you gave and figure it out. – kovit nisar Mar 03 '16 at 19:27
  • I see... Maybe using something like [Spark Jobserver](https://github.com/spark-jobserver/spark-jobserver) would make sense for your use case, because IMHO this would allow you to control the context creation, and therefore also the configuration parameters. – Tobi Mar 04 '16 at 07:56
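A minimal sketch of capping a Spark job's footprint on Mesos, along the lines Tobi suggests (assuming PySpark; the master address and values are illustrative, not taken from the discussion above):

```python
from pyspark import SparkConf, SparkContext

# Illustrative values only: cap the total cores and per-executor memory so
# this framework leaves resources free for offers to other frameworks.
conf = (SparkConf()
        .setMaster("mesos://192.168.0.3:5050")  # assumed Mesos master address
        .setAppName("capped-job")
        .set("spark.cores.max", "4")            # max cores this framework may hold in total
        .set("spark.executor.memory", "2g"))    # memory per executor

sc = SparkContext(conf=conf)
# ... run the job; the remaining cluster resources stay available to others ...
```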