Why is GCP Dataproc's cluster autoscaling (with YARN as the resource manager) based on memory requests and NOT cores? Is this a limitation of Dataproc or of YARN, or am I missing something?
Reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling
Autoscaling configures Hadoop YARN to schedule jobs based on YARN memory requests, not on YARN core requests.
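If I'm reading this right, it mirrors YARN's default behavior, where the memory-only resource calculator places a container as soon as its memory request fits and never checks vCPU requests. A toy sketch of the difference as I understand it (plain Python of my own, not actual YARN code):

```python
# Toy illustration (my own sketch, not YARN source): with memory-only
# placement, a container "fits" on a node as soon as its memory request
# fits; vCPU requests are never checked, so CPU can be oversubscribed.

from dataclasses import dataclass

@dataclass
class Node:
    free_mem_mb: int
    free_vcores: int

def fits_memory_only(node: Node, mem_mb: int, vcores: int) -> bool:
    # Memory-only placement (how I understand YARN's
    # DefaultResourceCalculator to behave): vcores is ignored.
    return node.free_mem_mb >= mem_mb

def fits_memory_and_cpu(node: Node, mem_mb: int, vcores: int) -> bool:
    # Placement that also checks CPU (DominantResourceCalculator-style).
    return node.free_mem_mb >= mem_mb and node.free_vcores >= vcores

node = Node(free_mem_mb=4096, free_vcores=0)   # memory left, no cores left
print(fits_memory_only(node, 1024, 1))     # True  -> container placed anyway
print(fits_memory_and_cpu(node, 1024, 1))  # False -> container would wait
```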
Autoscaling is centered around the following Hadoop YARN metrics:
Allocated memory refers to the total YARN memory taken up by running containers across the whole cluster. If there are 6 running containers that can each use up to 1GB, there is 6GB of allocated memory.
Available memory is YARN memory in the cluster not used by allocated containers. If there is 10GB of memory across all node managers and 6GB of allocated memory, there is 4GB of available memory. If there is available (unused) memory in the cluster, autoscaling may remove workers from the cluster.
Pending memory is the sum of YARN memory requests for pending containers. Pending containers are waiting for space to run in YARN. Pending memory is non-zero only if available memory is zero or too small to allocate to the next container. If there are pending containers, autoscaling may add workers to the cluster.
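Putting the quoted numbers together, this is how I read the arithmetic and the resulting scaling signal (a rough sketch of the described behavior, not Dataproc's actual autoscaler code):

```python
# Worked example with the numbers from the quoted docs (my own sketch,
# not Dataproc's autoscaler code).

total_memory_gb = 10                # YARN memory across all node managers
allocated_gb = 6 * 1                # 6 running containers, up to 1GB each
available_gb = total_memory_gb - allocated_gb   # 10 - 6 = 4
pending_gb = 0   # nothing waits while available memory can satisfy requests

if pending_gb > 0:
    print("pending memory > 0 -> autoscaling may add workers")
elif available_gb > 0:
    print("available memory > 0 -> autoscaling may remove workers")
```

Note that vCPUs appear nowhere in these inputs, which is exactly what prompts my question.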