
I'm trying to create a session in Apache Spark using the Livy REST API. It fails with the following error: "User capacity has reached its maximum limit."
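For reference, the session is created roughly like this (the Livy host and port below are just placeholders for our actual endpoint):

import json
import requests

LIVY_URL = "http://livy-host:8998"   # placeholder for the real Livy server

# POST /sessions creates a new interactive session; no 'conf' is passed,
# so the cluster default values apply.
payload = {"kind": "pyspark"}
resp = requests.post(LIVY_URL + "/sessions",
                     data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
print(resp.status_code, resp.json())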

The same user is already running another Spark job. I don't understand which capacity has reached its maximum, or how to fix it by adjusting Spark conf parameters. Here is the log info that I think is relevant, reformatted to make it clearer:

22/05/30 19:18:51 INFO Client: Submitting application application_1653913029140_0247 to ResourceManager
22/05/30 19:18:51 INFO YarnClientImpl: Submitted application application_1653913029140_0247
22/05/30 19:18:51 INFO Client: Application report for application_1653913029140_0247 (state: ACCEPTED)
22/05/30 19:18:51 INFO Client: 
    client token: N/A
    diagnostics: [Mon May 30 19:18:51 -0300 2022] 
        Application is Activated, waiting for resources to be assigned for AM. User capacity has reached its maximum limit. 
        Details : AM Partition = <DEFAULT_PARTITION> ; 
        Partition Resource = <memory:2662400, vCores:234> ; 
        Queue's Absolute capacity = 32.0 % ; 
        Queue's Absolute used capacity = 40.76923 % ; 
        Queue's Absolute max capacity = 100.0 % ; 
        Queue's capacity (absolute resource) = <memory:851967, vCores:74> ; 
        Queue's used capacity (absolute resource) = <memory:1085440, vCores:106> ; 
        Queue's max capacity (absolute resource) = <memory:2662400, vCores:234> ; "
    ApplicationMaster host: N/A
    ApplicationMaster RPC port: -1
    queue: default
    start time: 1653949131433
    final status: UNDEFINED
    tracking URL: http://vrt1557.bndes.net:8088/proxy/application_1653913029140_0247/
    user: s-dtl-p01
22/05/30 19:18:51 INFO ShutdownHookManager: Shutdown hook called

The other running job configured some Spark parameters for high performance:

conf = {'spark.yarn.appMasterEnv.PYSPARK_PYTHON': 'python3',
        'spark.cores.max': 50,
        'spark.executor.memory': '10g',
        'spark.executor.instances': 100,
        'spark.driver.memory': '10g'}

The job that failed to start didn't configure any Spark parameters and is using the cluster default values.

Sure, I can tweak the Spark parameters of the job that is already running (see the sketch below) so it doesn't prevent the allocation of resources for the new job, but I'd like to understand what is happening. The queue configuration also has a lot of parameters that interact with the application.
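By tweaking I mean something along these lines on the running job (the values are purely illustrative), which would leave room in the queue for the new session:

# Illustrative values only -- the point is to request fewer/smaller executors
# so the first job stays within the queue's per-user share.
conf = {'spark.yarn.appMasterEnv.PYSPARK_PYTHON': 'python3',
        'spark.cores.max': 20,
        'spark.executor.memory': '4g',
        'spark.executor.instances': 20,
        'spark.driver.memory': '4g'}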

Which resource is exhausted? How do I discover it based on the log above?


1 Answer


This diagnostic is produced by the YARN CapacityScheduler when it determines that allocating the resources requested by the application would violate preset per-user limits. Here is the relevant piece from LeafQueue.java:

  :
  if (!userAssignable) {
    application.updateAMContainerDiagnostics(AMState.ACTIVATED,
        "User capacity has reached its maximum limit.");
    ActivitiesLogger.APP.recordRejectedAppActivityFromLeafQueue(
        activitiesManager, node, application, application.getPriority(),
        ActivityDiagnosticConstant.QUEUE_HIT_USER_MAX_CAPACITY_LIMIT);
    continue;
  }
  :

Hence, the queue-level metrics you cited are probably insufficient to identify which capacity limit is being breached. Perhaps you can enable DEBUG logging for the scheduler, and then look for one of the messages generated by the LeafQueue.canAssignToUser() method.
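That said, a rough cross-check of the numbers in the diagnostics (treating memory as the dominant resource, and assuming the default user-limit-factor of 1, which caps a single user at the queue's configured capacity) is already suggestive:

# Values copied from the diagnostics message (memory in MB)
partition_mem  = 2_662_400   # Partition Resource
queue_capacity =   851_967   # Queue's capacity (absolute resource)
queue_used     = 1_085_440   # Queue's used capacity (absolute resource)

print(queue_capacity / partition_mem * 100)  # ~32.0  -> "Absolute capacity = 32.0 %"
print(queue_used / partition_mem * 100)      # ~40.77 -> "Absolute used capacity = 40.76923 %"

# The queue already uses ~27% more memory than its configured capacity, so with
# user-limit-factor=1 a single user would not be granted anything more on this
# queue -- consistent with the message, though only the scheduler DEBUG logs
# will show the exact check that fails.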

  • Thanks for your answer. Can I activate the scheduler DEBUG logging using a parameter in spark-submit? Or is it some server configuration? Do you have any reference about how to activate it? – neves Jun 02 '22 at 16:15
  • Since the scheduler is a component of the YARN ResourceManager (RM) daemon, its log level can't be changed from the application side. Maybe there are other ways, but either following this (https://stackoverflow.com/questions/27853974/how-to-set-debug-log-level-for-resourcemanager) or editing `log4j.properties` itself to add something like `log4j.logger.org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler=DEBUG` should do it, I imagine. – mazaneicha Jun 02 '22 at 19:10