Let's say I want to keep creating a session for every Spark job that is submitted to YARN. Every connection has a unique user who keeps polling the status and posting statements to a session. How do I calculate how many active sessions Livy can have at any given time? Is it based on the Spark driver size (e.g. spark.driver.memory) that I configure? What other parameters are involved in this calculation?
1 Answer
YARN's scheduler decides how many ApplicationMaster (AM) containers can run at once, so Livy can only start sessions for accepted requests while the cluster (or standalone server) still has resources available; the rest wait in the queue (see the YARN scheduler documentation, and the back-of-envelope sketch after the config excerpt below). livy-client.conf should also be configured so that long-running jobs keep yielding a response.
livy-client.conf
# Time between status checks for a cancelled job
livy.rsc.job-cancel.trigger-interval = 100ms

# Time before a cancelled job is forced into the cancelled state
livy.rsc.job-cancel.timeout = 60m
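
To make the resource bound concrete, here is a minimal back-of-envelope sketch (not Livy's actual algorithm): in yarn-cluster mode each session's Spark driver is an AM container, so the ceiling is roughly min(allocatable memory / per-driver memory, allocatable vcores / per-driver cores). All the numbers below are illustrative assumptions, not values read from a real cluster.

# Back-of-envelope estimate, assuming yarn-cluster mode where each Livy
# session's Spark driver runs as a YARN AM container. Illustrative values only.
cluster_memory_mb = 256 * 1024   # total memory YARN can hand out across NodeManagers
cluster_vcores = 64              # total vcores YARN can hand out
driver_memory_mb = 2 * 1024      # spark.driver.memory per session
driver_overhead_mb = 384         # spark.driver.memoryOverhead (YARN rounds allocations up)
driver_cores = 1                 # spark.driver.cores per session

per_session_mb = driver_memory_mb + driver_overhead_mb

# A session is bounded by whichever resource runs out first; queue limits,
# yarn.scheduler.capacity.maximum-am-resource-percent, and the executors of
# busy sessions all shrink this ceiling further in practice.
max_sessions = min(cluster_memory_mb // per_session_mb,
                   cluster_vcores // driver_cores)
print(max_sessions)  # -> 64 here: vcores run out before memory does
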
Here is sample code; count the sessions whose state is "busy" in the output:
import requests

host = "http://{livy_host}:8998"  # replace {livy_host} with your Livy server
sessions = requests.get(host + "/sessions")
# Sample response body:
# b'{"from":0,"total":1,"sessions":[{"id":3,"appId":"application_1566223151385_0085","owner":null,"proxyUser":null,"state":"busy","kind":"pyspark","appInfo":{"driverLogUrl":"{livy_host}:8042/node/containerlogs/container_e182_1566223151385_0085_01_000001/mapr","sparkUiUrl":"{livy_host}:8088/proxy/application_1566223151385_0085/"},"log":[""]}]}'
busy = sum(session["state"] == "busy" for session in sessions.json()["sessions"])
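
If you want the remaining headroom rather than a static ceiling, the YARN ResourceManager REST API exposes cluster metrics you can combine with the busy-session count above. The sketch below is a hypothetical illustration: the {rm_host} placeholder and the 2048 MB per-driver figure are assumptions, not values Livy exposes itself.

import requests

# Ask the ResourceManager for current headroom and estimate how many more
# Livy drivers would fit before YARN starts queueing them.
rm_host = "http://{rm_host}:8088"  # replace {rm_host} with your ResourceManager
metrics = requests.get(rm_host + "/ws/v1/cluster/metrics").json()["clusterMetrics"]

driver_mb = 2048  # assumed spark.driver.memory (+ overhead) per session
headroom = metrics["availableMB"] // driver_mb
print(headroom)  # roughly how many more sessions can start right now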
