
I am working on submitting a Spark job using the Apache Livy batches POST method.

This HTTP request is sent using Airflow. After submitting the job, I track its status using the batch id.

I want to show the driver (client) logs in the Airflow logs, to avoid going back and forth between Airflow and Apache Livy/Resource Manager.

Is this possible using the Apache Livy REST API?

Ramdev Sharma
  • https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/ See track_statement_progress part – ookboy24 Jan 20 '19 at 06:16
  • Thank you. It was helpful when I started. I got the detail I was looking for from @kaxil's answer. Thanks. – Ramdev Sharma Jan 21 '19 at 16:15
  • 2
    @RamdevSharma - the livy log endpoint seems to be the logs for Livy submitting the batch, not the spark driver logs itself. Am I missing something? I'm trying to solve the same issue, expose spark logs in Airflow so I don't have to jump to EMR to debug. Thanks – alexP_Keaton Jul 16 '19 at 23:39
  • @alexP_Keaton were you able to solve this problem? The logs from Livy's log endpoint aren't the driver logs, and I'm not sure if there's a way (at least right now) of polling Spark driver logs through Livy. – sbrk May 08 '20 at 21:56

2 Answers


Livy has endpoints to get logs: /sessions/{sessionId}/log and /batches/{batchId}/log.

Documentation: https://livy.incubator.apache.org/docs/latest/rest-api.html
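
The log endpoints accept from and size query parameters (documented at the link above) to page through the output, so a quick check from the shell might look like this (host and port are illustrative):

curl "http://livy-server-IP:8998/batches/{batchId}/log?from=0&size=100"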

You can create Python functions like the ones shown below (wrapped in a small helper class here so the self references resolve) to get logs:

import json

from airflow.hooks.http_hook import HttpHook


class LivyBatchClient:
    """Thin wrapper around Airflow's HttpHook for calling the Livy REST API."""

    def __init__(self, http_conn_id):
        # http_conn_id refers to an HTTP Connection defined in the Airflow UI,
        # pointing at the Livy server, e.g. http://livy-server-IP:8998
        self.http = HttpHook("GET", http_conn_id=http_conn_id)

    def _http_rest_call(self, method, endpoint, data=None, headers=None, extra_options=None):
        if not extra_options:
            extra_options = {}

        self.http.method = method
        response = self.http.run(endpoint, json.dumps(data), headers, extra_options=extra_options)

        return response

    def _get_batch_session_logs(self, batch_id):
        method = "GET"
        endpoint = "batches/" + str(batch_id) + "/log"
        response = self._http_rest_call(method=method, endpoint=endpoint)
        # response.json() holds the batch id, the offset, and a "log" array of lines
        return response
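
As a minimal sketch of how this could surface the logs in the Airflow task log: the LivyBatchClient wrapper is the helper class above, while the connection id, batch id, and PythonOperator wiring are illustrative assumptions, not part of the original answer:

from airflow.operators.python_operator import PythonOperator

def print_livy_batch_logs(batch_id, **context):
    # "livy_http" is assumed to exist as an HTTP Connection in the Airflow UI,
    # pointing at the Livy server (e.g. http://livy-server-IP:8998)
    client = LivyBatchClient(http_conn_id="livy_http")
    response = client._get_batch_session_logs(batch_id)
    for line in response.json().get("log", []):
        print(line)  # printed lines show up in the Airflow task log

show_logs = PythonOperator(
    task_id="show_livy_logs",
    python_callable=print_livy_batch_logs,
    op_kwargs={"batch_id": 42},  # illustrative batch id
    provide_context=True,  # the callable must then accept **kwargs
    dag=dag,  # assumes a DAG object named dag is in scope
)

The **context catch-all matters when provide_context=True; without it you get errors like the unexpected keyword argument 'next_execution_date' mentioned in the comments below.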
kaxil
  • Thanks @kaxil, this is what I was looking for and somehow missed. Based on the endpoint's from and size inputs, I can show logs on each status check. – Ramdev Sharma Jan 21 '19 at 16:13
  • 3
    It seems livy stored batch logs in jvm without persisting on disk... I can't find any logs files. If livy does persist logs, where is the location? – Archon Oct 21 '19 at 03:29
  • @RamdevSharma would you be willing to share your code? I am stuck on http_conn_id=http_conn_id and don't know where that is coming from. Thank you! – user1983682 Jan 13 '20 at 22:20
  • @user1983682, I am sorry but I do not have the source code since I have moved off the project. You can look at Connections in the Airflow UI to create one for HTTP. – Ramdev Sharma Jan 14 '20 at 13:34
  • I understand. @kaxil would you have any idea why I would be receiving an error "ERROR - _get_batch_session_logs() got an unexpected keyword argument 'next_execution_date'" with this solution? – user1983682 Jan 15 '20 at 04:17
  • @user1983682 Happy to help, can you share your code please? – kaxil Jan 15 '20 at 10:50
  • Does anyone have any idea where Livy stores the batch details? If you query the batch details after the batch has completed, you don't get the status. – dileepVikram Jun 08 '20 at 18:34

Livy exposes its REST API in two ways: sessions and batches. In your case, since you are not using a session, you are submitting via batches. You can POST your batch with curl (the request body must at least name the application file to run):

curl -X POST -H "Content-Type: application/json" \
     -d '{"file": "<path-to-application-jar-or-py-file>"}' \
     http://livy-server-IP:8998/batches

Once you have submitted the job, you get the batch id in return. Then you can fetch its logs with:

curl http://livy-server-IP:8998/batches/{batchId}/log
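
The response is a JSON object containing the batch id, the offset, and a log array with the actual log lines. Roughly (field names and values below are illustrative and may vary slightly across Livy versions):

{
  "id": 42,
  "from": 0,
  "total": 3,
  "log": [
    "stdout: ",
    "INFO SparkContext: Running Spark version 2.4.0",
    "INFO SparkContext: Submitted application: MyApp"
  ]
}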

You can find the documentation at: https://livy.incubator.apache.org/docs/latest/rest-api.html

If you want to avoid the above steps, you can use a ready-made AMI (namely, LightningFlow) from the AWS Marketplace, which provides Airflow with a custom Livy operator. The Livy operator submits the job and tracks its status every 30 seconds (configurable), and it also surfaces the Spark logs at the end of the Spark job in the Airflow UI logs.

Note: LightningFlow comes pre-integrated with all required libraries, Livy, custom operators, and a local Spark cluster.

Link for AWS Marketplace: https://aws.amazon.com/marketplace/pp/Lightning-Analytics-Inc-LightningFlow-Integrated-o/B084BSD66V

This enables you to view consolidated logs in one place, instead of switching between Airflow and EMR/Spark logs (Ambari/Resource Manager).