
I launch a Dataproc cluster and serve Hive on it. From a remote machine I use PyHive or PyODBC to connect to Hive and run queries. It's not just one query; it can be a long session with intermittent queries. (The query itself has issues; I'll ask about that separately.)

Even while a single query is actively running, the operation does not show up as a "Job" (I guess that's YARN) on the dashboard. In contrast, when I "submit" tasks via PySpark, they show up as "Jobs".

Besides the lack of task visibility, I also suspect that, without a Job, the cluster may not reliably detect that a Python client is "connected" to it, so the cluster's auto-delete might kick in prematurely.

Is there a way to "register" a Job to accompany my Python session, and to cancel/delete the job at times of my choosing? In my case it would be a "dummy", "nominal" job that does nothing.

Or maybe there's a more proper way to let YARN detect my Python client's connection and create a job for it?

Thanks.

zpz

1 Answer


This is not supported right now; you need to submit jobs via the Dataproc Jobs API for them to be visible on the Jobs UI page and to be taken into account by the cluster TTL feature.

If you cannot use the Dataproc Jobs API to execute your actual jobs, then you can submit a dummy Pig job that sleeps for the desired time (5 hours in the example below) to prevent cluster deletion by the max idle time feature:

gcloud dataproc jobs submit pig --cluster="${CLUSTER_NAME}" \
    --execute="sh sleep $((5 * 60 * 60))"
Igor Dvorzhak
  • Thanks for this. I have not figured out how to use the Jobs API, but I have to use it. I can't use `gcloud`; this is integrated into a Python program, and it's not in a Spark setting. I think I'll send a Python script in a string that sleeps indefinitely, and I'll delete the job later as needed. If you happen to have a link to such an API usage example, please share! I'm looking at this API: https://github.com/googleapis/python-dataproc/blob/master/google/cloud/dataproc_v1/services/job_controller/client.py. I believe the `submit_job` method is what I'll use, but I haven't figured it out yet. – zpz Mar 14 '21 at 07:12
  • Figured it out. Submit this: `job = {'placement': {'cluster_name': cluster_name}, 'pig_job': {'query_list': {'queries': ['sh sleep 3600']}}}` – zpz Mar 14 '21 at 09:10
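
For reference, a fuller version of the approach from the last comment, as a rough, untested sketch using the google-cloud-dataproc client library; project_id, region and cluster_name are placeholders for your own values:

from google.cloud import dataproc_v1

# Placeholders: fill in your own project, region and cluster.
project_id = "my-project"
region = "us-central1"
cluster_name = "my-cluster"

# The Jobs API uses a regional endpoint that must match the cluster's region.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Same dummy job as above: a Pig job that just shells out to sleep (5 hours here),
# so the cluster's max idle time feature sees an active Dataproc Job.
job = {
    "placement": {"cluster_name": cluster_name},
    "pig_job": {"query_list": {"queries": ["sh sleep 18000"]}},
}

submitted = client.submit_job(project_id=project_id, region=region, job=job)
job_id = submitted.reference.job_id
print(f"Submitted placeholder job {job_id}")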
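
And to release the placeholder when the Python session is done (the "delete the job later as needed" part), the same client can cancel the sleep job and, once it reaches a terminal state, delete its record. Again just a sketch, reusing the names from the snippet above:

import time

# Ask Dataproc to stop the placeholder job; cancellation is asynchronous.
client.cancel_job(project_id=project_id, region=region, job_id=job_id)

# delete_job fails while the job is still active, so wait for a terminal state.
terminal_states = {
    dataproc_v1.JobStatus.State.CANCELLED,
    dataproc_v1.JobStatus.State.DONE,
    dataproc_v1.JobStatus.State.ERROR,
}
while True:
    current = client.get_job(project_id=project_id, region=region, job_id=job_id)
    if current.status.state in terminal_states:
        break
    time.sleep(5)

# Optional: remove the finished job from the project's job list.
client.delete_job(project_id=project_id, region=region, job_id=job_id)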