
I have a Spark job running on a Dataproc cluster. How do I configure the environment to debug it on my local machine with my IDE?

Igor Dvorzhak

1 Answer

This tutorial assumes the following:

  • You know how to create GCP Dataproc clusters, whether via API calls, Cloud Shell commands, or the web UI
  • You know how to submit a Spark Job
  • You have permissions to launch jobs, create clusters and use Compute Engine instances

After some attempts, I've figured out how to debug, from your local machine, a Dataproc Spark job running on a cluster.

As you may know, you can submit a Spark job by using the web UI, by sending a request to the Dataproc API, or with the gcloud dataproc jobs submit spark command. Whichever way you choose, you start by adding the following key-value pair to the properties field of the SparkJob: spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=REMOTE_PORT, where REMOTE_PORT is the port on the worker on which the driver will listen.
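As a concrete sketch of the submit step (cluster, project, region, class, and jar names below are placeholders, and the echo makes this a dry run): note that the JDWP agent string itself contains commas, which gcloud would otherwise treat as separators between different properties, so a custom delimiter (^#^) is used, as mentioned in the comments below.

```shell
REMOTE_PORT=9094   # port the driver JVM will listen on (arbitrary choice)
JDWP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=${REMOTE_PORT}"

# ^#^ switches the key/value delimiter from ',' to '#', so the commas
# inside the JDWP agent string are not split into separate properties.
PROPERTIES="^#^spark.driver.extraJavaOptions=${JDWP_OPTS}"

# Dry run: prints the command. Drop the 'echo' to actually submit.
# my-cluster, us-central1, com.example.MyJob, and the jar path are
# hypothetical; substitute your own values.
echo gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --class=com.example.MyJob \
  --jars=gs://my-bucket/my-job.jar \
  --properties="${PROPERTIES}"
```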

Chances are your cluster is on a private network and you need to create an SSH tunnel to the REMOTE_PORT. If that's not the case, you're lucky: you just need to connect to the worker in your IDE using its public IP and the specified REMOTE_PORT.

In IntelliJ it would look like this:

(Screenshot: IntelliJ remote debug configuration, "Debugging on public IP cluster")

where worker-ip is the worker that is listening (I used port 9094 this time). After a few attempts, I realized it's always worker number 0, but you can connect to it and check whether a process is listening using netstat -tulnp | grep REMOTE_PORT

If for whatever reason your cluster does not have a public IP, you need to set up an SSH tunnel from your local machine to the worker. After specifying your ZONE and PROJECT, create a tunnel to REMOTE_PORT:

gcloud compute ssh CLUSTER_NAME-w-0  --project=$PROJECT --zone=$ZONE  --  -4 -N  -L LOCAL_PORT:CLUSTER_NAME-w-0:REMOTE_PORT

Then set the debug configuration in your IDE to host=localhost (127.0.0.1) and port=LOCAL_PORT.
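Before wiring up the IDE, you can sanity-check the tunnel with the JDK's command-line debugger, jdb, which attaches over the same JDWP socket transport. A minimal sketch, assuming LOCAL_PORT=5005 (an arbitrary example); the echo makes it a dry run:

```shell
LOCAL_PORT=5005   # whatever local port you chose for the SSH tunnel

# Dry run: prints the command. Drop the 'echo' to actually attach;
# if the tunnel and the remote JDWP agent are up, jdb drops you into
# an interactive debugger prompt.
echo jdb -attach "localhost:${LOCAL_PORT}"
```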

  • I got this error when trying this method (using the Dataproc command line, adding the argument --properties spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=9094): ERROR: JDWP Non-server transport dt_socket must have a connection address specified through the 'address=' option – whatsnext Jul 28 '20 at 23:10
  • Had the same issue. The problem was that the comma is the default delimiter between different key-value pairs in properties. This behaviour can be changed by specifying a different delimiter (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties): `--properties ^#^spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=13337` – Maksim Gayduk Mar 28 '22 at 09:25