
Does anyone know what the following, from the FAQs (https://cloud.google.com/dataproc/docs/resources/faq), actually means?

"Data can be user encrypted in transit to and from a cluster, upon cluster creation or job submission."

I can find no options for enabling encryption during cluster creation. Does this basically mean we must configure all of the components ourselves to ensure communications are encrypted?

We are keen to understand whether Spark/Hive/Tez jobs use encrypted communication channels when executing a job, or when connecting to Hive via the JDBC connection.

Are there any existing initialization actions for this, or does this statement basically mean it's all up to you?

K2J

1 Answer


I assume that is talking about authentication/authorization/encryption when talking to GCP APIs. Importantly, if your data is in GCS or BigQuery, the data transfer is secured. Also, all communication with Dataproc's control plane (e.g. creation of clusters, submission of jobs) is secured.

You're correct that communication within the cluster is not secured, but it is essentially airgapped. Node-to-node communication happens over internal IPs on your isolated VPC network. Dataproc has guidance on how to configure firewall rules.
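For example, a firewall rule that restricts cluster traffic to internal sources might look something like this (the network name and source range are illustrative, not from the Dataproc docs):

```shell
# Illustrative only: allow node-to-node traffic within an assumed
# VPC network "dataproc-net", restricted to an assumed internal
# subnet range, so no external hosts can reach the cluster ports.
gcloud compute firewall-rules create allow-dataproc-internal \
    --network=dataproc-net \
    --allow=tcp:0-65535,udp:0-65535,icmp \
    --source-ranges=10.128.0.0/20
```

The key point is to scope `--source-ranges` to your own subnet rather than `0.0.0.0/0`.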

You can also use Dataproc private IP clusters to avoid having external IP addresses on the VMs.
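A sketch of creating such a cluster (cluster, region, and subnet names are placeholders):

```shell
# Illustrative only: create a cluster whose VMs get no external IPs.
# The chosen subnet must have Private Google Access enabled so the
# nodes can still reach the GCS/BigQuery APIs.
gcloud dataproc clusters create my-private-cluster \
    --region=us-central1 \
    --subnet=my-subnet \
    --no-address
```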

I'm not aware of any init action to set up Kerberos -- so yes you would have to DIY.

Karthik Palaniappan