6

How can I automatically install Python libraries on my Dataproc cluster when the cluster starts? This would save me the trouble of logging into the master and/or worker nodes to manually install the libraries I need.

It would be great to also know if this automated installation could install things only on the master and not the workers.


1 Answer

7

Initialization actions are the best way to do this. They are shell scripts that are run on each node when the cluster is created, which lets you customize the cluster, for example by installing Python libraries. These scripts must be stored in Google Cloud Storage and can be used when creating clusters via the Google Cloud SDK or the Google Developers Console.
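For example (a minimal sketch; the script, bucket, cluster, and region names are placeholders), you could upload the script shown below to Cloud Storage and point to it when creating the cluster with the Cloud SDK:

# Upload the initialization action to a bucket your project can read.
gsutil cp install-pandas.sh gs://my-bucket/init/install-pandas.sh

# Create a cluster that runs the script on every node during startup.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-bucket/init/install-pandas.sh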

Here is a sample initialization action that installs the Python pandas package at cluster creation time, only on the master node.

#!/bin/bash
# Determine this node's role from instance metadata.
ROLE=$(/usr/share/google/get_metadata_value attributes/role)
# Install pandas only on the master node; workers are left untouched.
if [[ "${ROLE}" == 'Master' ]]; then
  apt-get install -y python-pandas
fi

As you can see from this script, it is possible to discern the role of a node with `/usr/share/google/get_metadata_value attributes/role` and then perform actions specifically on the master (or worker) nodes.
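As noted in the comments below, on newer Dataproc image versions the attribute is exposed as dataproc-role rather than role, so a variant of the same check under that assumption would be:

#!/bin/bash
# Same master-only install, but reading the dataproc-role attribute
# mentioned in the comments below.
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  apt-get install -y python-pandas
fi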

See the Google Cloud Dataproc documentation for more details.

  • Dataproc documentation is a bit out of date. You need to run `/usr/share/google/get_metadata_value attributes/dataproc-role` to get the string "Master". Command `/usr/share/google/get_metadata_value attributes/` gives a list of available attributes. – jacekbj Aug 01 '16 at 09:42
  • Update from the github/googleapis/python-dataproc repo [link](https://github.com/googleapis/python-dataproc/blob/main/google/cloud/dataproc_v1/types/clusters.py): now you can also access it with `curl -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/dataproc-role` – Cir02 Dec 20 '22 at 12:45