
I understand Google Dataproc clusters support initialization actions, which are executed on every node when it is created. However, this is only reasonable for small actions and does not work well for nodes that need a large number of dependencies and software installs for big pipelines. So I was wondering: is there any way to launch nodes from a custom image, or to have nodes come up from an image that already has everything installed, so the downloads don't have to be repeated again and again?

1 Answer


Good question.

As you note, initialization actions are currently the canonical way to install software on clusters when they are created. If you have a ton of dependencies, or need to do things like compile from source, those initialization actions may take a while.

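As a rough sketch of what that looks like today, here is how an initialization action can be attached at cluster creation using the google-cloud-dataproc Python client. The script path gs://my-bucket/install-deps.sh, project, region, machine types, and cluster name are placeholders, not anything from your setup:

    # Sketch: create a cluster whose nodes run an initialization action at boot.
    # Assumes the google-cloud-dataproc client library is installed; the script
    # path, project, region, and cluster name below are placeholders.
    from google.cloud import dataproc_v1 as dataproc

    region = "us-central1"
    cluster_client = dataproc.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": "my-project",
        "cluster_name": "my-pipeline-cluster",
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
            # Each node downloads and runs this script when it is created.
            "initialization_actions": [
                {
                    "executable_file": "gs://my-bucket/install-deps.sh",
                    "execution_timeout": {"seconds": 600},
                }
            ],
        },
    }

    operation = cluster_client.create_cluster(
        request={"project_id": "my-project", "region": region, "cluster": cluster}
    )
    print(operation.result().cluster_name)
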
We have support for a better method to handle customizations on our long-term roadmap. This may be via custom images or some other mechanism.

In the interim, scaling clusters up/down instead of recreating them may provide some relief if you want to keep the customizations in place and split the difference between boot time and the persistence of your cluster. Likewise, if precompiled packages are available, they always save time over compiling from source.

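For example, resizing an existing cluster's worker pool with the same Python client is roughly the sketch below; again, the project, region, cluster name, and worker count are placeholders, and only newly added workers re-run the initialization actions:

    # Sketch: resize an existing cluster instead of recreating it, so the
    # initialization work done on surviving nodes is kept. Names are placeholders.
    from google.cloud import dataproc_v1 as dataproc

    region = "us-central1"
    cluster_client = dataproc.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    operation = cluster_client.update_cluster(
        request={
            "project_id": "my-project",
            "region": region,
            "cluster_name": "my-pipeline-cluster",
            # Only workers added by this resize will run the initialization
            # actions again; existing nodes keep their installed software.
            "cluster": {"config": {"worker_config": {"num_instances": 5}}},
            "update_mask": {"paths": ["config.worker_config.num_instances"]},
        }
    )
    operation.result()
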
James