Update:
We added resize_env.sh to the base bdutil repo, so you no longer need to use my fork for it.
Original answer:
There isn't official support for resizing a bdutil-deployed cluster just yet, but it's something we've discussed before, and it's fairly straightforward to put together basic resize support. This may take a different form once merged into the master branch, but I've pushed a first draft of resize support to my fork of bdutil. It was implemented across two commits: one to allow skipping all "master" operations (including create, run_command, delete, etc.) and another to add the resize_env.sh file.
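For orientation, the heart of resize_env.sh is just an override of the worker count; here's an illustrative sketch (not the actual file contents, which live in the repo and also wire up the master-skipping behavior):
# resize_env.sh -- illustrative sketch only; see the bdutil repo for the real file.
# Total number of workers you want after the resize; only the workers beyond the
# original NUM_WORKERS get created, and "master" operations (create, run_command,
# delete, etc.) are skipped so the existing master and workers are left alone.
NEW_NUM_WORKERS=5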
I haven't tested it against all combinations of other bdutil extensions, but I've at least run it successfully with the base bdutil_env.sh and with extensions/spark/spark_env.sh. In theory it should work fine with your bigquery and datastore extensions as well. To use it in your case:
# Assuming you initially deployed with this command (default n == 2)
./bdutil -e bigquery_env.sh,datastore_env.sh,extensions/spark/spark_env.sh -b myhdfsbucket -n 2 deploy
# Before this step, edit resize_env.sh and set NEW_NUM_WORKERS to what you want.
# Currently it defaults to 5.
# Deploy only the new workers, e.g. {hadoop-w-2, hadoop-w-3, hadoop-w-4}:
./bdutil -e bigquery_env.sh,datastore_env.sh,extensions/spark/spark_env.sh -b myhdfsbucket -n 2 -e resize_env.sh deploy
# Explicitly start the Hadoop daemons on just the new workers:
./bdutil -e bigquery_env.sh,datastore_env.sh,extensions/spark/spark_env.sh -b myhdfsbucket -n 2 -e resize_env.sh run_command -t workers -- "service hadoop-hdfs-datanode start && service hadoop-mapreduce-tasktracker start"
# If using Spark as well, explicitly start the Spark daemons on the new workers:
./bdutil -e bigquery_env.sh,datastore_env.sh,extensions/spark/spark_env.sh -b myhdfsbucket -n 2 -e resize_env.sh run_command -t workers -u extensions/spark/start_single_spark_worker.sh -- "./start_single_spark_worker.sh"
# From now on, it's as if you originally turned up your cluster with "-n 5".
# When deleting, remember to include those extra workers:
./bdutil -b myhdfsbucket -n 5 delete
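Once the new daemons are up (i.e., after the run_command steps above, and before any delete), you can optionally confirm that the namenode sees the new datanodes. Something along these lines should work, assuming run_command accepts -t master the same way it accepts -t workers, and that the daemons run as the hadoop user (as the /home/hadoop paths below suggest):
# Optional sanity check: ask the namenode how many datanodes it now sees.
./bdutil -b myhdfsbucket -n 5 run_command -t master -- "sudo -u hadoop /home/hadoop/hadoop-install/bin/hadoop dfsadmin -report | grep 'Datanodes available'"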
In general, the best-practice recommendation is to condense your configuration into a file instead of always passing flags. For example, in your case you might want a file called my_base_env.sh:
import_env bigquery_env.sh
import_env datastore_env.sh
import_env extensions/spark/spark_env.sh
NUM_WORKERS=2
CONFIGBUCKET=myhdfsbucket
Then the resize commands are much shorter:
# Assuming you initially deployed with this command (NUM_WORKERS=2 comes from my_base_env.sh)
./bdutil -e my_base_env.sh deploy
# Before this step, edit resize_env.sh and set NEW_NUM_WORKERS to what you want.
# Currently it defaults to 5.
# Deploy only the new workers, e.g. {hadoop-w-2, hadoop-w-3, hadoop-w-4}:
./bdutil -e my_base_env.sh -e resize_env.sh deploy
# Explicitly start the Hadoop daemons on just the new workers:
./bdutil -e my_base_env.sh -e resize_env.sh run_command -t workers -- "service hadoop-hdfs-datanode start && service hadoop-mapreduce-tasktracker start"
# If using Spark as well, explicitly start the Spark daemons on the new workers:
./bdutil -e my_base_env.sh -e resize_env.sh run_command -t workers -u extensions/spark/start_single_spark_worker.sh -- "./start_single_spark_worker.sh"
# From now on, it's as if you originally turned up your cluster with "-n 5".
# When deleting, remember to include those extra workers:
./bdutil -b myhdfsbucket -n 5 delete
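Rather than remembering -n 5 from then on, you can also just bump NUM_WORKERS in my_base_env.sh once the resize is done, so later invocations (including delete) match the real cluster size without extra flags:
# In my_base_env.sh, after the resize has completed:
NUM_WORKERS=5
# ...after which the delete step above is simply:
./bdutil -e my_base_env.sh delete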
Finally, this isn't quite 100% the same as if you'd deployed the cluster with -n 5 initially; in this case the files /home/hadoop/hadoop-install/conf/slaves and /home/hadoop/spark-install/conf/slaves on your master node will be missing your new nodes. If you ever plan to use /home/hadoop/hadoop-install/bin/[stop|start]-all.sh or /home/hadoop/spark-install/sbin/[stop|start]-all.sh, you can SSH into your master node and edit those files to add the new nodes to the lists; if not, there's no need to change the slaves files.