
I need to set up a Hadoop cluster on Google Compute Engine. While it seems straightforward, either using the web console's Click&Deploy or via the command-line tool bdutil, my concern is that my jobs require additional dependencies on the machines, for instance Xvfb, Firefox, and others, though all installable via apt-get.

It's not clear to me which is the best way to go. The options that come to mind are:

1) I create a custom image with the additional packages and use it to deploy the Hadoop cluster, either via bdutil or Click&Deploy. Would that work?

2) Use a standard image and bdutil with a custom configuration file (editing an existing one) to perform all the `sudo apt-get install xxx` steps. Is this a viable option?

Option 1) is basically what I had to do in the past to run Hadoop on AWS, and honestly it's a pain to maintain. I'd be more than happy with option 2), but I'm not sure bdutil allows that.

Do you see any other way to set up the Hadoop cluster? Any help is appreciated!

legrass

1 Answer


bdutil is in fact designed to support custom extensions; you can certainly edit an existing one for an easy way to get started, but the recommended best practice is to create your own "_env.sh" extension, which can be mixed in with other bdutil extensions if necessary. This way you can more easily merge any updates Google makes to core bdutil without worrying about conflicts with your customizations. You only need to create two files, for example:

File with shell commands:

# install_my_custom_tools.sh

# Shell commands to install whatever you want
# (note: the Debian/Ubuntu package name is lowercase "xvfb")
apt-get -y install xvfb

File referencing the commands file which you'll plug into bdutil:

# my_custom_tools_env.sh

COMMAND_GROUPS+=(
  "install_my_custom_tools_group:
     install_my_custom_tools.sh
  "
)

COMMAND_STEPS+=(
  # Each entry takes the form '<group to run on the master>,<group to run on the workers>'
  'install_my_custom_tools_group,install_my_custom_tools_group'
)

Then, when running bdutil you can simply mix it in with the -e flag:

./bdutil -e my_custom_tools_env.sh deploy
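
The same `-e` flag applies to bdutil's other lifecycle commands; for instance, tearing down a cluster deployed this way (assuming the standard `delete` command) would look like:

./bdutil -e my_custom_tools_env.sh delete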

If you want to organize helper scripts into multiple files, you can easily list more shell scripts within a single COMMAND_GROUP:

COMMAND_GROUPS+=(
  "install_my_custom_tools_group:
     install_my_custom_tools.sh
     my_fancy_configuration_script.sh
  "
)
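
As a hedged illustration only, `my_fancy_configuration_script.sh` could hold any extra per-node setup; the contents below are hypothetical and simply tie back to the Xvfb/Firefox dependencies mentioned in the question:

# my_fancy_configuration_script.sh (hypothetical contents)

# Start a virtual framebuffer so GUI-dependent tools (e.g. Firefox)
# can run headlessly on the cluster nodes.
Xvfb :1 -screen 0 1024x768x16 &
echo 'export DISPLAY=:1' > /etc/profile.d/xvfb_display.sh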

If you want something to run only on the master, simply provide `*` as the second argument within COMMAND_STEPS:

COMMAND_GROUPS+=(
  "install_my_custom_tools_group:
     install_my_custom_tools.sh
  "
  "install_on_master_only:
     install_fancy_master_tools.sh
  "
)
COMMAND_STEPS+=(
  'install_my_custom_tools_group,install_my_custom_tools_group'
  'install_on_master_only,*'
)

When using these, you can still easily mix with other env files, for example:

./bdutil -e my_custom_tools_env.sh -e extensions/spark/spark_env.sh deploy

For files residing in the same directory as bdutil or under the extensions directory, you can also use a shorthand notation, only specifying the file basename without the _env.sh suffix:

./bdutil -e my_custom_tools -e spark deploy
Dennis Huo
  • Very informative, this is how documentation should be written. Is it also via bdutil that Hadoop parameters, e.g. `mapreduce.job.map` and so on, are set? – legrass Jan 16 '15 at 21:15
  • Right, the configuration values are a little trickier to customize. If you look inside bdutil-dir/conf/hadoop*, you'll see files like `mapred-template.xml`. The easiest way to make customizations is just to edit those files inline; any keys you put in `mapred-template.xml` will get mixed into `mapred-site.xml` on the cluster, `core-template.xml` into `core-site.xml`, etc. (a sketch of such an edit appears after these comments). You'll just need to be more careful merging any changes in core bdutil for version upgrades. For more advanced usage, bdutil-dir/libexec/configure_hadoop.sh is a good example to look at; look for the `bdconfig` commands for examples. – Dennis Huo Jan 16 '15 at 21:33
  • I see, it can be tricky indeed. I ran Hadoop on AWS and I could set all the parameters via Cloudera Manager. I'm surprised that Google Compute Engine has no similar web interface for Hadoop. As I'm asking [here](http://stackoverflow.com/questions/27994029/hadoop-on-google-compute-engine-management-console), is there a way to get Cloudera Manager or similar on GCE? – legrass Jan 16 '15 at 22:51
  • There's some work and collaboration in progress; see here for the latest bdutil work that hasn't been rolled into any formal release quite yet: https://github.com/GoogleCloudPlatform/bdutil/tree/platform_extensions/platforms - there are still some expected bug fixes before it'll be fully suitable for usage, and there will be more comprehensive instructions for using platform plugins once formally released, but in the meantime you can experiment with platforms/hdp/ambari_env.sh – Dennis Huo Jan 17 '15 at 04:35
  • Great! What image is best for Ambari? Ubuntu 14? – legrass Jan 17 '15 at 17:19
  • In the case of the Ambari bdutil extension, there's a dependency on using CentOS, and ambari_env.sh indeed overrides the default GCE_IMAGE to use centos-6. If you need to customize an image, you'll want to start with a GCE centos-6 image; otherwise, leaving the default GCE_IMAGE setting is your best bet for getting something that has been well-tested. – Dennis Huo Jan 17 '15 at 22:25
  • I was able to deploy with ambari_env.sh following the tutorial on the bdutil git page. However, because I need Hadoop 1 for some old jobs, I tried to install the HDP 1.3 stack by setting the variable AMBARI_STACK_VERSION='1.3' in ambari.conf. The process got stuck at `Invoking on master: ./install-ambari-components.sh Waiting on async 'ssh' jobs to finish. Might take a while...`. Is it possible that AMBARI_SERVICES also needs to be changed to install stack 1.3? – legrass Jan 29 '15 at 13:18
  • Looking at install-ambari-components_deploy.stdout it says: `Provisioning ambari cluster. { "status" : 400, "message" : "Unable to update configuration property with topology information. Component 'JOBTRACKER' is not mapped to any host group or is mapped to multiple groups."}` while install-ambari-components_deploy.stderr shows a loop printing `ambari_wait status: curl: no URL specified! curl: try 'curl --help' or 'curl --manual' for more information` – legrass Jan 29 '15 at 13:19
  • I have opened an [issue](https://github.com/GoogleCloudPlatform/bdutil/issues/10) – legrass Jan 29 '15 at 13:32
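
Following up on the `mapred-template.xml` discussion in the comments above: a minimal sketch of the kind of entry you might add to the `mapred-template.xml` under bdutil-dir/conf/ so that it gets merged into `mapred-site.xml` on the cluster (the property name and value here are only illustrative):

<property>
  <name>mapreduce.job.maps</name>
  <value>20</value>
</property>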