8

I am in the planning phase of a multi-node Hadoop cluster in a Docker-based environment, so it should be built on a lightweight, easy-to-use virtualized system. The current architecture (according to the documentation) contains 1 master and 3 slave nodes. The host machine uses the HDFS filesystem and KVM for virtualization. The whole cloud is managed by Cloudera Manager. Several Hadoop modules are installed on this cluster, and there is also a NodeJS data upload service. This time I need to make the architecture Docker-based. I have read several tutorials and have formed some opinions, but I also have open questions.

A. Do you think https://github.com/Lewuathe/docker-hadoop-cluster is a good base for my project? I have also found an official image, but it is single-node.

B. How will the system requirements change if I want to run this in a single container? That would be great, because this architecture should work in different locations, and changes could then be easily transferred between them. Synchronization between these so-called clones would be important.

C. Do you have any other ideas, or best practices to suggest?

Paul Verest

3 Answers

1

As of September 2016 there is no quick answer.

https://github.com/Lewuathe/docker-hadoop-cluster does not seem like a good starting point, because the base you choose should also be general enough to cover your option B.

Keep an eye on https://github.com/sequenceiq/hadoop-docker and https://github.com/kiwenlau/hadoop-cluster-docker
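
Whichever image you end up with, the 1 master + 3 slave layout from the question could be described along these lines. This is only a rough sketch: the image name `example/hadoop-node:latest`, the `hadoop` network name, and the `build_compose` helper are all hypothetical and not taken from any of the repositories above. It prints a Compose-style service map as JSON, which should be readable as YAML because JSON is a subset of YAML.

```python
import json

# Hypothetical image name; substitute whichever base image you settle on.
IMAGE = "example/hadoop-node:latest"


def build_compose(num_workers: int = 3, image: str = IMAGE) -> dict:
    """Return a Compose-style mapping with one master and N slave services."""
    # The master is always present and sits on a shared network.
    services = {
        "master": {
            "image": image,
            "hostname": "master",
            "networks": ["hadoop"],
        }
    }
    # Each slave gets its own hostname and is started after the master.
    for i in range(1, num_workers + 1):
        name = f"slave{i}"
        services[name] = {
            "image": image,
            "hostname": name,
            "networks": ["hadoop"],
            "depends_on": ["master"],
        }
    # "hadoop" is an assumed network name shared by all nodes.
    return {"services": services, "networks": {"hadoop": {}}}


if __name__ == "__main__":
    # Print the default layout: 1 master and 3 slaves.
    print(json.dumps(build_compose(), indent=2))
```

Changing `num_workers` is the only thing needed to grow or shrink the sketch; the actual Hadoop configuration inside each container still depends on the image you pick.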

Paul Verest
0

To address your question C., you may want to check out BlueData's software platform: http://www.bluedata.com/blog/2015/06/docker-containers-big-data-clusters

It's designed to run multi-node Hadoop clusters in a Docker-based environment and there is a free version available for download (you can also run it in an AWS EC2 instance).

BlueData
  • 16 GB of RAM on a laptop? That is quite a lot, so I should use the Amazon Machine Image instead. –  Jan 27 '16 at 09:37
  • That's right - it requires a pretty beefy machine. We recommend 16GB RAM, but you'd need at least 10GB of dedicated RAM to run a minimum multi-node configuration (e.g. a two-node cluster of a single Hadoop distribution) or multiple distributions on your laptop. – BlueData Jan 27 '16 at 19:19
  • But as you point out, you can use the Amazon Machine Image instead. – BlueData Jan 27 '16 at 19:28
-1

This work has already been done for you, actually:

https://hub.docker.com/r/cloudera/clusterdock/

It includes a pre-packaged multi-node CDH cluster, with Cloudera Manager as an optional component for cluster management and related tasks.

Justin Kestelyn