I have just started out with Apache Hadoop, so my first goal is to get a "hello world" app running. The first task is always to set up the development environment and be able to compile code. More specifically, I am trying to compile the classes found here. These files represent a simple MapReduce job, part of a book on Hadoop.

The book author used hadoop-client as a dependency (source), but since there are so many artifacts - which I shall return to - I wonder whether I could use another, smaller dependency instead. I always try to "import" or depend on only the most minimal set of artifacts and types.

The book author has not (yet) touched on which artifacts Hadoop distributes, or why I would use one over another. Hadoop's website and the rest of the Internet don't seem to bother with this little "detail" either. Some SO threads have sort of touched on this before (see this and that), and the answers there offer opinions on which artifact "should" be declared as a dependency to get the specific code in question to compile.

This is not my question. Getting my code to compile is rather "easy" and already accomplished. I am trying to figure out what artifacts exist and which ones I should use when. How would I know to go from Java type A to binary artifact dependency B? Most importantly, where is all of this documented?

For starters, what build artifacts exist?

Well, according to this page, there are these:

hadoop-client
hadoop-client-api
hadoop-client-minicluster
hadoop-client-runtime
hadoop-hdfs-client
hadoop-hdfs-native-client
hadoop-mapreduce-client-app
hadoop-mapreduce-client-common
hadoop-mapreduce-client-core
hadoop-mapreduce-client-jobclient
hadoop-mapreduce-client-nativetask
hadoop-yarn-client

But according to JCenter, there are about five million more. In particular, about four million nine hundred ninety-nine thousand nine hundred ninety-nine of these have the word "client" in them. Mighty confusing!

Working off of the list from Hadoop, I could simply test what works and what does not. To get all of the imports used in the classes provided by my book to resolve, each of the following worked (a sketch of the kind of build file used for this trial and error follows the list):

hadoop-client
hadoop-client-api
hadoop-client-minicluster
hadoop-client-runtime
hadoop-mapreduce-client-app
hadoop-mapreduce-client-nativetask
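
Concretely, "testing" here just meant swapping one dependency coordinate in an otherwise minimal build file and recompiling; something like the sketch below (the Gradle Groovy DSL, the java plugin and the 3.3.6 version are assumptions for illustration, not taken from the book):

// build.gradle - a sketch, not the book's actual build file.
plugins {
    id 'java'
}

repositories {
    mavenCentral()
}

dependencies {
    // Swap this one coordinate per attempt: hadoop-client, hadoop-client-api,
    // hadoop-mapreduce-client-app, and so on.
    implementation group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.3.6'
}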

The ones I left out did not work, to varying degrees: some could not resolve all of the imports, some could resolve only parts of them.

My personal bet here - if I want to depend on as little crap as possible - is to use hadoop-mapreduce-client-app. But it bugs the hell out of me to have to resort to this guerrilla warfare just to get the most mundane "hello world" app working. I don't want to know how many tears I will shed in the future when I really get down and dirty with Hadoop.

There has to be something I am missing!

Martin Andersson

1 Answer

I suggest you just use Maven/Gradle to transitively pull in everything you need.

If all you want is the MapReduce dependencies, this has worked fine for me in Gradle:

implementation group: 'org.apache.hadoop', name: 'hadoop-client', version: "2.8.5"

This one is an aggregator POM, which has compile dependencies on several other libraries.
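
If you want to see exactly what it pulls in on your classpath, Gradle can tell you. Below is a small sketch of a helper task (the task name is made up, and it assumes the standard java plugin is applied) that lists every jar on the resolved compile classpath; the built-in dependencies task (./gradlew dependencies --configuration compileClasspath) prints the same information as a tree.

// Sketch: list every jar that hadoop-client drags onto the compile classpath.
// Run with: ./gradlew printCompileClasspath
tasks.register('printCompileClasspath') {
    doLast {
        // Resolves the compile classpath and prints each jar's file name.
        configurations.compileClasspath.each { println it.name }
    }
}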

OneCricketeer