I am new to MapReduce. I started with the simple word-count example.
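For reference, the job is essentially the canonical Hadoop word count. A minimal sketch using the standard org.apache.hadoop.mapreduce API (class and field names here are illustrative, not my exact code):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the per-word counts emitted by the mappers.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```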

Using the Eclipse IDE, I created a simple Java Maven project, added the MapReduce dependencies, compiled my program into a JAR, copied it over to the Cloudera CDH VM, and executed it with dummy input data. Once I was satisfied it was running successfully, I took that JAR to my AWS EMR environment and ran it there against a larger (production) dataset.
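The JAR's entry point is a standard driver class, launched with something like hadoop jar wordcount.jar WordCountDriver <input> <output>. A minimal sketch, assuming the mapper/reducer classes above (the driver class name, combiner choice, and paths are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        // Summing is associative, so the reducer doubles as a combiner.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. dummy input on the CDH VM
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```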

So: Eclipse is my IDE, the Cloudera CDH VM is my dev environment, and AWS EMR is my production environment.

This setup works fine for a small project like word count, but the bigger my MapReduce projects get, the more cumbersome it becomes to transport JAR files between environments, which makes iterative development very tedious.

I was wondering if this environment setup can be tuned, revamped, or scrapped and rebuilt to make it more suitable for iterative, large-scale MapReduce development projects.

Any help/tips appreciated. Thank you.

1 Answer

Not much has changed since I asked this question. I haven't found a good alternative to manually copying JAR files to the Hadoop execution environment. Also see this: Running MapReduce jobs on AWS-EMR from Eclipse
