
I have 100 GB of JSON files, each line of which looks like this:

{"field1":100, "field2":200, "field3":[{"in1":20, "in2":"abc"},{"in1":30, "in2":"xyz"}]}

(It's actually a lot more complicated, but this will do as a small demo.)

I want to process them into something where each line looks like this:

{"field1":100, "field2":200, "abc":20, "xyz":30}

Being extremely new to Hadoop, I just want to know if I'm on the right path:

Referring to this: http://www.glennklockwood.com/di/hadoop-streaming.php. For a conventional application I'd create a mapper and a reducer in Python and execute them using something like:

hadoop \
   jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
   -mapper "python $PWD/mapper.py" \
   -reducer "python $PWD/reducer.py" \
   -input "wordcount/mobydick.txt"   \
   -output "wordcount/output"

Now let me know if I'm on the right track:

Since I just need to parse a lot of files into another form, I suppose I don't need any reduce step. I can simply write a mapper which:

  1. Takes input from stdin
  2. Reads stdin line by line
  3. Transforms each line according to my specifications
  4. Outputs into stdout

Then I can run Hadoop with just a mapper and 0 reducers.
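
If that's right, I assume the mapper-only run would look something like the command above, minus the reducer (the -numReduceTasks 0 flag and the input/output paths here are my guesses for my setup):

    hadoop \
       jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
       -mapper "python $PWD/mapper.py" \
       -numReduceTasks 0 \
       -input "json/input" \
       -output "json/output"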

Does this approach seem correct? Will I actually be using the cluster properly, or would this be as bad as running the Python script on a single host?

user1265125

1 Answer


You are correct: in this case you don't need a reducer. The output of your mapper is directly what you want, so you should set the number of reducers to 0. When you give Hadoop the input path where your JSON data lives, it automatically feeds each mapper a chunk of the input lines; your mapper processes them and emits the results to the context, which Hadoop then stores in the output path. The approach is correct, and this task is 100% parallelizable, so if you have more than one machine in your cluster and your configuration is correct, it will take full advantage of the cluster and run much faster than on a single host.
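
In streaming terms, "emitting to the context" simply means writing a line to stdout; Hadoop collects whatever the mappers print into files under the output path. A minimal sketch of such a mapper (the flattening logic is inferred from your example; skipping blank or malformed lines is just a safeguard):

    #!/usr/bin/env python
    import json
    import sys

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip malformed lines instead of failing the whole task
        # Keep the scalar fields, flatten field3 into "in2": in1 pairs.
        out = {k: v for k, v in record.items() if k != "field3"}
        for item in record.get("field3", []):
            out[item["in2"]] = item["in1"]
        sys.stdout.write(json.dumps(out) + "\n")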

Balduz
  • That's great! But what form will the output be in? If I give it a location with 2000 files as input, will the mapper output written to stdout automatically be saved as 2000 files? I'll try this out tomorrow, but I'd like a decent idea of what goes on in the process. – user1265125 Aug 27 '14 at 18:36
  • The number of output files depends on the number of reducers, since Hadoop creates one file per reducer. However, if you set it to 0 reducers, then it depends on the number of mappers. If you want everything in a single file, use 1 reducer, the IdentityReducer in this case, which simply passes the mapper output through as the reducer output. – Balduz Aug 27 '14 at 18:38
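
As the last comment notes, forcing everything through a single reducer yields one output file. A possible invocation for that, assuming the Hadoop 1.x IdentityReducer class and the same streaming jar as in the question:

    hadoop \
       jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
       -mapper "python $PWD/mapper.py" \
       -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
       -numReduceTasks 1 \
       -input "json/input" \
       -output "json/output"

Note that once a reducer is involved, the mapper output is sorted by key during the shuffle, so the order of the records in the single output file may differ from the input order.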