I have 100GB of JSON files in which each line looks like this:
{"field1":100, "field2":200, "field3":[{"in1":20, "in2":"abc"},{"in1":30, "in2":"xyz"}]}
(It's actually a lot more complicated, but this will do as a small demo.)
I want to process it into something in which each line looks like this:
{"field1":100, "field2":200, "abc":20, "xyz":30}
Being extremely new to Hadoop, I just want to know if I'm on the right path:
Referring to this guide: http://www.glennklockwood.com/di/hadoop-streaming.php. For a conventional application I'd create a mapper and a reducer in Python and execute them using something like:
hadoop \
jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
-mapper "python $PWD/mapper.py" \
-reducer "python $PWD/reducer.py" \
-input "wordcount/mobydick.txt" \
-output "wordcount/output"
Now let me know if I'm on the right track:
Since I just need to parse a lot of files into another form, I suppose I don't need any reduce step. I can simply write a mapper (sketched after the list below) which:
- Reads stdin line by line
- Transforms each line according to my specifications
- Writes the transformed line to stdout
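Here is a minimal sketch of what my mapper.py could look like for the demo records above, assuming each input line is one complete JSON object and that the in2 values are unique within a record (the field names come from the simplified example, not the real data):

#!/usr/bin/env python
import json
import sys

def transform(record):
    # Copy every top-level field except field3, then flatten field3 so that
    # each {"in1": ..., "in2": ...} entry becomes an "in2": in1 pair.
    out = {key: value for key, value in record.items() if key != "field3"}
    for entry in record.get("field3", []):
        out[entry["in2"]] = entry["in1"]
    return out

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        record = json.loads(line)
    except ValueError:
        # Skip malformed lines instead of failing the whole task.
        continue
    sys.stdout.write(json.dumps(transform(record)) + "\n")

As far as I understand, the streaming jar also has to be told there are no reducers (e.g. -D mapred.reduce.tasks=0 or -numReduceTasks 0 instead of the -reducer option), otherwise it falls back to an identity reducer.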
Then I can run Hadoop with just this mapper and 0 reducers.
Does this approach seem correct? Will I actually be using the cluster properly, or would this be as bad as running the Python script on a single host?