
I am using Hadoop streaming and I start the job as follows:

../hadoop/bin/hadoop jar ../hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar \
        -mapper ../tests/mapper.php     \
        -reducer ../tests/reducer.php   \
        -input data                     \
        -output out

"data" is 2.5 GB txt file.

However, in ps axf I can see only one mapper. I tried -Dmapred.map.tasks=10, but the result is the same: a single mapper.

How can I make Hadoop split my input file and start several mapper processes?

Nick
  • Your 2.5 GB txt file, is it gzip compressed? Are you running on a pseudo instance of Hadoop (and only have a single map and reduce slot)? – Chris White Nov 28 '12 at 14:46
  • The file is not gzipped, but yes, I did not run any Hadoop daemons, nor do I use HDFS... – Nick Nov 28 '12 at 15:05
  • Tried on a "real" cluster with one node and got the same result: a single mapper process – Nick Nov 28 '12 at 19:19
  • Chris, the problem is exactly that I was in pseudo instance mode. I have configured the single-node cluster correctly and now it is OK. Please post an answer so I can select it ;) – Nick Nov 29 '12 at 07:36

1 Answer


To elaborate on my comments: if your file isn't in HDFS and you're running with the local job runner, then the file will only be processed by a single mapper.
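Per the comments, the fix was to run against a properly configured single-node (pseudo-distributed) cluster instead of the local runner. A minimal sketch of that, assuming a stock Hadoop 1.0.4 install whose conf/core-site.xml and conf/mapred-site.xml already point fs.default.name and mapred.job.tracker at localhost (the exact paths are assumptions, not from the question):

    ../hadoop/bin/hadoop namenode -format   # one-time format of the local HDFS
    ../hadoop/bin/start-all.sh              # start NameNode, DataNode, JobTracker, TaskTracker
    jps                                     # should now list those daemons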

A large file is typically processed by several mappers because it is stored in HDFS as several blocks.

A 2.5 GB file with a block size of 512 MB will be split into ~5 blocks in HDFS. If the file is splittable (plain text, or compressed with a splittable codec such as bzip2, but not gzip), then Hadoop will launch one mapper per block to process the file.
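Once the file actually lives in HDFS you can verify the block count and point the streaming job at the HDFS paths. A rough sketch, assuming the daemons above are running and the PHP scripts are executable with a PHP shebang (the HDFS paths and the -file shipping of the scripts are my additions, not from the original command):

    ../hadoop/bin/hadoop fs -put data data                      # copy the 2.5 GB file into HDFS
    ../hadoop/bin/hadoop fsck /user/$USER/data -files -blocks   # shows how many blocks it occupies

    ../hadoop/bin/hadoop jar ../hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar \
            -file ../tests/mapper.php  -mapper mapper.php   \
            -file ../tests/reducer.php -reducer reducer.php \
            -input data                                     \
            -output out

With the input in HDFS, each of the ~5 blocks gets its own map task, so you should see several mapper processes instead of one.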

Hope this helps explain what you're seeing.

Chris White
  • 29,949
  • 4
  • 71
  • 93