
I am using Hadoop streaming and I start the job as follows:

../hadoop/bin/hadoop jar ../hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar \
        -mapper ../tests/mapper.php     \
        -reducer ../tests/reducer.php   \
        -input data                     \
        -output out

"data" is 2.5 GB txt file.

However, in ps axf I can see only one mapper. I tried -Dmapred.map.tasks=10, but the result is the same: a single mapper.

How can I make Hadoop split my input file and start several mapper processes?

Nick
  • Your 2.5 GB txt file, is it gzip compressed? Are you running on a pseudo instance of Hadoop (and only have a single map and reduce slot)? – Chris White Nov 28 '12 at 14:46
  • The file is not gzipped, but yes, I did not run any Hadoop daemons, nor do I use HDFS... – Nick Nov 28 '12 at 15:05
  • Tried on a "real" cluster with one node and got the same result: a single mapper process – Nick Nov 28 '12 at 19:19
  • Chris, the problem is exactly that I was in pseudo instance mode. I have configured the single-node cluster correctly and now it is OK. Please post an answer so I can select it ;) – Nick Nov 29 '12 at 07:36

1 Answer


To elaborate on my comments: if your file isn't in HDFS and you're running with the local job runner, then the file will only be processed by a single mapper.
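Per the comments, the fix was to run against a properly configured single-node (pseudo-distributed) cluster instead of the local runner. A minimal sketch of that, assuming a stock Hadoop 1.0.4 install whose conf/core-site.xml and conf/mapred-site.xml already point fs.default.name and mapred.job.tracker at localhost (the exact paths are assumptions, not from the question):

    ../hadoop/bin/hadoop namenode -format   # one-time format of the local HDFS
    ../hadoop/bin/start-all.sh              # start NameNode, DataNode, JobTracker, TaskTracker
    jps                                     # should now list those daemons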

A large file is typically processed by several mappers because it is stored in HDFS as several blocks.

A 2.5 GB file with a block size of 512 MB will be split into ~5 blocks in HDFS. If the file is splittable (plain text, or compressed with a splittable codec such as bzip2, but not gzip), then Hadoop will launch one mapper per block to process the file.
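Once the file actually lives in HDFS you can verify the block count and point the streaming job at the HDFS paths. A rough sketch, assuming the daemons above are running and the PHP scripts are executable with a PHP shebang (the HDFS paths and the -file shipping of the scripts are my additions, not from the original command):

    ../hadoop/bin/hadoop fs -put data data                      # copy the 2.5 GB file into HDFS
    ../hadoop/bin/hadoop fsck /user/$USER/data -files -blocks   # shows how many blocks it occupies

    ../hadoop/bin/hadoop jar ../hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar \
            -file ../tests/mapper.php  -mapper mapper.php   \
            -file ../tests/reducer.php -reducer reducer.php \
            -input data                                     \
            -output out

With the input in HDFS, each of the ~5 blocks gets its own map task, so you should see several mapper processes instead of one.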

Hope this helps explain what you're seeing.

Chris White
  • 29,949
  • 4
  • 71
  • 93