
When I run wordcount.py (the mrjob word count example from http://mrjob.readthedocs.org/en/latest/guides/quickstart.html#writing-your-first-job) with Hadoop streaming against a plain text file, it produces the expected output, but when the same job is run against .snappy files I get zero-size output.
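For reference, the job is essentially the word-frequency example from the linked quickstart, i.e. roughly:

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        # one input line in, three counts out
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        # sum the per-line counts for each key
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()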

Options Tried:

[testgen word_count]# cat mrjob.conf 
runners:
  hadoop: # this will work for both hadoop and emr
    jobconf:
      mapreduce.task.timeout: 3600000
      #mapreduce.max.split.size: 20971520
      #mapreduce.input.fileinputformat.split.maxsize: 102400
      #mapreduce.map.memory.mb: 8192
      mapred.map.child.java.opts: -Xmx4294967296
      mapred.child.java.opts: -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/

      java.library.path: /opt/cloudera/parcels/CDH/lib/hadoop/lib/native/

      # "true" must be a string argument, not a boolean! (#323)
      #mapreduce.output.compress: "true"
      #mapreduce.output.compression.codec: org.apache.hadoop.io.compress.SnappyCodec

[testgen word_count]# 

command:

[testgen word_count]# python word_count2.py -r hadoop hdfs:///input.snappy --conf mrjob.conf 
creating tmp directory /tmp/word_count2.root.20151111.113113.369549
writing wrapper script to /tmp/word_count2.root.20151111.113113.369549/setup-wrapper.sh
Using Hadoop version 2.5.0
Copying local files into hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/files/

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

Detected hadoop configuration property names that do not match hadoop version 2.5.0:
They have been translated as follows
 mapred.map.child.java.opts: mapreduce.map.java.opts
HADOOP: packageJobJar: [/tmp/hadoop-root/hadoop-unjar3623089386341942955/] [] /tmp/streamjob3671127555730955887.jar tmpDir=null
HADOOP: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
HADOOP: Total input paths to process : 1
HADOOP: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
HADOOP: Running job: job_201511021537_70340
HADOOP: To kill this job, run:
HADOOP: /opt/cloudera/parcels/CDH//bin/hadoop job  -Dmapred.job.tracker=logicaljt -kill job_201511021537_70340
HADOOP: Tracking URL: http://xxxxx_70340
HADOOP:  map 0%  reduce 0%
HADOOP:  map 100%  reduce 0%
HADOOP:  map 100%  reduce 11%
HADOOP:  map 100%  reduce 97%
HADOOP:  map 100%  reduce 100%
HADOOP: Job complete: job_201511021537_70340
HADOOP: Output: hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/output
Counters from step 1:
  (no counters found)
Streaming final output from hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/output

removing tmp directory /tmp/word_count2.root.20151111.113113.369549
deleting hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549 from HDFS
[testgen word_count]# 

No errors are thrown and the job reports success; I also verified in the job stats that the configuration options from mrjob.conf were actually applied.

Is there any other way to troubleshoot?


2 Answers


I think you are not using the options correctly.

In your mrjob.conf file:

  1. mapreduce.output.compress: "true" means that you want compressed output
  2. mapreduce.output.compression.codec: org.apache.hadoop.io.compress.SnappyCodec means that the output is compressed with the Snappy codec

You are apparently expecting your compressed input to be read correctly by your mappers, but those two options only control how the output is written; they do not affect how the input is read. If you really want to feed your job compressed data, you may look at SequenceFile. A simpler solution would be to feed your job plain text files only.

What about also configuring your input side, e.g. mapreduce.input.compression.codec: org.apache.hadoop.io.compress.SnappyCodec?

[Edit: you should also remove the # symbol at the beginning of the lines that define options; otherwise those options will be ignored.]
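If it helps, the same properties can also be set directly on the job class through mrjob's JOBCONF attribute instead of mrjob.conf. A minimal sketch, reusing the property names from above (all values must be strings, and whether the input-side property is honored depends on your Hadoop version):

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    # Same options as suggested above, set from the job class
    # rather than from mrjob.conf.
    JOBCONF = {
        'mapreduce.output.compress': 'true',
        'mapreduce.output.compression.codec':
            'org.apache.hadoop.io.compress.SnappyCodec',
        'mapreduce.input.compression.codec':
            'org.apache.hadoop.io.compress.SnappyCodec',
    }

    def mapper(self, _, line):
        yield "words", len(line.split())

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()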

Yann

Thanks for your input, Yann, but what finally solved the problem was inserting the line below into the job script.

HADOOP_INPUT_FORMAT='<org.hadoop.snappy.codec>'
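For completeness, a rough sketch of where that line goes: HADOOP_INPUT_FORMAT is a class-level attribute on the MRJob subclass, and the value above is only a placeholder, not a verified input format class name.

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    # Placeholder value from above; substitute the real input format
    # class for your cluster.
    HADOOP_INPUT_FORMAT = '<org.hadoop.snappy.codec>'

    def mapper(self, _, line):
        yield "words", len(line.split())

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()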