I have a Python file that counts bigrams using mrjob on Hadoop (version 2.6.0), but I'm not getting the output I'm hoping for, and I'm having trouble deciphering the output in my terminal to see where I'm going wrong.

My code:

import re

import mrjob.protocol
from mrjob.job import MRJob

regex_for_words = re.compile(r"\b[\w']+\b")

class BiCo(MRJob):
    OUTPUT_PROTOCOL = mrjob.protocol.RawProtocol

    def mapper(self, _, line):
        # Lowercase every word on the line.
        words = regex_for_words.findall(line)
        wordsinline = list()
        for word in words:
            wordsinline.append(word.lower())
        # Pair each word with the word that follows it.
        wordscounter = 0
        totalwords = len(wordsinline)
        for word in wordsinline:
            if wordscounter < (totalwords - 1):
                nextword_pos = wordscounter + 1
                nextword = wordsinline[nextword_pos]
                bigram = word, nextword
                wordscounter += 1
                yield (bigram, 1)

    def combiner(self, bigram, counts):
        yield (bigram, sum(counts))

    def reducer(self, bigram, counts):
        yield (bigram, str(sum(counts)))

if __name__ == '__main__':
    BiCo.run()

I wrote the code in my mapper function (basically, everything up through the "yield" line) on my local machine to make sure it was grabbing bigrams as intended, so I think that part should be working fine... but, of course, something's going awry.
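
For example, running the same pairing logic over a made-up sample line (a quick sanity-check sketch; the sentence is invented) gives the bigrams I expect:

import re

regex_for_words = re.compile(r"\b[\w']+\b")

line = "the boy saw the dog"
words = [w.lower() for w in regex_for_words.findall(line)]
# zip pairs each word with its successor -- the same pairs the mapper loop yields
print(list(zip(words, words[1:])))
# [('the', 'boy'), ('boy', 'saw'), ('saw', 'the'), ('the', 'dog')]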

When I run the code on the Hadoop server, I get the following output (apologies if this is more than necessary - the screen outputs a ton of information and I'm not yet certain which parts will help home in on the problem area):

HADOOP: 2015-10-25 17:00:46,992 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1355)) - Running job: job_1438612881113_6410
HADOOP: 2015-10-25 17:00:52,110 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1376)) - Job job_1438612881113_6410 running in uber mode : false
HADOOP: 2015-10-25 17:00:52,111 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) -  map 0% reduce 0%
HADOOP: 2015-10-25 17:00:58,171 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) -  map 33% reduce 0%
HADOOP: 2015-10-25 17:01:00,184 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) -  map 100% reduce 0%
HADOOP: 2015-10-25 17:01:07,222 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) -  map 100% reduce 100%
HADOOP: 2015-10-25 17:01:08,239 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1394)) - Job job_1438612881113_6410 completed successfully
HADOOP: 2015-10-25 17:01:08,321 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1401)) - Counters: 51
HADOOP:         File System Counters
HADOOP:                 FILE: Number of bytes read=2007840
HADOOP:                 FILE: Number of bytes written=4485245
HADOOP:                 FILE: Number of read operations=0
HADOOP:                 FILE: Number of large read operations=0
HADOOP:                 FILE: Number of write operations=0
HADOOP:                 HDFS: Number of bytes read=1013129
HADOOP:                 HDFS: Number of bytes written=0
HADOOP:                 HDFS: Number of read operations=12
HADOOP:                 HDFS: Number of large read operations=0
HADOOP:                 HDFS: Number of write operations=2
HADOOP:         Job Counters
HADOOP:                 Killed map tasks=1
HADOOP:                 Launched map tasks=4
HADOOP:                 Launched reduce tasks=1
HADOOP:                 Rack-local map tasks=4
HADOOP:                 Total time spent by all maps in occupied slots (ms)=33282
HADOOP:                 Total time spent by all reduces in occupied slots (ms)=12358
HADOOP:                 Total time spent by all map tasks (ms)=16641
HADOOP:                 Total time spent by all reduce tasks (ms)=6179
HADOOP:                 Total vcore-seconds taken by all map tasks=16641
HADOOP:                 Total vcore-seconds taken by all reduce tasks=6179
HADOOP:                 Total megabyte-seconds taken by all map tasks=51121152
HADOOP:                 Total megabyte-seconds taken by all reduce tasks=18981888
HADOOP:         Map-Reduce Framework
HADOOP:                 Map input records=28214
HADOOP:                 Map output records=133627
HADOOP:                 Map output bytes=2613219
HADOOP:                 Map output materialized bytes=2007852
HADOOP:                 Input split bytes=304
HADOOP:                 Combine input records=133627
HADOOP:                 Combine output records=90382
HADOOP:                 Reduce input groups=79518
HADOOP:                 Reduce shuffle bytes=2007852
HADOOP:                 Reduce input records=90382
HADOOP:                 Reduce output records=0
HADOOP:                 Spilled Records=180764
HADOOP:                 Shuffled Maps =3
HADOOP:                 Failed Shuffles=0
HADOOP:                 Merged Map outputs=3
HADOOP:                 GC time elapsed (ms)=93
HADOOP:                 CPU time spent (ms)=7940
HADOOP:                 Physical memory (bytes) snapshot=1343377408
HADOOP:                 Virtual memory (bytes) snapshot=14458105856
HADOOP:                 Total committed heap usage (bytes)=4045406208
HADOOP:         Shuffle Errors
HADOOP:                 BAD_ID=0
HADOOP:                 CONNECTION=0
HADOOP:                 IO_ERROR=0
HADOOP:                 WRONG_LENGTH=0
HADOOP:                 WRONG_MAP=0
HADOOP:                 WRONG_REDUCE=0
HADOOP:         Unencodable output
HADOOP:                 TypeError=79518
HADOOP:         File Input Format Counters
HADOOP:                 Bytes Read=1012825
HADOOP:         File Output Format Counters
HADOOP:                 Bytes Written=0
HADOOP: 2015-10-25 17:01:08,321 INFO  [main] streaming.StreamJob (StreamJob.java:submitAndMonitorJob(1022)) - Output directory: hdfs:///user/andersaa/si601f15lab5_output
Counters from step 1:
  (no counters found)

I'm flummoxed as to why no counters would be found from step 1 (which I assume is the mapper portion of my code, though that might be a false assumption). If I'm reading the Hadoop output correctly, the job makes it at least to the reduce stage (since there are reduce input groups), and there are no shuffle errors. I suspect the answer lies in "Unencodable output: TypeError=79518", but no amount of Google searching has helped me home in on what that error means.

Any help or insights are greatly appreciated.

moskemerak

2 Answers

One problem is in how the bigram key gets encoded on its way out of the job. As coded above, bigram is a Python tuple:

>>> word = 'the'
>>> word2 = 'boy'
>>> bigram = word, word2
>>> type(bigram)
<type 'tuple'>

Usually, plain strings are used as the keys, and RawProtocol in particular expects the key and value to already be strings. Each tuple key coming out of the reducer therefore fails to encode, which is exactly what the "Unencodable output: TypeError=79518" counter is tallying (one per reduce group). So instead, create bigram as a string. One way you could do that is:

bigram = '-'.join((word, nextword))

When I make that change in your program, I see output like this:

automatic-translation   1
automatic-vs    1
automatically-focus 1
automatically-learn 1
automatically-learning  1
automatically-translate 1
available-including 1
available-without   1
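
For reference, the whole mapper could be condensed with that fix applied - a sketch only (zip pairs each word with the one after it, producing the same pairs as your counter loop):

    def mapper(self, _, line):
        words = [w.lower() for w in regex_for_words.findall(line)]
        for word, nextword in zip(words, words[1:]):
            yield '-'.join((word, nextword)), 1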

One other hint: try -q on your command line to silence all of the Hadoop intermediate noise. Sometimes it just gets in the way.
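
For example (bico.py and the input path here are just placeholders for your script and data):

python bico.py -r hadoop -q hdfs:///path/to/input > counts.txt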

HTH.

jeffmcc

This is a cache error. I mostly found this with the Hortonworks sandbox. A simple solution is to log out of the sandbox and ssh in again.

sapy