
I'm trying to run a Python program on Hadoop. The program uses the NLTK library and the Hadoop Streaming API, as described here.

mapper.py:

#!/usr/bin/env python
import sys
import nltk
from nltk.corpus import stopwords

#print stopwords.words('english')

for line in sys.stdin:
    print line,

reducer.py:

#!/usr/bin/env python

import sys
for line in sys.stdin:
    print line,

Console command:

bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output

This runs perfectly, with the output simply containing the lines of the input file.

However, when this line (from mapper.py):

#print stopwords.words('english')

is uncommented, the job fails with:

Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.

I have checked that in a standalone Python program,

print stopwords.words('english')

works perfectly fine, so I am stumped as to why it causes my Hadoop job to fail.
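
One way to see the real error: Hadoop keeps each streaming task's stderr in the task logs, so a debugging version of mapper.py (a sketch, untested on a cluster) can wrap the failing call and print the traceback before exiting:

#!/usr/bin/env python
# debugging sketch: surface the real exception instead of the generic
# "# of failed Map Tasks exceeded allowed limit" message
import sys
import traceback

try:
    from nltk.corpus import stopwords
    stopwords.words('english')  # the call that fails on the cluster
except Exception:
    # stderr from a streaming task ends up in the Hadoop task logs
    traceback.print_exc(file=sys.stderr)
    sys.exit(1)

for line in sys.stdin:
    print line,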

I would greatly appreciate any help! Thank you

  • You don't have the nltk corpus in your hadoop directory. Try this: http://stackoverflow.com/questions/10716302/how-to-import-nltk-corpus-in-hdfs-when-i-use-hadoop-streaming – user1525721 Sep 27 '13 at 18:41
  • Try this also: http://stackoverflow.com/questions/6811549/how-can-i-include-a-python-package-with-hadoop-streaming-job – user1525721 Sep 27 '13 at 18:52
  • @user1525721 Thanks for the replies. Will give it a try and post back. If I have NLTK on all the nodes though, would this still be necessary? – Objc55 Sep 27 '13 at 19:56
  • You provide the path to your mapper and reducer. Similarly, you have to point to your Python libraries in order to use them. – user1525721 Sep 27 '13 at 20:36
  • @user1525721 Thanks for the clarification. Another question -- how come `from nltk.corpus import stopwords` doesn't cause it to fail? – Objc55 Sep 27 '13 at 20:40
  • @user1525721 Update: I have tried the same console command with the addition of `-file nltk_data.zip` and `-archives stopwords.zip`. Unfortunately, having the same problem. – Objc55 Sep 27 '13 at 21:23
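
For reference, the fix both linked questions converge on is to ship the NLTK data with the job and tell NLTK where to find it. A minimal sketch, assuming the zip contains corpora/stopwords/english at its top level and is shipped with -archives (the #nltk_data suffix names the directory it is unpacked into in each task's working directory):

bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -archives nltk_data.zip#nltk_data \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output

and in mapper.py, before touching the corpus:

import nltk
nltk.data.path.append('./nltk_data')  # directory created by -archives ...#nltk_data
from nltk.corpus import stopwords

nltk.data.path is the list of directories NLTK searches for corpora. This also explains why the import alone never fails: NLTK's corpus loaders are lazy, so the data is only looked up when stopwords.words('english') is first called.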

2 Answers


Use zipimport to load the zipped packages:

import zipimport

importer = zipimport.zipimporter('nltk.zip')
importer2 = zipimport.zipimporter('yaml.zip')
yaml = importer2.load_module('yaml')
nltk = importer.load_module('nltk')

Check the links I pasted above; they describe all the steps.

  • Do I need to send these files in the console command or have them stored locally on each machine? Also, do I need nltk.zip or nltk_data.zip? How can I find the former? What role does yaml play in this? Thanks! – Objc55 Sep 30 '13 at 16:31
  • I have tried what you suggested, and have imported nltk and yaml without any problems. However, I still cannot get the stopwords working. `from nltk.corpus import stopwords` does not cause the program to fail, but as soon as I type `print stopwords.words('english')`, it fails. Any idea how to fix? I have included this in the console command: `-archives ./stopwords.zip` Thanks! – Objc55 Sep 30 '13 at 17:23

Is 'english' a file that print stopwords.words('english') reads? If so, you need to use -file for it too, so it gets shipped to the nodes.
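
For context, 'english' is the plain-text stopword list stored at corpora/stopwords/english inside nltk_data, one word per line. So an alternative sketch is to ship just that file (e.g. -file /path/to/nltk_data/corpora/stopwords/english, which places it in the task's working directory) and read it without NLTK at all:

# sketch: read the stopword list shipped with -file; no NLTK needed on the nodes
stop_words = set()
with open('english') as f:
    for word in f:
        stop_words.add(word.strip())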
