
I'm trying to run a Python program on Hadoop. The program uses the NLTK library and the Hadoop Streaming API, as described here.

mapper.py:

#!/usr/bin/env python
import sys
import nltk
from nltk.corpus import stopwords

#print stopwords.words('english')

for line in sys.stdin:
    print line,

reducer.py:

#!/usr/bin/env python

import sys
for line in sys.stdin:
    print line,

Console command:

bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output

This runs perfectly, with the output simply containing the lines of the input file.

However, when this line (from mapper.py):

#print stopwords.words('english')

is uncommented, the job fails with:

Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.

I have checked that in a standalone Python program,

print stopwords.words('english')

works perfectly fine, so I am stumped as to why it causes my Hadoop job to fail.
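
One way to see the real error: Hadoop keeps each streaming task's stderr in the task logs, so a debugging version of mapper.py (a sketch, untested on a cluster) can wrap the failing call and print the traceback before exiting:

#!/usr/bin/env python
# debugging sketch: surface the real exception instead of the generic
# "# of failed Map Tasks exceeded allowed limit" message
import sys
import traceback

try:
    from nltk.corpus import stopwords
    stopwords.words('english')  # the call that fails on the cluster
except Exception:
    # stderr from a streaming task ends up in the Hadoop task logs
    traceback.print_exc(file=sys.stderr)
    sys.exit(1)

for line in sys.stdin:
    print line,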

I would greatly appreciate any help! Thank you

  • You don't have the nltk corpus in your hadoop directory. Try this: http://stackoverflow.com/questions/10716302/how-to-import-nltk-corpus-in-hdfs-when-i-use-hadoop-streaming – user1525721 Sep 27 '13 at 18:41
  • Try this also: http://stackoverflow.com/questions/6811549/how-can-i-include-a-python-package-with-hadoop-streaming-job – user1525721 Sep 27 '13 at 18:52
  • @user1525721 Thanks for the replies. Will give it a try and post back. If I have NLTK on all the nodes though, would this still be necessary? – Objc55 Sep 27 '13 at 19:56
  • You provide the path to your mapper and reducer. Similarly, you have to point to your Python libraries in order to use them. – user1525721 Sep 27 '13 at 20:36
  • @user1525721 Thanks for the clarification. Another question -- how come `from nltk.corpus import stopwords` doesn't cause it to fail? – Objc55 Sep 27 '13 at 20:40
  • @user1525721 Update: I have tried the same console command with the addition of `-file nltk_data.zip` and `-archives stopwords.zip`. Unfortunately, having the same problem. – Objc55 Sep 27 '13 at 21:23
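
For reference, the fix both linked questions converge on is to ship the NLTK data with the job and tell NLTK where to find it. A minimal sketch, assuming the zip contains corpora/stopwords/english at its top level and is shipped with -archives (the #nltk_data suffix names the directory it is unpacked into in each task's working directory):

bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -archives nltk_data.zip#nltk_data \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output

and in mapper.py, before touching the corpus:

import nltk
nltk.data.path.append('./nltk_data')  # directory created by -archives ...#nltk_data
from nltk.corpus import stopwords

nltk.data.path is the list of directories NLTK searches for corpora. This also explains why the import alone never fails: NLTK's corpus loaders are lazy, so the data is only looked up when stopwords.words('english') is first called.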

2 Answers


Use zipimport to load the zipped packages:

import zipimport

importer = zipimport.zipimporter('nltk.zip')
importer2 = zipimport.zipimporter('yaml.zip')
yaml = importer2.load_module('yaml')
nltk = importer.load_module('nltk')

Check the links I pasted above; they describe all the steps.

  • Do I need to send these files in the console command or have them stored locally on each machine? Also, do I need nltk.zip or nltk_data.zip? How can I find the former? What role does yaml play in this? Thanks! – Objc55 Sep 30 '13 at 16:31
  • I have tried what you suggested, and have imported nltk and yaml without any problems. However, I still cannot get the stopwords working. `from nltk.corpus import stopwords` does not cause the program to fail, but as soon as I type `print stopwords.words('english')`, it fails. Any idea how to fix? I have included this in the console command: `-archives ./stopwords.zip` Thanks! – Objc55 Sep 30 '13 at 17:23

Is 'english' a file that print stopwords.words('english') reads? If so, you need to use -file for it too, so it gets shipped to the nodes.
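
For context, 'english' is the plain-text stopword list stored at corpora/stopwords/english inside nltk_data, one word per line. So an alternative sketch is to ship just that file (e.g. -file /path/to/nltk_data/corpora/stopwords/english, which places it in the task's working directory) and read it without NLTK at all:

# sketch: read the stopword list shipped with -file; no NLTK needed on the nodes
stop_words = set()
with open('english') as f:
    for word in f:
        stop_words.add(word.strip())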
