
I'm using Hadoop streaming to run some Python code. I have noticed that if there is an error in my Python code (in mapper.py, for example), I won't be notified about the error. Instead, the mapper program will fail to run, and the job will be killed after a few seconds. Viewing the logs, the only error I see is that mapper.py failed to run or was not found, which is clearly not the case.

My question is: is there a specific log file I can check to see the actual errors in the mapper.py code? (For example, one that would tell me if an import statement failed.)

Thank you!

edit: The command used:

bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output

and the post I am referencing, for which I'd like to see the errors: Hadoop and NLTK: Fails with stopwords
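
edit 2: For reference, a condensed sketch of the kind of mapper.py involved (hypothetical; the real code is in the linked post). If either NLTK line fails, the process dies before emitting anything, which matches the behaviour described above:

```python
#!/usr/bin/env python
# Hypothetical, condensed mapper.py: if either NLTK line below raises, the
# process exits before printing anything, and Hadoop only reports that
# mapper.py failed to run.
import sys
from nltk.corpus import stopwords   # fails if NLTK is missing on the node

print stopwords.words('english')    # fails if the 'english' corpus data is absent

for line in sys.stdin:
    print line.strip()              # otherwise just echo each input line
```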

  • Not a real solution, but I find it useful to just test both the mapper and the reducer locally, with a small subset of the data. For example: `cat testData | ./mapper.py | ./reducer.py` (a sketch of such a pipeline follows these comments). – vinaut Sep 30 '13 at 17:45
  • @vinaut Thanks for the reply. I actually have done this, but I've run into a script that works locally, but not when run through Hadoop. – Objc55 Sep 30 '13 at 17:50
  • Ah, ok :). What is the test data like, is every line a complete record? – vinaut Sep 30 '13 at 18:02
  • @vinaut Yeah. The input file consists of a few lines, with each line just having a number. It's very basic, I'm just trying to understand the basic concepts. The issue I'm having is strange, though. If you're interested, I've just posted about it here: http://stackoverflow.com/questions/19057741/hadoop-and-nltk-fails-with-stopwords – Objc55 Sep 30 '13 at 18:06
  • So, the mapper is just printing the input as it is, a line at a time? – vinaut Sep 30 '13 at 18:17
  • @vinaut Yeah, exactly – Objc55 Sep 30 '13 at 18:23
  • How are you executing the hadoop job? Can you please show the command? – SSaikia_JtheRocker Sep 30 '13 at 18:30
  • @JtheRocker I've edited the post with this information. Thanks! – Objc55 Sep 30 '13 at 18:33
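
To illustrate vinaut's local pipe test (a hedged sketch, not code from the thread): paired with the echo-style mapper above, a minimal word-count reducer.py such as the following would complete the `cat testData | ./mapper.py | ./reducer.py` pipeline:

```python
#!/usr/bin/env python
# Hypothetical minimal reducer.py for the local pipe test: sums the counts
# for each tab-separated key coming from the mapper; keys arriving without
# an explicit count are treated as count 1.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, _, value = line.rstrip('\n').partition('\t')
    count = int(value) if value else 1
    if key == current_key:
        total += count
    else:
        if current_key is not None:
            print "%s\t%d" % (current_key, total)
        current_key, total = key, count

if current_key is not None:
    print "%s\t%d" % (current_key, total)
```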

2 Answers


About the log question, see if this helps:

MapReduce: Log file locations for stdout and std err

I suppose that if the Python file fails to run, the interpreter will print a traceback to stderr, and you should see it in the stderr log of that task's node.
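
As a sketch of that idea (hedged; the structure is illustrative, not tested on your cluster), you can also force the traceback into the stderr log yourself and exit non-zero so the task attempt visibly fails:

```python
#!/usr/bin/env python
# Sketch: catch any exception, including import errors, and write the full
# traceback to stderr, where it ends up in the task attempt's stderr log.
import sys
import traceback

def run():
    from nltk.corpus import stopwords   # imported here so failures are caught too
    print stopwords.words('english')
    for line in sys.stdin:
        print line.strip()

try:
    run()
except Exception:
    traceback.print_exc(file=sys.stderr)  # shows up in the stderr task log
    sys.exit(1)                           # non-zero exit marks the attempt as failed
```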

  • Thank you for the reply. When I check that log, I see this message: `java.io.IOException: Cannot run program "mapper.py": error=2, No such file or directory`. I am confused by this though, as the program clearly is being run when `print stopwords.words('english')` is commented out. – Objc55 Sep 30 '13 at 18:56
  • So you are printing to stdout a list of words, and then a bunch of lines with just numbers in it? I don't know if hadoop will understand that; the mapper should print consistent data, line after line, in the default format `key\tvalue`. I admit that this doesn't explain the error you are getting, but try to make a loop over the word list and, for each word, `print word,"\t 1"`, and nothing else. – vinaut Sep 30 '13 at 19:04
  • Sorry for the delayed reply. I think it should understand a print command before the `for` loop, though. When I type `print "hello"` instead of `print stopwords.words('english')`, it works fine. For some reason, even though `from nltk.corpus import stopwords` works successfully, `print stopwords.words('english')` causes an error. – Objc55 Sep 30 '13 at 19:44
  • I'm not so sure. Whatever you print, it should interpret it as valid data to process. It might be ok if you just print a word like hello (it would interpret it as a key "hello" with a NULL value), but not if you print an entire list. It's just a supposition, though. But I'm fairly certain that in general, unless configured otherwise, your mapper and reducer should print only consistent data. When you are using streaming, Hadoop is "dumb": it assumes it is being fed data in the form it expects. If you need to print something that is not meant for the mapper's output, save it to a file instead. – vinaut Sep 30 '13 at 20:15
  • I just tested it, and it does indeed also work with a test list, so I think it should work here, too. The test list, however, did print twice for some weird reason as described here: http://stackoverflow.com/questions/19011036/hadoop-output-file-has-double-output – Objc55 Sep 30 '13 at 20:29
  • I'm sorry then, I can't think of anything else :). – vinaut Sep 30 '13 at 20:43

Edit after comments:

Assuming you are running Hadoop in a fully distributed environment and you know how to configure NLTK for Python on a node: the NLTK package needs to be present on all the nodes of the cluster for the Python import `from nltk.corpus import stopwords` and the call `stopwords.words('english')` to actually work.

In my view, NLTK needs to be manually configured on all the nodes of the cluster for the mapper.py Python script to actually work.

If this doesn't help, please try this link, which, if you scroll down, talks specifically about NLTK as an example.
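
As a hedged sketch (the path and layout are assumptions, not something verified on your cluster): if the NLTK data is shipped with the job and ends up in the task's working directory with an `nltk_data/corpora/stopwords` layout, the mapper can point NLTK at it before importing the corpus:

```python
#!/usr/bin/env python
# Hypothetical sketch: extend NLTK's search path to the job's working
# directory, assuming an nltk_data/corpora/stopwords layout was shipped
# alongside the job (the comments below try -file with a zipped stopwords
# folder).
import sys
import nltk

nltk.data.path.append('./nltk_data')     # search next to the task first
from nltk.corpus import stopwords

english_stops = set(stopwords.words('english'))
for line in sys.stdin:
    for word in line.strip().split():
        if word.lower() not in english_stops:
            print "%s\t1" % word         # standard streaming output: key<TAB>value
```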

  • Thank you for the reply. I have tried this and unfortunately, am running into the same issue. The mapper works perfectly when the line `print stopwords.words('english')` is commented out. In fact, it's even able to import the stopwords library without error. However, it seems to be the uncommenting of that line that's leading to the job stopping at 0% map, 0% reduce. – Objc55 Sep 30 '13 at 18:46
  • Did you get the *No such file or directory* error message again, with what I proposed? – SSaikia_JtheRocker Sep 30 '13 at 19:06
  • Quick question, why are you printing the stop words? – SSaikia_JtheRocker Sep 30 '13 at 19:12
  • Sorry for the delay. Yes, I unfortunately get that same error message with your method. My full program actually isn't going to print the stopwords. As you said, I will use them for filtering. I've simplified it for the moment to just printing them so I can make sure the program can access those words without error. – Objc55 Sep 30 '13 at 19:42
  • Another question: is **'english'** a file in `print stopwords.words('english')`? If yes, you need to use `-file` for that too, to send it across the nodes. – SSaikia_JtheRocker Sep 30 '13 at 19:51
  • I think so, but I'm not sure... the stopwords folder contains a file called english. I have zipped this folder and added `-file stopwords.zip` to the console command. Trying that, I am now met with the following error: `Resource 'corpora/stopwords' not found. Please use the NLTK Downloader to obtain the resource: >>> nltk.download()` However, I have already done this on all the nodes. – Objc55 Sep 30 '13 at 19:59
  • I have updated the answer a bit. Please have a look and come up with questions if any. – SSaikia_JtheRocker Sep 30 '13 at 21:30
  • Thanks for the reply. I actually have already installed NLTK on each node and verified that each node can print the stopwords independently. – Objc55 Sep 30 '13 at 21:34
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/38374/discussion-between-jtherocker-and-objc55) – SSaikia_JtheRocker Sep 30 '13 at 21:46
  • Actually, you were right when you said `Another question, is 'english' a file in print stopwords.words('english')? If yes you need to use -file for that too to send it across the nodes.` I seem to have done that incorrectly the first time and now it works. If you'd like to post that same answer on the relevant thread I will accept it. Thanks! – Objc55 Sep 30 '13 at 21:52
  • I have added a new link, that talks specifically about NLTK in Hadoop. Please have a look. – SSaikia_JtheRocker Sep 30 '13 at 21:54
  • It was the -file parameter, which I had incorrectly entered before. Thank you for the help! – Objc55 Oct 01 '13 at 06:25