5

Usually, I can open a file with something like this:

aDict = {}
with open('WordLists/positive_words.txt', 'r') as f:
    aDict['positive'] = {line.strip() for line in f}

with open('WordLists/negative_words.txt', 'r') as f:
    aDict['negative'] = {line.strip() for line in f}

This opens the two text files in the WordLists folder and stores each stripped line in a set under the dictionary's 'positive' or 'negative' key.
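(Just to illustrate how the dictionary gets used: the scoring snippet below is only a sketch, not my actual mapper code.)

def score_words(words, word_sets):
    # purely illustrative: count how many words fall into each sentiment set
    positive = sum(1 for w in words if w in word_sets['positive'])
    negative = sum(1 for w in words if w in word_sets['negative'])
    return positive, negative

print(score_words("this film was good".split(), aDict))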

When I want to run a MapReduce job within Hadoop, however, I don't think this works. I am running my program like so:

./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -file hadoop_map.py -mapper hadoop_reduce.py -input /toBeProcessed -output /Completed

I have tried to change the code to this:

with open('/mapreduce/WordLists/negative_words.txt', 'r')

where mapreduce is a folder on HDFS, with WordLists a subfolder containing the negative words. But my program doesn't find this. Is what I'm doing possible, and if so, what is the correct way to load files from HDFS?

Edit

I've now tried:

with open('hdfs://localhost:9000/mapreduce/WordLists/negative_words.txt', 'r')

This seems to do something, but now I get this sort of output:

13/08/27 21:18:50 INFO streaming.StreamJob:  map 0%  reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob:  map 50%  reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob:  map 0%  reduce 0%

Then the job fails, so it's still not right. Any ideas?

Edit 2:

Having re-read the API documentation, I notice I can use the -files option on the command line to specify files. The documentation states:

The -files option creates a symlink in the current working directory of the tasks that points to the local copy of the file.

In this example, Hadoop automatically creates a symlink named testfile.txt in the current working directory of the tasks. This symlink points to the local copy of testfile.txt.

-files hdfs://host:fs_port/user/testfile.txt
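If I understand that correctly, the mapper should then be able to open the symlinked file by name from its current working directory. A minimal sketch of what I assume that looks like (not verified):

with open('testfile.txt', 'r') as f:
    words = {line.strip() for line in f}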

Therefore, I run:

./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -files hdfs://localhost:54310/mapreduce/SentimentWordLists/positive_words.txt#positive_words -files hdfs://localhost:54310/mapreduce/SentimentWordLists/negative_words.txt#negative_words -file hadoop_map.py -mapper hadoop_map.py -input /toBeProcessed -output /Completed

From my understanding of the API, this creates symlinks so I can use "positive_words" and "negative_words" in my code, like this:

with open('negative_words.txt', 'r')

However, this still doesn't work. Any help anyone can offer would be hugely appreciated as I can't do much until I solve this.

Edit 3:

I can use this command:

-file ~/Twitter/SentimentWordLists/positive_words.txt

along with the rest of my command to run the Hadoop job. This finds the file on my local system rather than HDFS. This doesn't throw any errors, so it's accepted somewhere as a file. However, I've no idea how to access the file.

Andrew Martin
  • Maybe `with open('negative_words', 'r')` ? – Alfonso Nishikawa Aug 28 '13 at 06:52
  • Unfortunately not - still complains – Andrew Martin Aug 28 '13 at 07:36
  • Try `-files hdfs://localhost:54310/mapreduce/SentimentWordLists/positive_words.txt#positive_words,hdfs://localhost:54310/mapreduce/SentimentWordLists/negative_words.txt#negative_words` – Alfonso Nishikawa Aug 28 '13 at 07:41
  • Tried that one too. I've deleted all code within the mapper class that touches those files to see if the actual import is even working - it isn't! That just brings up the same streaming command error – Andrew Martin Aug 28 '13 at 08:12
  • Tried both (my comments) together? – Alfonso Nishikawa Aug 28 '13 at 08:22
  • Yeah, but if I omit the first one and JUST use the -files part to import the files, an error occurs - meaning there's no point trying to open the files within the mapper class yet as they're not even being properly imported so are unreachable. – Andrew Martin Aug 28 '13 at 08:23
  • a) What does log say? (if any) b) Shouldn't be command line `-mapper hadoop_map.py` ? – Alfonso Nishikawa Aug 28 '13 at 08:35
  • Whoops that was a typo by me. It is hadoop_map. Log just says: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 – Andrew Martin Aug 28 '13 at 08:39
  • I've added another edit to show where I'm at now! – Andrew Martin Aug 28 '13 at 08:41
  • Sending with `-file` will put the file in the task's working directory. Try with the first answers of http://stackoverflow.com/questions/4339788/hadoop-streaming-unable-to-find-file-error – Alfonso Nishikawa Aug 28 '13 at 08:52
  • Does your configuration have: `mapred.create.symlinkyes`?. You can check for created symlinks in `logs/userlogs/.../syslog` – Alfonso Nishikawa Aug 28 '13 at 08:57
  • I don't have your last comment, but I don't need it. You're a star! Following the first answer in that other link, of using this: sys.path.append('.'), meant that I could use: with open('positive_words.txt', 'r') as f: sentimentDict['positive'] = {line.strip() for line in f} – Andrew Martin Aug 28 '13 at 08:58
  • So everything's working perfectly. Thank you so much for your time and effort! If you want, feel free to add this as an answer and I'll gladly accept :) – Andrew Martin Aug 28 '13 at 08:58
  • Thank you! I learn new details too :) – Alfonso Nishikawa Aug 28 '13 at 09:13

2 Answers

3

Solution after plenty of comments :)

Reading a data file in Python: send it with -file and add the following to your script:

import sys

Sometimes it is also necessary to add, after the import:

sys.path.append('.')

(related to @DrDee's comment on Hadoop Streaming - Unable to find file error)
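A minimal sketch of a streaming mapper putting this together; the word-list reading mirrors the code in the question, while the stdin loop and output format are just an illustrative guess:

#!/usr/bin/env python
import sys

# add the task's current working directory to Python's module search path,
# as suggested above
sys.path.append('.')

# files shipped with -file are copied into the task's working directory,
# so they can be opened by their plain file names
sentimentDict = {}
with open('positive_words.txt', 'r') as f:
    sentimentDict['positive'] = {line.strip() for line in f}
with open('negative_words.txt', 'r') as f:
    sentimentDict['negative'] = {line.strip() for line in f}

# emit a tab-separated (word, sentiment) pair for every matching word on stdin
for line in sys.stdin:
    for word in line.strip().split():
        if word in sentimentDict['positive']:
            print('%s\t%s' % (word, 'positive'))
        elif word in sentimentDict['negative']:
            print('%s\t%s' % (word, 'negative'))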

Alfonso Nishikawa
1

When dealing with HDFS programmatically, you should look into FileSystem, FileStatus, and Path. These are Hadoop API classes that allow you to access HDFS from within your program.

Daniel Imberman