2

I am trying to load a JSON file inside the mapper function, but it fails with "No such file or directory" even though the file exists.

I am already opening one file and parsing its lines, but I want to compare some of its values against a second JSON file.

from mrjob.job import MRJob
import json
import nltk
import re    

WORD_RE = re.compile(r"\b[\w']+\b")
sentimentfile = open('sentiment_word_list_stemmed.json') 

def mapper(self, _, line):
    stemmer = nltk.PorterStemmer()
    stems = json.loads(sentimentfile)

    line = line.strip()
    # each line is a json line
    data = json.loads(line)
    form = data.get('type', None)

    if form == 'review':
      bs_id = data.get('business_id', None)
      text = data['text']
      stars = data['stars']

      words = WORD_RE.findall(text)
      for word in words:
        w = stemmer.stem(word)
        senti = stems.get[w]

        if senti:
          yield (bs_id, (senti, 1))
Nicolas Hung

2 Answers

5

You should not be opening a file in the mapper function at all. You only need to pass the file in as STDIN or as the first argument for the mapper to pick it up. Do it like this:

python mrjob_program.py sentiment_word_list_stemmed.json > output

OR

python mrjob_program.py < sentiment_word_list_stemmed.json > output

Either one will work. The "no such file or directory" error occurs because the mappers cannot see the file you are specifying: they are designed to run on remote machines. Even if you wanted to read from a file in the mapper, you would need to copy that file to every machine in the cluster, which doesn't really make sense for this example. You can also specify a DEFAULT_INPUT_PROTOCOL so that the mapper knows which type of input you are using.

Here is a talk on the subject that will help:

http://blip.tv/pycon-us-videos-2009-2010-2011/pycon-2011-mrjob-distributed-computing-for-everyone-4898987/

0

You are calling json.loads(), which expects a string, but passing it an open file object. Use json.load() instead (note: no s).

stems = json.load(sentimentfile)
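
To make the difference concrete, here is a minimal sketch. The in-memory file and its contents are hypothetical stand-ins, since the real sentiment file is not shown in the question.

```python
import io
import json

# Hypothetical stand-in for the opened sentiment file.
sentimentfile = io.StringIO('{"good": 1, "bad": -1}')

# json.load() reads and parses from a file-like object.
stems = json.load(sentimentfile)

# json.loads() expects a str (or bytes) and rejects a file object.
sentimentfile.seek(0)
try:
    json.loads(sentimentfile)
    loads_error = None
except TypeError as exc:
    loads_error = exc
```

Here stems ends up as an ordinary dict, while the json.loads() call raises a TypeError that is captured in loads_error.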

You would need to re-open the file every time your mapper() function is called, so it is better to store just the filename globally:

sentimentfile = 'sentiment_word_list_stemmed.json'

def mapper(self, _, line):
    stemmer = nltk.PorterStemmer()
    # open the file on each call and close it again once it is parsed
    with open(sentimentfile) as f:
        stems = json.load(f)
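
If re-parsing the file on every call turns out to be slow, one possible variation (not part of the original code) is to cache the parsed dictionary after the first load. The helper name load_stems and the sample file below are hypothetical, used only for illustration.

```python
import functools
import json
import os
import tempfile


@functools.lru_cache(maxsize=None)
def load_stems(path):
    """Parse the JSON file at path once; later calls reuse the cached dict."""
    with open(path) as f:
        return json.load(f)


# Hypothetical sample file standing in for the real sentiment list.
fd, sample_path = tempfile.mkstemp(suffix='.json')
with os.fdopen(fd, 'w') as f:
    json.dump({'good': 1, 'bad': -1}, f)

first = load_stems(sample_path)   # reads and parses the file
second = load_stems(sample_path)  # served from the cache, no file access
senti = first.get('good')         # dict.get is called with parentheses
```

Inside the mapper you would then call load_stems(sentimentfile) instead of opening the file directly. The cache is per process, so each mapper process still reads the file once.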

Last but not least, you should use an absolute path to the file rather than rely on the current working directory being correct.
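
A minimal sketch of building such an absolute path follows. The helper absolute_path is hypothetical, and a temporary directory stands in for the directory that really holds the file; in a script you might use os.path.dirname(os.path.abspath(__file__)) as the base instead. This only helps when the job runs on a machine that actually has the file.

```python
import json
import os
import tempfile


def absolute_path(filename, base_dir):
    """Join filename onto base_dir and return a normalized absolute path."""
    return os.path.abspath(os.path.join(base_dir, filename))


# Assumption: a temporary directory plays the role of the data directory.
base_dir = tempfile.mkdtemp()
sentimentfile = absolute_path('sentiment_word_list_stemmed.json', base_dir)

# Write hypothetical sample data so the sketch can run anywhere.
with open(sentimentfile, 'w') as f:
    json.dump({'good': 1}, f)

# Changing the working directory no longer affects the lookup.
os.chdir(tempfile.gettempdir())
with open(sentimentfile) as f:
    stems = json.load(f)
```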

Martijn Pieters