
Here's my code:

class ReviewCategoryClassifier(object):
    @classmethod
    def load_data(cls, input_file):
        job = category_predictor.CategoryPredictor()
        category_counts = None
        word_counts = {}

        with open(input_file) as src:
            for line in src:
                category, counts = job.parse_output_line(line)

    def __init__(self, input_file):
        """input_file: the output of the CategoryPredictor job."""
        category_counts, word_counts = self.load_data(input_file)

        self.word_given_cat_prob = {}
        for cat, counts in word_counts.iteritems():
            self.word_given_cat_prob[cat] = self.normalize_counts(counts)

        # filter out categories which have no words
        seen_categories = set(word_counts)
        seen_category_counts = dict((cat, count) for cat, count in
                                    category_counts.iteritems()
                                    if cat in seen_categories)
        self.category_prob = self.normalize_counts(seen_category_counts)

if __name__ == "__main__":
    input_file = sys.argv[1]
    text = sys.argv[2]
    guesses = ReviewCategoryClassifier(input_file).classify(text)

By the way, CategoryPredictor is a job from an mrjob project.

Whenever I run

python predict.py yelp_academic_dataset_review.json 'I like donut'

on the command line, I get this error:

TypeError: Can't convert 'bytes' object to str implicitly

But `line` is a string, not a bytes object. What did I do wrong?

Here's the full traceback:

Traceback (most recent call last):
  File "predict.py", line 116, in <module>
    guesses = ReviewCategoryClassifier(input_file).classify(text)
  File "predict.py", line 65, in __init__
    category_counts, word_counts = self.load_data(input_file)
  File "predict.py", line 44, in load_data
    category, counts = job.parse_output_line(line)
  File "//anaconda/lib/python3.5/site-packages/mrjob/job.py", line 961, in parse_output_line
    return self.output_protocol().read(line)
  File "//anaconda/lib/python3.5/site-packages/mrjob/protocol.py", line 84, in read
    raw_key, raw_value = line.split(b'\t', 1)
TypeError: Can't convert 'bytes' object to str implicitly
Candice Zhang
  • `b'\t'` is a `bytes` object, not a `str`. Try using `with open(input_file, 'r', encoding="utf-8") as src`. – stamaimer May 16 '17 at 02:13

1 Answer

You need to pass bytes to `MRJob.parse_output_line`. Open `input_file` in binary mode:

with open(input_file, 'rb') as src:
    for line in src:
        category, counts = job.parse_output_line(line)

or encode the line before passing to the method:

with open(input_file) as src:
    for line in src:
        category, counts = job.parse_output_line(line.encode())
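
To see why the mode matters without running mrjob, here is a minimal sketch. The `read_tab_separated` helper and the sample line are hypothetical stand-ins: the only part taken from the traceback is the `line.split(b'\t', 1)` call, and decoding both halves as JSON is an assumption about the output format.

```python
import json


def read_tab_separated(line):
    """Hypothetical stand-in for the protocol's read(): split on a bytes tab."""
    raw_key, raw_value = line.split(b'\t', 1)  # same call as in the traceback
    # Assumption: key and value are each JSON-encoded.
    return (json.loads(raw_key.decode('utf-8')),
            json.loads(raw_value.decode('utf-8')))


# Made-up example of what a text-mode read yields: a str, not bytes.
text_line = '"restaurants"\t{"donut": 3}\n'

try:
    read_tab_separated(text_line)
    text_mode_error = None
except TypeError as exc:
    # str.split() refuses a bytes separator
    text_mode_error = exc

# Encoding the line (or reading the file with 'rb') gives bytes, which works.
key, value = read_tab_separated(text_line.encode('utf-8'))
```

A `str` line fails on the split itself, before any parsing happens, which is why the error points inside `protocol.py` rather than at your own code.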
falsetru
  • Hi: if I add 'rb', the error becomes: `raw_key, raw_value = line.split(b'\t', 1)` `ValueError: not enough values to unpack (expected 2, got 1)` – Candice Zhang May 16 '17 at 02:33
  • @CandiceZhang, is `input_file` the MRJob output? – falsetru May 16 '17 at 02:35
  • Hi: I was just wondering the same thing. The input file is the first argument on my command line, and it is a data file, not output from the MRJob. How can I pass MRJob output on the command line? Thanks! – Candice Zhang May 16 '17 at 02:38
  • @CandiceZhang, If you follow the link in the answer, you will see an example. – falsetru May 16 '17 at 02:39
  • Hi: in my load_data function, the first step is `job = category_predictor.CategoryPredictor()`. Since the input to the CategoryPredictor class is a JSON file, doesn't the later loop (`with open(input_file) as src: for line in src: category, counts = job.parse_output_line(line)`) mean that `input_file` should be a JSON file rather than the output of CategoryPredictor, the MRJob? – Candice Zhang May 16 '17 at 02:51
  • @CandiceZhang, I cannot tell with the current code. BTW, it's hard to read the code in the comment – falsetru May 16 '17 at 02:55
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/144301/discussion-between-candice-zhang-and-falsetru). – Candice Zhang May 16 '17 at 03:01
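
The `ValueError` reported in the comments can be reproduced in isolation. This is only a sketch: both sample lines are made up, and `has_key_and_value` is a hypothetical helper, not part of mrjob. It assumes job output lines hold a key and a value separated by a tab, as the `split(b'\t', 1)` call in the traceback suggests, while a line from the raw review file is a single JSON object with no tab.

```python
# Made-up samples: one line from a raw review file, one from job output.
raw_review_line = b'{"text": "I like donut", "stars": 5}\n'
job_output_line = b'"restaurants"\t{"donut": 3}\n'


def has_key_and_value(line):
    """Hypothetical check: does the line split into exactly key and value?"""
    return len(line.split(b'\t', 1)) == 2


try:
    # No tab in the line, so split() returns one item and unpacking fails.
    raw_key, raw_value = raw_review_line.split(b'\t', 1)
    unpack_error = None
except ValueError as exc:
    unpack_error = exc
```

Here `has_key_and_value(job_output_line)` is true and `has_key_and_value(raw_review_line)` is false, which is consistent with the classifier needing the job's output file rather than the original dataset.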