Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that assists the creation and running of Hadoop Streaming jobs

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
331 questions
1
vote
1 answer

How do you point mrjob EMR to the right AWS account? I keep getting an "SSH key invalid" message

I have set .mrjob.conf like this (passwords changed): runners: emr: aws_access_key_id: JKDJKAJSLKJAFKLJ aws_secret_access_key: RKLJDKAS/KLASJKFJKSJAKSALLKLKS ec2_key_pair: me-east ec2_key_pair_file: /Users/me/.ssh/me-east.pem …
TheSneak
  • 500
  • 6
  • 15
1
vote
1 answer

Python3 MRJob outputs unsorted key-value pairs

Context: Python 3.6.3 :: Anaconda custom (64-bit), mrjob==0.6.2 with no custom configuration, running locally. I am implementing the basic word count example for a local map reduce job. My mapper maps a 1 to each word in each line of a book from a .txt…
ekauffmann
  • 150
  • 10
1
vote
0 answers

Running Mr. Job in Hadoop mode. "Error launching job , bad input path : File does not exist:"

I'm running Apache Hadoop 3.1.0 in pseudo-distributed mode using the default configurations from the Wiki. I've created a simple Python program that counts the article tag in the dblp.xml file, posted below: from mrjob.job import MRJob import…
1
vote
1 answer

Python Map Reduce Mr job

I am new to Python programming, so excuse me in advance if I ask something that is easily solved. I want to use MapReduce to process a csv file that has some values, and the output must return the maximum value. This is the script I've written so…
Kyr
  • 31
  • 5
1
vote
0 answers

python mapreduce mrjob max value

Given a csv file where each line contains a set of numbers, I want to write a map reduce program that determines the maximum of all the numbers in the file. Let's say the csv file is 3,4 5,6; the script should return 6. from mrjob.job import…
Kyr
  • 31
  • 5
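A minimal sketch of the idea behind this question, written in plain Python rather than with mrjob itself so it runs without a cluster. The function names (mapper, reducer, run) and the single shared key "max" are assumptions made for illustration, not the asker's code: every number is emitted under one key so that a single reducer call sees all of them.

```python
import csv
import io


def mapper(line):
    """Emit every number on one CSV line under a single shared key."""
    for row in csv.reader(io.StringIO(line)):
        for field in row:
            field = field.strip()
            if field:  # skip empty fields such as "3,,4"
                yield "max", float(field)


def reducer(key, values):
    """Collapse all values seen for a key to their maximum."""
    yield key, max(values)


def run(lines):
    """Simulate the shuffle step: group mapper output by key, then reduce."""
    grouped = {}
    for line in lines:
        for key, value in mapper(line):
            grouped.setdefault(key, []).append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


# The two-line file from the question: "3,4" and "5,6".
result = run(["3,4", "5,6"])
print(result)
```

With the sample input from the excerpt the maximum is 6 (here parsed as a float). In a real job the same two functions would become the mapper and reducer methods of the job class.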
1
vote
1 answer

How to implement mapreduce pairs pattern in python

I am trying to implement the MapReduce pairs pattern in Python. I need to check whether a word is in a text file, then find the word next to it and yield a pair of both words. I keep running into either: neighbors = words[words.index(w) + 1] ValueError:…
Jackob
  • 11
  • 4
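One possible way to emit the pairs, sketched in plain Python under the assumption that "next to it" means the word immediately to the right. The helper name adjacent_pairs is hypothetical. Walking the line with enumerate avoids two pitfalls of an index-based lookup: it handles a word that occurs more than once, and it never reads past the last word.

```python
def adjacent_pairs(line, target):
    """Yield (target, next_word) for every occurrence of target in line."""
    words = line.split()
    # Stop one short of the end: the last word has no right-hand neighbour.
    for i, word in enumerate(words[:-1]):
        if word == target:
            yield (word, words[i + 1])


pairs = list(adjacent_pairs("the cat and the dog", "the"))
print(pairs)
```

Inside a mapper, each yielded pair could serve as the key with a count of 1 as the value, which is the usual shape of the pairs pattern.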
1
vote
0 answers

Python mrjob running locally but missing lots of data

I can run Python mrjob locally and it's much faster. But when I look at the output, a lot of data is missing. I'm wondering whether this is because a function in my code takes longer to run, and therefore all…
Cherry Wu
  • 3,844
  • 9
  • 43
  • 63
1
vote
1 answer

Multiple input files for each mapper 'type'

I am trying to run a job where each mapper 'type' receives a different input file. I know there is a way to do this in Java using the MultipleInputs class, like so: MultipleInputs.addInputPath(job,new…
1
vote
2 answers

What is the location of the mrjob.conf file?

My mrjob with hadoop streaming fails. I have a hadoop sandbox on oracle vm with python module mrjob. Need to make some changes in mrjob.conf as suggested in Hadoop Error: Error launching job , bad input path : File does not exist.Streaming Command…
Namrata Tolani
  • 823
  • 9
  • 12
1
vote
1 answer

Python: Can't convert 'bytes' object to str implicitly

Here's my code: class ReviewCategoryClassifier(object): @classmethod def load_data(cls, input_file): job = category_predictor.CategoryPredictor() category_counts = None word_counts = {} with…
Candice Zhang
  • 211
  • 1
  • 3
  • 10
1
vote
0 answers

Amazon EMR streaming Could not find any valid local directory for output

I am getting the following logs (at the bottom of the question) in the stderr of a failed EMR process. Can someone explain what is going on? I cannot understand the traceback properly. What is the solution? I am using the Python mrjob framework to run the streaming…
shreyas
  • 2,510
  • 4
  • 19
  • 20
1
vote
1 answer

MRJob-Finding the length of values for reducer

I am writing a program based on MapReduce using MRJob. I have a question about the parameters of the reducer. As you know, the reducer function takes two parameters, key and values. I want to find the length of values without writing any loop condition…
ugur
  • 400
  • 6
  • 20
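A short sketch of one way to count the values, assuming that values reaches the reducer as a one-pass iterator rather than a list (so len() cannot be called on it directly). The helper name count_values is hypothetical, and the code is plain Python rather than an mrjob job.

```python
def count_values(values):
    """Count the items in a one-pass iterator without an explicit loop body."""
    return sum(1 for _ in values)


def reducer(key, values):
    """Emit the number of values that were grouped under key."""
    yield key, count_values(values)


output = list(reducer("word", iter([1, 1, 1])))
print(output)
```

Counting consumes the iterator, so if the values themselves are needed afterwards they would have to be materialised first, for example with list(values), at the cost of holding them in memory.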
1
vote
2 answers

Map/reduce two-stage ordering of counts

This python3 program attempts to produce a frequency list of words from a text file using map/reduce. I would like to know how to order the word counts, represented as 'count' in the second reducer's yield statement so that the largest count values…
Rick Lentz
  • 503
  • 6
  • 18
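The two-stage idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the asker's program: the names stage1 and stage2 are hypothetical, stage 1 counts words, and stage 2 gathers every (count, word) pair in one place so they can be sorted with the largest counts first.

```python
from collections import defaultdict


def stage1(lines):
    """First stage: count how often each word occurs."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
    return counts.items()


def stage2(word_counts):
    """Second stage: collect all pairs together and order by count, descending."""
    pairs = [(count, word) for word, count in word_counts]
    # Negate the count for descending order; ties fall back to the word.
    for count, word in sorted(pairs, key=lambda p: (-p[0], p[1])):
        yield word, count


ranked = list(stage2(stage1(["a b a", "c a b"])))
print(ranked)
```

In a distributed job the second stage only produces a global ordering if all pairs reach the same reducer, which is why the sketch collects them under what would be a single key.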
1
vote
0 answers

no matches found when accessing multiple files for hadoop job on EMR

I'm trying to run a hadoop job on AWS EMR that I execute locally using python on files in s3. I cannot seem to be able to access multiple files using *. I want to be able to access all files from the folder 01 on. This code works on all files in…
1
vote
1 answer

MapReduce: ValueError: too many values to unpack (expected 2)

I'm running the following Python code in MapReduce: from mrjob.job import MRJob import collections bigram = collections.defaultdict(float) unigram = collections.defaultdict(float) class MRWordFreqCount(MRJob): def mapper(self, _, line): …
Reddspark
  • 6,934
  • 9
  • 47
  • 64