Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which lets you buy time on a Hadoop cluster on an hourly basis. It also works with your own Hadoop cluster.

Mrjob can be installed with pip:

pip install mrjob

331 questions

1 answer

How do you point mrjob EMR to the right AWS account? I keep getting a ssh key invalid message

I have set .mrjob.conf like this (passwords changed): runners: emr: aws_access_key_id: JKDJKAJSLKJAFKLJ aws_secret_access_key: RKLJDKAS/KLASJKFJKSJAKSALLKLKS ec2_key_pair: me-east ec2_key_pair_file: /Users/me/.ssh/me-east.pem …

python amazon-web-services amazon-emr mrjob

asked Jul 27 '18 at 18:51

TheSneak

1 answer

Python3 MRJob outputs unsorted key-value pairs

Context: Python 3.6.3 :: Anaconda custom (64-bit); mrjob==0.6.2 with no custom configuration; running locally. I am implementing the basic word count example for a local map reduce job. My mapper maps a 1 to each word in each line of a book from a .txt…

python python-3.x mapreduce mrjob

asked May 09 '18 at 01:44

ekauffmann

0 answers

Running Mr. Job in Hadoop mode. "Error launching job , bad input path : File does not exist:"

I'm running Apache Hadoop 3.1.0 in pseudo-distributed mode using the default configurations from the Wiki. I've created a simple python program that counts the article tag in the dblp.xml file posted below from mrjob.job import MRJob import…

python apache hadoop mrjob

asked Apr 16 '18 at 02:04

bnguyen1994

1 answer

Python Map Reduce Mr job

I am new to Python programming, so excuse me in advance if I ask something that is easily solved. I want to use MapReduce to process a CSV file that has some values, and the output must return the maximum value. This is the script I've written so…

python hadoop mapreduce max mrjob

asked Mar 16 '18 at 15:55

Kyr

0 answers

python mapreduce mrjob max value

Given a CSV file where each line contains a set of numbers, I want to write a map reduce program that determines the maximum of all numbers in the file. Let's say the CSV file is 3,4 5,6; the script should return 6. from mrjob.job import…

python mapreduce mrjob

asked Mar 12 '18 at 17:21

Kyr

1 answer

How to implement mapreduce pairs pattern in python

I am trying to implement the MapReduce pairs pattern in Python. I need to check if a word is in a text file, then find the word next to it and yield a pair of both words. I keep running into either: neighbors = words[words.index(w) + 1] ValueError:…

python mapreduce mrjob

asked Dec 12 '17 at 12:36

Jackob

0 answers

Python mrjob running locally but missing lots of data

I can run Python mrjob locally and it's much faster. But when I look at the output, a lot of data is missing. I'm wondering whether this is because a function in my code takes longer to run, and therefore all…

python mapreduce missing-data mrjob

asked Nov 19 '17 at 00:09

Cherry Wu

1 answer

Multiple input files for each mapper 'type'

I am trying to run a job where each mapper 'type' receives a different input file. I know there is a way to do this in Java using the MultipleInputs class, like so: MultipleInputs.addInputPath(job,new…

java python hadoop mapreduce mrjob

asked Sep 25 '17 at 17:39

Rohin Gopalakrishnan

2 answers

what is the location of mrjob.conf file?

My mrjob job with Hadoop Streaming fails. I have a Hadoop sandbox on an Oracle VM with the Python module mrjob. I need to make some changes in mrjob.conf as suggested in Hadoop Error: Error launching job , bad input path : File does not exist.Streaming Command…

python hadoop virtual-machine mrjob

asked Sep 21 '17 at 07:10

Namrata Tolani

1 answer

Python: Can't convert 'bytes' object to str implicitly

Here's my code: class ReviewCategoryClassifier(object): @classmethod def load_data(cls, input_file): job = category_predictor.CategoryPredictor() category_counts = None word_counts = {} with…

python json class byte mrjob

asked May 16 '17 at 02:03

Candice Zhang

0 answers

Amazon EMR streaming Could not find any valid local directory for output

I am getting the following logs (at the bottom of the question) in the stderr of a failed EMR process. Can someone explain what is going on? I cannot make sense of the traceback. What is the solution? I am using the Python mrjob framework to run the streaming…

hadoop emr hadoop-streaming amazon-emr mrjob

asked Mar 15 '17 at 13:23

shreyas

1 answer

MRJob: Finding the length of values for reducer

I wrote a program based on MapReduce using MRJob. I have a question about the parameters of the reducer. As you know, the reducer function takes two parameters, key and values. I want to find the length of values without writing any loop condition…

python-2.7 mapreduce generator mrjob

asked Feb 06 '17 at 19:11

ugur

2 answers

Map/reduce two-stage ordering of counts

This Python 3 program attempts to produce a frequency list of words from a text file using map/reduce. I would like to know how to order the word counts, represented as 'count' in the second reducer's yield statement, so that the largest count values…

python hadoop mrjob

asked Feb 04 '17 at 02:54

Rick Lentz

0 answers

no matches found when accessing multiple files for hadoop job on EMR

I'm trying to run a Hadoop job on AWS EMR that I execute locally using Python on files in S3. I cannot seem to access multiple files using *. I want to be able to access all files from the folder 01 on. This code works on all files in…

python hadoop amazon-s3 mrjob

asked Feb 03 '17 at 00:15

Conor B Murphy

1 answer

MapReduce: ValueError: too many values to unpack (expected 2)

I'm running the following Python code in MapReduce: from mrjob.job import MRJob import collections bigram = collections.defaultdict(float) unigram = collections.defaultdict(float) class MRWordFreqCount(MRJob): def mapper(self, _, line): …

python hadoop mapreduce mrjob

asked Dec 03 '16 at 18:50

Reddspark
