Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that assists in creating and running Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob

331 questions

0 answers

Load Pickle generated with mrjob

I've built a statistical model (LogisticRegression from sklearn) with mrjob and saved it to disk via the PickleValueProtocol and piping the output to a file. Now when loading the pickled file from another python script, I get the following…

python scikit-learn pickle mrjob

asked Aug 13 '15 at 13:30

vkoe

0 answers

What influences the sort order of mrjob output?

I have a project based on mrjob, with automated tests. One test runs mrjob locally against known input and asserts that the actual output matches the expected output. The issue is that the test passes in the development environment, but fails in continuous…

python mrjob

asked Aug 10 '15 at 21:46

ppbitb
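
One common way to keep a test like this independent of output ordering is to compare the output as a multiset of lines rather than as an ordered sequence. The following is a minimal sketch, assuming the job output is available as a list of text lines; `assert_same_records` is a hypothetical helper and not part of mrjob. Whether this addresses the failure depends on what actually differs between the two environments, which the excerpt does not say.

```python
from collections import Counter


def assert_same_records(actual_lines, expected_lines):
    """Compare two outputs while ignoring the order of the lines.

    Hypothetical test helper (not part of mrjob). Counter keeps
    duplicate lines, so repeated records are still checked.
    """
    actual = Counter(line.rstrip("\n") for line in actual_lines)
    expected = Counter(line.rstrip("\n") for line in expected_lines)
    assert actual == expected, "missing: %r, unexpected: %r" % (
        sorted((expected - actual).elements()),
        sorted((actual - expected).elements()),
    )
```

If the order itself is part of what the test should guarantee, sorting the job output explicitly before writing it would be the alternative design.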

0 answers

Setting up mrjob with Hadoop fails with the error "returned non-zero exit status 256"

I am new to mrjob and Hadoop. After building my Hadoop cluster, I tried to use mrjob to submit a job to Hadoop, but unfortunately it failed with the error "returned non-zero exit status 256". More details follow: 1. This is my example: from…

python hadoop mrjob

asked Jul 23 '15 at 03:18

wsshopping

1 answer

Amazon EMR + mrjob: bootstrap error, "bootstrap action 1 returned a non-zero return code"

I am trying to run an mrjob on Amazon's EMR using EC2 instances. It was working until I realized I was using Python packages (mechanize, BeautifulSoup, boto). So, I added to my mrjob.conf file, but now I keep getting this error: No handlers could be…

amazon-ec2 emr bootstrapping mrjob

asked Jul 01 '15 at 16:48

William Dewayne Corrin III

1 answer

MRJob - Limit Number of Task Attempts

In MRJob, how do you limit the number of task attempts (if a task fails)? I have long-running tasks (and have increased the timeout accordingly), but I want the job to end after 2 failed attempts at the same task, rather than 4-5. I couldn't find…

emr mrjob

asked Jun 19 '15 at 21:57

okoboko

1 answer

What does reduce() do without mapper() in MRJob?

I am new to Python and am trying to build a recommendation system following the instructions at http://www.yekeren.com/blog/archives/1005. What confuses me is this: def reducer3_init(self): self.pop = { } file =…

python-2.7 hadoop mrjob

asked Apr 26 '15 at 07:36

danielsonzz

1 answer

mrjob virtualenv error in Hadoop cluster: Permission denied

I work at a large corporate organization where we have a Hadoop cluster. I got the admin to install virtualenv on all the Hadoop worker nodes so that I can submit mrjobs with standard Python dependencies that may not exist on the worker nodes. As…

python hadoop pip virtualenv mrjob

asked Apr 10 '15 at 23:50

abhinavkulkarni

1 answer

With MapReduce is it guaranteed that ALL values with the same key will go to the same reducer?

I have a MapReduce project I am working on (specifically, I am using Python and the library MrJob, and plan on running it on Amazon's EMR). Here is an example to sum up the issue I am having: I have thousands of GB of JSON files full of customer data.…

python hadoop mapreduce bigdata mrjob

asked Feb 17 '15 at 06:51

Brad Barrows
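
For context on this question: Hadoop's default partitioner picks the reducer for each record from a hash of its key, so all values emitted under one key end up in the same partition. The sketch below imitates that idea in plain Python. The CRC32 hash, the function names, and the sample data are assumptions made for illustration; Hadoop itself uses the key's own hash code.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer index from a deterministic hash of the key.

    Illustration only: Hadoop uses the key's own hash code, not CRC32.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Group (key, value) pairs by the reducer that would receive them."""
    reducers = defaultdict(lambda: defaultdict(list))
    for key, value in pairs:
        reducers[partition(key, num_reducers)][key].append(value)
    return reducers


# Hypothetical mapper output: (customer_id, purchase_amount)
pairs = [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 1), ("bob", 2)]
reducers = shuffle(pairs, num_reducers=3)
```

Because `partition` depends only on the key, both `"alice"` records are routed to the same reducer index no matter which mapper emitted them. Note that the grouping applies to whatever the mapper emits as the key, so records can only be brought together if the mapper gives them identical keys.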

0 answers

How to specify different AWS credentials for EMR and S3 when using MRJob

I can specify what AWS credentials to use to create an EMR cluster via environment variables. However, I would like to run a mapreduce job on another AWS user's S3 bucket for which they gave me a different set of AWS credentials. Does MRJob provide…

mrjob

asked Nov 13 '14 at 08:56

Razzi Abuissa

1 answer

Using a combiner in hadoop streaming mapreduce (using mrjob)

When I was taught about mapreduce one of the key components was the combiner. It is a step between the mapper and the reducer which essentially runs the reducer at the end of the map phase in order to decrease the number of lines of data that the…

hadoop mapreduce hadoop-streaming mrjob

asked Sep 03 '14 at 04:52

Narek
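
The effect of a combiner described in this excerpt can be sketched in plain Python, without Hadoop or mrjob. The example simulates two map tasks for a word count, pre-sums each task's output locally, and compares how many records would be sent on to the reducers. All function names and the input splits are assumptions made for illustration. In mrjob itself, a job class can define a `combiner` method alongside `mapper` and `reducer`.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word, 1


def combine(pairs):
    """Pre-sum counts per key within a single map task's output."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())


def reduce_all(pairs):
    """Sum counts per key across all map tasks."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Two hypothetical input splits, each handled by its own map task.
splits = [["a b a", "a c"], ["b b", "a"]]

raw = [[pair for line in split for pair in mapper(line)] for split in splits]
combined = [combine(task_output) for task_output in raw]

records_without_combiner = sum(len(task) for task in raw)
records_with_combiner = sum(len(task) for task in combined)

result = reduce_all(pair for task in combined for pair in task)
```

The final counts are the same with or without the combining step; only the number of intermediate records changes. This works here because addition is associative and commutative, which is the usual condition for reusing reducer logic as a combiner.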

1 answer

How to run mrjob on EMR

I tried to run MapReduce by following this tutorial. I uploaded the files mrjob.conf, readme.txt and word_count.py to an EC2 instance in the folder ~/hello_mapreduce and tried to run the command: python word_count.py -r emr README.txt which returned…

amazon-web-services mapreduce mrjob

asked Aug 09 '14 at 12:32

Niko Gamulin

1 answer

Decompress + un-tar input files during mrjob execution

I would like to process lots of data in S3 efficiently with mrjob (using EMR). I can structure the data any way I would like, but clearly I would like to do everything I can to play to the strengths of having EMR run on S3 data. My data consists of…

python amazon-web-services amazon-s3 emr mrjob

asked Aug 07 '14 at 21:29

user2013116

1 answer

How do you filter s3 files before sending input to mrjob mapper?

I'm trying to MapReduce logs, and I'd like to filter all logs in a bucket by filename before processing them in EMR. Also, some files are tar archives, and I'd like mrjob to uncompress them, then filter the files inside to only parse the relevant…

python amazon-s3 mapreduce emr mrjob

asked Jun 11 '14 at 23:25

Adrien Lemaire

2 answers

Bootstrapping libraries on EMR using python MRJob

Problem Statement: I am trying to run a map-reduce job in Amazon EMR using the Python MRJob library, and I am having trouble bootstrapping the nodes with the requisite libraries and packages. Details: my sample Python mrjob code: import re …

python hadoop nltk emr mrjob

asked May 03 '14 at 05:15

Shreyas

1 answer

Is it possible to process multi-line records using Hadoop Streaming?

I have records like this: Name: Alan Kay Email: Alan.Kay@url.com Date: 09-09-2013 Name: Marvin Minsky Email: Marvin.Minsky@url.com City: Boston, MA Date: 09-10-2013 Name: Alan Turing City: New York City, NY Date: 09-10-2013 They're multiline but…

java hadoop multiline hadoop-streaming mrjob

asked Apr 08 '14 at 13:08

duber
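
By default, Hadoop Streaming hands a mapper its input one line at a time, so records like these have to be reassembled somewhere. The sketch below shows one way to group such lines into records in plain Python, for example as a preprocessing step. It assumes that each field sits on its own line and that every record begins with a `Name:` line; the excerpt is flattened and truncated, so both are guesses, and `parse_records` is a hypothetical helper.

```python
def parse_records(lines, start_field="Name"):
    """Group 'Field: value' lines into dicts, one per record.

    Assumption: every record begins with a `start_field` line.
    """
    record = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip(), value.strip()
        # A new start field closes the record collected so far.
        if field == start_field and record is not None:
            yield record
            record = None
        if record is None:
            record = {}
        record[field] = value
    if record is not None:
        yield record


# Line breaks here are assumed; the excerpt shows the data on one line.
sample = """\
Name: Alan Kay
Email: Alan.Kay@url.com
Date: 09-09-2013
Name: Marvin Minsky
Email: Marvin.Minsky@url.com
City: Boston, MA
Date: 09-10-2013
Name: Alan Turing
City: New York City, NY
Date: 09-10-2013
""".splitlines()

records = list(parse_records(sample))
```

A parser like this only works when a whole record is visible to one process. If an input split can cut a record in half, rewriting each record onto a single line before the job runs is the simpler route.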
