Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that assists in creating and running Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
331 questions
1
vote
0 answers

Load Pickle generated with mrjob

I've built a statistical model (LogisticRegression from sklearn) with mrjob and saved it to disk via the PickleValueProtocol and piping the output to a file. Now when loading the pickled file from another python script, I get the following…
vkoe
  • 381
  • 4
  • 12
1
vote
0 answers

What influences the sort order of mrjob output?

I have a project based on mrjob, with automated tests. One test runs mrjob locally against known input, and asserts the actual output matches expected output. The issue is that the test passes in the development environment, but fails in continuous…
ppbitb
  • 519
  • 1
  • 7
  • 19
1
vote
0 answers

set up mrjob with Hadoop failed with the error "returned non-zero exit status 256"

I am new to mrjob and Hadoop. After building my Hadoop cluster, I tried to use mrjob to submit a job to Hadoop, but unfortunately it failed with the error "returned non-zero exit status 256". More details follow: 1. This is my example: from…
wsshopping
  • 11
  • 6
1
vote
1 answer

Amazon EMR + mrjob: bootstrap error, "bootstrap action 1 returned a non-zero return code"

I am trying to run an mrjob on Amazon's EMR using ec2 instances. It was working until I realized I was using python packages (mechanize, BeautifulSoup, boto). So, I added to my mrjob.conf file, but now I keep getting this error: No handlers could be…
1
vote
1 answer

MRJob - Limit Number of Task Attempts

In MRJob, how do you limit the number of task attempts (if a task fails)? I have long-running tasks (and have increased the timeout accordingly), but I want the job to end after 2 failed attempts at the same task, rather than 4-5. I couldn't find…
okoboko
  • 4,332
  • 8
  • 40
  • 67
1
vote
1 answer

What does reduce() do without mapper() in MRJob?

I am new to Python and trying to build a recommendation system following the instructions at http://www.yekeren.com/blog/archives/1005. What confuses me is this: def reducer3_init(self): self.pop = { } file =…
1
vote
1 answer

mrjob virtualenv error in Hadoop cluster: Permission denied

I work at a large corporate organization where we have a Hadoop cluster. I got the admin to install virtualenv on all the Hadoop worker nodes so that I can submit mrjobs with standard Python dependencies that may not exist on the worker nodes. As…
abhinavkulkarni
  • 2,284
  • 4
  • 36
  • 54
1
vote
1 answer

With MapReduce is it guaranteed that ALL values with the same key will go to the same reducer?

I have a MapReduce project I am working on (specifically I am using Python and the library MrJob and plan on running using Amazon's EMR). Here is an example to sum up the issue I am having: I have thousands of GB of json files full of customer data.…
Brad Barrows
  • 1,633
  • 1
  • 13
  • 12
1
vote
0 answers

How to specify different AWS credentials for EMR and S3 when using MRJob

I can specify what AWS credentials to use to create an EMR cluster via environment variables. However, I would like to run a mapreduce job on another AWS user's S3 bucket for which they gave me a different set of AWS credentials. Does MRJob provide…
Razzi Abuissa
  • 3,337
  • 2
  • 28
  • 29
1
vote
1 answer

Using a combiner in hadoop streaming mapreduce (using mrjob)

When I was taught about mapreduce one of the key components was the combiner. It is a step between the mapper and the reducer which essentially runs the reducer at the end of the map phase in order to decrease the number of lines of data that the…
Narek
  • 548
  • 6
  • 26
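
The combiner step described in the question above can be illustrated with a small pure-Python simulation. This is only a sketch of the idea, not mrjob's or Hadoop's API: the names `mapper`, `combine`, `reducer`, `run_job`, and `SPLITS` are hypothetical, and each inner list in `SPLITS` stands in for the input of one map task. It counts how many records would cross the shuffle with and without a combiner.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def combine(pairs):
    """Pre-aggregate one map task's output: sum the counts per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def reducer(key, values):
    """Sum all counts that arrived for a single key."""
    return key, sum(values)


def run_job(splits, use_combiner):
    """Simulate map -> (combine) -> shuffle -> reduce.

    Returns the final counts and the number of records shuffled.
    """
    shuffled = []
    for split in splits:  # each split plays the role of one map task
        pairs = [pair for line in split for pair in mapper(line)]
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)

    # Shuffle: group every value under its key before reducing.
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)

    result = dict(reducer(key, values) for key, values in grouped.items())
    return result, len(shuffled)


# Hypothetical input: two map tasks with two lines each.
SPLITS = [
    ["a b a", "b a"],
    ["a c", "c c"],
]

plain_counts, plain_records = run_job(SPLITS, use_combiner=False)
combined_counts, combined_records = run_job(SPLITS, use_combiner=True)

print(plain_counts == combined_counts)  # same final answer either way
print(plain_records, combined_records)  # fewer records with the combiner
```

The final counts are identical in both runs; only the number of intermediate records changes. This works here because summation is associative and commutative, so partial sums can be safely summed again in the reducer.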
1
vote
1 answer

how to run mrjob on EMR

I tried to run mapreduce by following this tutorial. I uploaded the files mrjob.conf, readme.txt and word_count.py to an EC2 instance in the folder ~/hello_mapreduce and tried to run the command: python word_count.py -r emr README.txt which returned…
Niko Gamulin
  • 66,025
  • 95
  • 221
  • 286
1
vote
1 answer

Decompress + un-tar input files during mrjob execution

I would like to process lots of data in S3 efficiently with mrjob (using EMR). I can structure the data any way I would like, but clearly I would like to do everything I can to play to the strengths of having EMR run on S3 data. My data consists of…
user2013116
  • 31
  • 1
  • 3
1
vote
1 answer

How do you filter s3 files before sending input to mrjob mapper?

I'm trying to MapReduce logs, and I'd like to filter all logs in a bucket by filename before processing them in EMR. Also, some files are tar archives, and I'd like mrjob to uncompress them, then filter the files in them to only parse the relevant…
Adrien Lemaire
  • 1,744
  • 2
  • 20
  • 29
1
vote
2 answers

Bootstrapping libraries on EMR using python MRJob

Problem Statement: I am trying to run a map-reduce job in Amazon EMR using python MRJob library, and I am having trouble with bootstrapping the nodes with the requisite libraries and packages. Details: my sample python mrjob code: import re …
Shreyas
  • 367
  • 3
  • 9
1
vote
1 answer

Is it possible to process multi-line records using Hadoop Streaming?

I have records like this: Name: Alan Kay Email: Alan.Kay@url.com Date: 09-09-2013 Name: Marvin Minsky Email: Marvin.Minsky@url.com City: Boston, MA Date: 09-10-2013 Name: Alan Turing City: New York City, NY Date: 09-10-2013 They're multiline but…
duber
  • 2,769
  • 4
  • 24
  • 32