Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that assists with the creation and running of Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
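
For readers new to the model, here is a minimal plain-Python sketch of the map, group-by-key, reduce flow that a Hadoop Streaming job follows. It deliberately does not use the mrjob API; the names `mapper`, `reducer`, and `run_job` are hypothetical helpers chosen for illustration, and the in-memory grouping only stands in for Hadoop's shuffle and sort.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every whitespace-separated word in a line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Sum all the counts that were grouped under one word."""
    yield word, sum(counts)


def run_job(lines):
    """Simulate map -> group by key -> reduce in memory (illustration only)."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)

    results = {}
    # Keys are visited in sorted order, loosely mirroring the sort phase.
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    print(run_job(["the quick brown fox", "the lazy dog"]))
```

In a real job, the grouping and sorting are done by Hadoop between the map and reduce phases rather than by a dictionary in one process.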

331 questions

1 answer

Is it possible to add additional input to a later step of an mrjob?

I have an mrjob that consists of 3 steps. The second step expects as input the results of the first step plus some more content from S3. I understand that I can always "stream" it through the first step, meaning emit it as is, and only use it in the…

emr mrjob

asked Apr 01 '14 at 15:05

Eleni

1 answer

Change Mapreduce intermediate output location using MRJob

I am trying to run a python script using MRJob on a cluster in which I don't have admin permissions and I got the error pasted below. What I think is happening is that the job is trying to write the intermediate files to the default /tmp.... dir and…

python hadoop mapreduce hadoop-streaming mrjob

asked Dec 15 '13 at 01:03

anonuser0428

1 answer

MRJob error while running on hadoop cluster

I am trying to run a python job using a hadoop cluster and MRJob and my wrapper script is as follows: #!/bin/bash . /etc/profile module load use.own module load python/python2.7 module load python/mrjob python…

python hadoop cluster-computing hadoop-streaming mrjob

asked Dec 14 '13 at 23:25

anonuser0428

1 answer

Map-Reduce/Hadoop sort by integer value (using MRJob)

This is an MRJob implementation of a simple Map-Reduce sorting functionality. In beta.py: from mrjob.job import MRJob class Beta(MRJob): def mapper(self, _, line): """ """ l = line.split(' ') yield l[1], l[0] …

python sorting hadoop mapreduce mrjob

asked Nov 23 '13 at 00:15

p0lAris

2 answers

Elastic Map Reduce Error

I am getting an error when using Elastic Map Reduce and I am not sure what it means because it is not very descriptive. I want to know specifically what kind of JSONDecodeError I am getting. "12" is not descriptive. This is the output. I am using…

python json elastic-map-reduce mrjob

asked Jul 16 '13 at 17:20

user1011332

2 answers

Share specific data between each mapper

I would like to add a specific subset of records to be merged with each chunk of records at each mapper. How can I do this in Hadoop generally, and in the Python streaming package mrjob?

python hadoop mapreduce hadoop-streaming mrjob

asked Jun 06 '13 at 14:49

Ahmed Elmorsy

1 answer

Input file for local MRJobs

I am learning/testing mrjobs on my laptop, using the wordcount example. I am able to provide a local file as input in command mode but don't know how to do the same thing from within the python script. Greatly appreciate a simple…

mrjob

asked May 31 '13 at 08:50

akrishnamo

1 answer

How to optimize this MapReduce function, Python, mrjob

I'm very new to Map/Reduce principles and the Python mrjob framework. I wrote this sample code, and it works fine, but I would like to know what I can change in it to make it "perfect" / more efficient. from mrjob.job import MRJob import operator import…

python hadoop mapreduce mrjob

asked Apr 05 '13 at 20:29

Vor

1 answer

How to run a final 'print' statement once in a multi-step map-reduce program?

I am basically trying to implement a recommender system by scaling it up on Hadoop. In the first step, I am trying to calculate the similarity between every pair of items in the input file. If I store it simply as {Item A,Item B,Similarity} the…

python hadoop mapreduce collaborative-filtering mrjob

asked Mar 05 '13 at 13:35

Atanu

2 answers

How to calculate correlation between two variables in python using MapReduce

I am trying to use the Million Song Dataset available on AWS to find the correlation between the loudness of a track and its popularity. I followed a basic tutorial (http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/)…

python amazon-web-services mapreduce bigdata mrjob

asked Feb 18 '13 at 17:22

Prajoth

1 answer

How to change environment variables in mrjob for AWS accesskey and secretaccesskey

How do I change the $AWS_ACCESS_KEY_ID and $AWS_SECRET_ACCESS_KEY in mrjob to enter my own credentials for AWS? I am using the Terminal on Mac OS X. https://github.com/Yelp/mrjob Thanks!

python hadoop amazon-ec2 mapreduce mrjob

asked Feb 17 '13 at 17:15

Prajoth

1 answer

MapReduce: Mrjob Saving Results Persistently

I am trying to implement a mapreduce job with three steps and after each step I need the data from all of the steps so far. Does anyone have an example/idea about how I can save results of mapper or reducers to disk in mrjob?

python mapreduce mrjob

asked Feb 06 '13 at 13:51

MHardy

0 answers

random java.io.FileNotFoundException jobcache error on EMR with MrJob

I am using MrJob and trying to run a Hadoop job on Elastic Map Reduce which keeps crashing at random. The data looks like this (tab separated): 279391888 261151291 107.303163 35.468534 279391888 261115099 108.511726 …

python hadoop emr mrjob

asked Jan 22 '13 at 03:16

Max Shron

1 answer

Python: Increasing timeout value in EMR using Yelp's MRJob

I am using Yelp's MRJob to write some MapReduce programs. I am running it on EMR. My program has reducer code which takes a long time to execute. I am noticing that because of the default timeout period in EMR I am getting this error…

python hadoop mapreduce elastic-map-reduce mrjob

asked Jan 17 '13 at 15:25

Read Q

2 answers

How to bundle custom hadoop-streaming.jar

I'm trying to use the CombineFileInputFormat class using Yelp's MrJob tool for EMR. The jobflow is created using hadoop streaming, and MrJob's documentation indicates the CombineFileInputFormat class must be bundled in a customized…

java hadoop streaming mrjob

asked Jan 11 '13 at 20:33

vladimir montealegre
