Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
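
To give a feel for the kind of job mrjob wraps, here is a minimal plain-Python sketch of the map, sort-by-key, reduce flow that a Hadoop Streaming job follows. It deliberately does not import mrjob, so it runs without the package or a cluster. The names `mapper`, `reducer`, and `run_local` are illustrative assumptions, not part of mrjob's API, although the `(key, value)` generator style mirrors how mrjob mappers and reducers are usually written.

```python
from itertools import groupby
from operator import itemgetter


def mapper(_, line):
    """Emit (word, 1) for each whitespace-separated word in a line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the counts collected for one word."""
    yield word, sum(counts)


def run_local(lines):
    """Hypothetical helper: imitate map -> sort by key -> reduce in one process."""
    mapped = [pair for line in lines for pair in mapper(None, line)]
    mapped.sort(key=itemgetter(0))  # stands in for Hadoop's shuffle/sort phase
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        # Each reducer call sees one key and an iterator over its values.
        results.extend(reducer(key, (value for _, value in group)))
    return results


if __name__ == "__main__":
    for word, count in run_local(["the quick fox", "the lazy dog"]):
        print(word, count)
```

In a real mrjob job the same two functions would typically become methods on a class that subclasses `mrjob.job.MRJob`, and mrjob would take care of running them locally, on a Hadoop cluster, or on EMR.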
331 questions
1
vote
1 answer

Is it possible to add additional input to a later step of an mrjob?

I have an mrjob that consists of 3 steps. The second step expects as input the results of the first step plus some more content from S3. I understand that I can always "stream" it through the first step, meaning emit it as is, and only use it in the…
Eleni
  • 645
  • 6
  • 19
1
vote
1 answer

Change Mapreduce intermediate output location using MRJob

I am trying to run a python script using MRJob on a cluster in which I don't have admin permissions and I got the error pasted below. What I think is happening is that the job is trying to write the intermediate files to the default /tmp.... dir and…
anonuser0428
  • 11,789
  • 22
  • 63
  • 86
1
vote
1 answer

MRJob error while running on hadoop cluster

I am trying to run a python job using a hadoop cluster and MRJob and my wrapper script is as follows: #!/bin/bash . /etc/profile module load use.own module load python/python2.7 module load python/mrjob python…
anonuser0428
  • 11,789
  • 22
  • 63
  • 86
1
vote
1 answer

Map-Reduce/Hadoop sort by integer value (using MRJob)

This is an MRJob implementation of a simple Map-Reduce sorting functionality. In beta.py: from mrjob.job import MRJob class Beta(MRJob): def mapper(self, _, line): """ """ l = line.split(' ') yield l[1], l[0] …
p0lAris
  • 4,750
  • 8
  • 45
  • 80
1
vote
2 answers

Elastic Map Reduce Error

I am getting an error when using Elastic Map Reduce and I am not sure what it means because it is not very descriptive. I want to know specifically what kind of JSONDecodeError I am getting. "12" is not descriptive. This is the output. I am using…
user1011332
  • 773
  • 12
  • 27
1
vote
2 answers

Share specific data between each mapper

I would like to merge a specific subset of records with each chunk of records at each mapper. How can I do this in Hadoop generally, and in the Python streaming package mrjob?
Ahmed Elmorsy
  • 564
  • 3
  • 8
  • 18
1
vote
1 answer

Input file for local MRJobs

I am learning/testing mrjobs on my laptop, using the wordcount example. I am able to provide a local file as input in command mode but don't know how to do the same thing from within the python script. Greatly appreciate a simple…
akrishnamo
  • 449
  • 1
  • 3
  • 15
1
vote
1 answer

How to optimize this MapReduce function, Python, mrjob

I'm very new to Map/Reduce principles and the Python mrjob framework. I wrote this sample code and it works fine, but I would like to know what I can change in it to make it "perfect" / more efficient. from mrjob.job import MRJob import operator import…
Vor
  • 33,215
  • 43
  • 135
  • 193
1
vote
1 answer

How to run a final 'print' statement once in a multi-step map-reduce program?

I am basically trying to implement a recommender system by scaling it up on Hadoop. In the first step, I am trying to calculate the similarity between every pair of items in the input file. If I store it simply as {Item A,Item B,Similarity} the…
Atanu
  • 61
  • 2
  • 12
1
vote
2 answers

How to calculate correlation between two variables in python using MapReduce

I am trying to use the Million Song Dataset available on AWS to find the correlation between the loudness of a track and its popularity. I followed a basic tutorial (http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/)…
Prajoth
  • 900
  • 3
  • 12
  • 26
1
vote
1 answer

How to change environment variables in mrjob for AWS accesskey and secretaccesskey

How do I change the $AWS_ACCESS_KEY_ID and $AWS_SECRET_ACCESS_KEY in mrjob to enter my own credentials for AWS? I am using the Terminal on Mac OS X. https://github.com/Yelp/mrjob Thanks!
Prajoth
  • 900
  • 3
  • 12
  • 26
1
vote
1 answer

MapReduce: Mrjob Saving Results Persistently

I am trying to implement a mapreduce job with three steps and after each step I need the data from all of the steps so far. Does anyone have an example/idea about how I can save results of mapper or reducers to disk in mrjob?
MHardy
  • 491
  • 3
  • 7
  • 17
1
vote
0 answers

random java.io.FileNotFoundException jobcache error on EMR with MrJob

I am using MrJob and trying to run a Hadoop job on Elastic Map Reduce which keeps crashing at random. The data looks like this (tab separated): 279391888 261151291 107.303163 35.468534 279391888 261115099 108.511726 …
Max Shron
  • 946
  • 7
  • 6
1
vote
1 answer

Python: Increasing timeout value in EMR using Yelp's MRJob

I am using Yelp's MRJob for writing some MapReduce programs. I am running it on EMR. My program has reducer code which takes a long time to execute. I am noticing that because of the default timeout period in EMR I am getting this error…
Read Q
  • 1,405
  • 2
  • 14
  • 26
1
vote
2 answers

How to bundle custom hadoop-streaming.jar

I'm trying to use the CombineFileInputFormat class using Yelp's MrJob tool for EMR. The jobflow is created using hadoop streaming, and MrJob's documentation indicates the CombineFileInputFormat class must be bundled in a customized…
vladimir montealegre
  • 2,010
  • 2
  • 15
  • 17