Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob

331 questions

0 answers

mrjob combiner not working python

A simple map-combine-reduce program: map column 1 to the value in column 3, append '+' to each mapper output for the same key, and append '-' to the reduce output for the same key. Both input files, input_1 and input_2, contain: a 1 2 3 a 4 5 6. Code is from mrjob.job…

python mapreduce mrjob

asked Dec 13 '16 at 09:58

piyush-balwani

2 answers

MapReduce job to yield top 10 values using Python's MRjob

I want this map reduce job (code below) to output the top 10 most rated products. It keeps giving me the following error message: it = izip(iterable, count(0,-1)) # decorate TypeError: izip argument #1 must support iteration. I'm…

python mapreduce mrjob

asked Nov 29 '16 at 16:26

Ije
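
The quoted line (`izip(iterable, count(0,-1))  # decorate`) appears to come from Python 2's `heapq.nlargest`, which fails this way when its second argument is not iterable. The following is a minimal sketch of a top-N selection in plain Python, not the asker's job: the `top_n_products` helper and the sample data are hypothetical, and the same idea would need adapting to a reducer that receives all counts under a single key.

```python
import heapq
from collections import Counter


def top_n_products(ratings, n=10):
    """Return the n most-rated products as (count, product_id) pairs.

    `ratings` is an iterable of (product_id, rating) pairs (hypothetical shape).
    """
    counts = Counter(product_id for product_id, _rating in ratings)
    # heapq.nlargest needs an iterable as its second argument; passing a
    # single non-iterable value (for example one integer) raises a TypeError.
    return heapq.nlargest(n, ((count, pid) for pid, count in counts.items()))


# Hypothetical sample data: (product_id, rating)
sample = [("p1", 5), ("p2", 3), ("p1", 4), ("p3", 1), ("p1", 2), ("p2", 5)]
print(top_n_products(sample, n=2))
```

Sorting on `(count, product_id)` tuples keeps the product id attached to its count, so no second lookup is needed after selection.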

0 answers

Load JSON into MrJob - Python

I've been trying to load a JSON data file into mrjob, but can't really get it to work. from mrjob.job import MRJob from mrjob.protocol import JSONProtocol def type_hashing(entry): return entry[13].lower() class ReduceData(MRJob): …

python json mapreduce mrjob

asked Nov 18 '16 at 14:25

Syspect

2 answers

Utilize multi-core with LocalMRJobRunner for MRJob

I am using the Python yelp/mrjob framework for my MapReduce jobs. There are only about 4G of data and I don't want to go through the trouble of setting up Hadoop or EMR. I have a 64-core machine and it takes about 2 hours to process the data with…

python mrjob

asked Jul 02 '15 at 10:05

Andy

1 answer

cannot run python MRJob locally

If I understand MRJob correctly, you can simulate Hadoop's multi-process run using MRJob by running it with python mrfile.py -r local input.txt. I'm running Windows (no choice for now), and when I issue the above command, I'm getting a bunch of mambo…

python mrjob

asked Jul 01 '15 at 21:22

user2773013

0 answers

mrjob hanging forever when running in hadoop

I am running the tutorial in the docs and the word count works for local files, but when I try python mr.py -r hadoop 1.txt, it hangs. When I interrupt it from the keyboard, the log is: no configs found; falling back on auto-configuration no…

python mrjob

asked May 08 '15 at 19:56

kevin ding

2 answers

Is it possible to use a Conda environment as "virtualenv" for a Hadoop Streaming Job (in Python)?

We are currently using Luigi, MRJob and other frameworks to run Hadoop streaming jobs using Python. We are already able to ship the jobs with their own virtualenv so no specific Python dependencies are installed in the nodes (see the article). I was…

python hadoop anaconda mrjob

asked Apr 23 '15 at 14:53

mfcabrera

1 answer

Where does sys.stdout.write() go to in MRJOB mapper?

mrjob.conf

runners:
  emr:
    aws_access_key_id: **
    aws_secret_access_key: **
    aws_region: us-east-1
    aws_availability_zone: us-east-1a
    ec2_key_pair: scrapers2
    ec2_key_pair_file: ~/arachnid.pem
    ec2_instance_type: c3.8xlarge
    …

python emr mrjob

asked Apr 02 '15 at 19:48

birdnerd

1 answer

Can I use mrjob python library on partitioned hive tables?

I have user access to a Hadoop server/cluster containing data that is stored solely in partitioned tables/files in Hive (Avro). I was wondering if I can perform MapReduce using Python mrjob on these tables? So far I have been testing mrjob locally…

python hadoop streaming hive mrjob

asked Sep 17 '14 at 11:57

Tomasz Sosiński

0 answers

kmeans based on mapreduce by python

I am going to write a mapper and reducer for the k-means algorithm. I think the best course of action is to put the distance calculation in the mapper and send its output to the reducer, with the cluster id as key and the coordinates of the row as value. In the reducer,…

python hadoop mrjob

asked Jun 10 '14 at 09:15

Amin Mohebi
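
The split described in this excerpt (distance calculation in the mapper, cluster id as key, row coordinates as value) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the excerpt is cut off before it says what the reducer does, so the averaging step here is the standard k-means centroid update rather than the asker's design, and `mapper`, `reducer`, `one_iteration` and the sample points are all hypothetical names and data, not mrjob APIs.

```python
import math
from collections import defaultdict


def mapper(point, centroids):
    """Yield (cluster_id, point) for the centroid nearest to `point`."""
    distances = [math.dist(point, c) for c in centroids]  # Python 3.8+
    yield distances.index(min(distances)), point


def reducer(cluster_id, points):
    """Yield (cluster_id, mean of points): the assumed centroid update."""
    points = list(points)
    dims = len(points[0])
    mean = tuple(sum(p[d] for p in points) / len(points) for d in range(dims))
    yield cluster_id, mean


def one_iteration(points, centroids):
    """Simulate one map/shuffle/reduce pass in memory."""
    groups = defaultdict(list)
    for point in points:
        for cluster_id, value in mapper(point, centroids):
            groups[cluster_id].append(value)  # shuffle: group by key
    updated = {}
    for cluster_id, values in groups.items():
        for key, mean in reducer(cluster_id, values):
            updated[key] = mean
    return updated


# Hypothetical sample data
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
print(one_iteration(points, centroids))
```

In a real job the current centroids would have to reach every mapper somehow (for example via a side file), and the pass would be repeated until the centroids stop moving; neither detail is covered by the excerpt.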

2 answers

python map reduce simple wordcount in cyrillic text

I'm trying to implement a very basic wordcount example with MRJob. Everything works fine with ASCII input, but when I mix Cyrillic words into the input, I get something like this as output: "\u043c\u0438\u0440" 1 "again!" 1 "hello" 2 "world"…

python cyrillic mrjob

asked Feb 22 '14 at 14:49

Anton
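
The escaped key in this excerpt looks like ordinary JSON string escaping rather than data corruption. The sketch below uses only the standard `json` module to reproduce the effect; that the asker's output was produced by a JSON-based output protocol is an assumption, and the variable names are illustrative.

```python
import json

word = "мир"

# By default json.dumps escapes every non-ASCII character as \uXXXX.
escaped = json.dumps(word)

# ensure_ascii=False keeps the original characters in the output.
readable = json.dumps(word, ensure_ascii=False)

# Either form decodes back to the same string, so no data is lost.
decoded = json.loads(escaped)

print(escaped)
print(readable)
print(decoded == word)
```

Any consumer that parses the output as JSON therefore sees the original Cyrillic text; only the raw bytes on disk look unfamiliar.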

1 answer

MRJob: socket.error: [Errno 104] Connection reset by peer

In short: "socket.error: [Errno 104] Connection reset by peer" exception while using MRJob. The script actually has access to S3 because it does create buckets and uploads some small files (I've checked manually via AWS console). But the largest…

python sockets amazon-s3 amazon-emr mrjob

asked Dec 16 '13 at 15:36

Spaceman

2 answers

Passing parameters to reducer in MRjob

I am using MRjob to run Hadoop Streaming jobs over our HBase instance. For the life of me I cannot figure out how to pass a parameter to my reducer. I have two parameters that I want to pass to my reducer from when I run the job: startDate and…

python mapreduce mrjob

asked Aug 01 '13 at 20:44

Sourav Dey

3 answers

How do I write the output of an EMR streaming job to HDFS?

I see examples of people writing EMR output to HDFS, but I haven't been able to find examples of how it's done. On top of that, this documentation seems to say that the --output parameter for an EMR streaming job must be an S3 bucket. When I…

python hadoop emr mrjob

asked May 08 '13 at 04:27

Abe

2 answers

How to debug python MapReduce programs written in mrjob from eclipse

I am trying to debug MapReduce jobs written with Python's mrjob library using Eclipse under Ubuntu. Does anyone have an idea how this could be done?

python eclipse mapreduce mrjob

asked Dec 11 '12 at 12:12

pacodelumberg
