Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that assists the creation and running of Hadoop Streaming jobs

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which lets you buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
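
To illustrate the kind of work a Hadoop Streaming job does, here is a minimal single-process sketch of the map, sort, and reduce phases for a word count. It deliberately uses only the Python standard library and is not mrjob's API; the names mapper, reducer, and run_job are hypothetical and chosen for illustration only.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the counts collected for a single word."""
    yield word, sum(counts)


def run_job(lines):
    """Simulate map -> sort -> reduce in one process (illustrative only)."""
    mapped = [pair for line in lines for pair in mapper(line)]
    # Sorting by key stands in for the shuffle/sort step between phases.
    mapped.sort(key=itemgetter(0))
    results = []
    for word, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(word, (count for _, count in group)))
    return results


if __name__ == "__main__":
    for word, total in run_job(["hello world", "hello again"]):
        print(word, total)
```

In a real job, a framework such as mrjob would handle input splitting, serialization, and running these phases on a cluster; the sketch only shows the data flow.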
331 questions
3
votes
0 answers

mrjob combiner not working python

Simple map-combine-reduce program: map column 1 to the value in column 3, append '+' to each mapper output for the same key, and append '-' after the reduce output for the same key. Both files, input_1 and input_2, contain: a 1 2 3 a 4 5 6 Code is from mrjob.job…
piyush-balwani
  • 524
  • 3
  • 15
3
votes
2 answers

MapReduce job to yield top 10 values using Python's MRjob

I want this map reduce job (code below) to output the top 10 most rated products. It keeps giving me the following error message: it = izip(iterable, count(0,-1)) # decorate TypeError: izip argument #1 must support iteration. I'm…
Ije
  • 43
  • 1
  • 7
3
votes
0 answers

Load JSON into MrJob - Python

I've been trying to load a JSON data file into mrjob, but can't really get it to work. from mrjob.job import MRJob from mrjob.protocol import JSONProtocol def type_hashing(entry): return entry[13].lower() class ReduceData(MRJob): …
Syspect
  • 921
  • 7
  • 22
  • 50
3
votes
2 answers

Utilize multi-core with LocalMRJobRunner for MRJob

I am using the python yelp/mrjob framework for my mapreduce jobs. There are only about 4G of data and I don't want to go through the trouble of setting up Hadoop or EMR. I have a 64 core machine and it takes about 2 hours to process the data with…
Andy
  • 1,231
  • 1
  • 15
  • 27
3
votes
1 answer

cannot run python MRJob locally

If I understand MRJob correctly, you can simulate Hadoop's multi-process run by running python mrfile.py -r local input.txt. I'm running Windows (no choice for now), and when I issue the above command, I'm getting a bunch of mambo…
user2773013
  • 3,102
  • 8
  • 38
  • 58
3
votes
0 answers

mrjob hanging forever when running in hadoop

I am running the tutorial in the docs, and the word count works for local files, but when I try python mr.py -r hadoop 1.txt, it hangs. When I interrupt it from the keyboard, the log is: no configs found; falling back on auto-configuration no…
kevin ding
  • 39
  • 2
3
votes
2 answers

Is it possible to use a Conda environment as "virtualenv" for a Hadoop Streaming Job (in Python)?

We are currently using Luigi, MRJob and other frameworks to run Hadoop streaming jobs using Python. We are already able to ship each job with its own virtualenv, so no specific Python dependencies are installed on the nodes (see the article). I was…
mfcabrera
  • 781
  • 10
  • 26
3
votes
1 answer

Where does sys.stdout.write() go to in MRJOB mapper?

mrjob.conf runners: emr: aws_access_key_id: ** aws_secret_access_key: ** aws_region: us-east-1 aws_availability_zone: us-east-1a ec2_key_pair: scrapers2 ec2_key_pair_file: ~/arachnid.pem ec2_instance_type: c3.8xlarge …
birdnerd
  • 111
  • 2
  • 4
3
votes
1 answer

Can I use mrjob python library on partitioned hive tables?

I have a user access to hadoop server/cluster containing data that is stored solely in partitioned tables/files in hive (avro). I was wondering if I can perform mapreduce using python mrjob on these tables? So far I have been testing mrjob locally…
Tomasz Sosiński
  • 849
  • 1
  • 10
  • 12
3
votes
0 answers

kmeans based on mapreduce by python

I am going to write a mapper and reducer for the k-means algorithm. I think the best approach is to put the distance calculation in the mapper and send its output to the reducer, with the cluster id as the key and the coordinates of the row as the value. In reducer,…
Amin Mohebi
  • 194
  • 1
  • 2
  • 14
3
votes
2 answers

Python map reduce simple wordcount in Cyrillic text

I'm trying to implement a very basic wordcount example with MRJob. Everything works fine with ASCII input, but when I mix Cyrillic words into the input, I get something like this as an output "\u043c\u0438\u0440" 1 "again!" 1 "hello" 2 "world"…
Anton
  • 66
  • 6
3
votes
1 answer

MRJob: socket.error: [Errno 104] Connection reset by peer

In short: "socket.error: [Errno 104] Connection reset by peer" exception while using MRJob. The script actually has access to S3 because it does create buckets and uploads some small files (I've checked manually via AWS console). But the largest…
Spaceman
  • 1,185
  • 4
  • 17
  • 31
3
votes
2 answers

Passing parameters to reducer in MRjob

I am using MRjob to run Hadoop Streaming jobs over our HBase instance. For the life of me I cannot figure out how to pass a parameter to my reducer. I have two parameters that I want to pass to my reducer from when I run the job: startDate and…
Sourav Dey
  • 387
  • 1
  • 4
  • 14
3
votes
3 answers

How do I write the output of an EMR streaming job to HDFS?

I see examples of people writing EMR output to HDFS, but I haven't been able to find examples of how it's done. On top of that, this documentation seems to say that the --output parameter for an EMR streaming job must be an S3 bucket. When I…
Abe
  • 22,738
  • 26
  • 82
  • 111
3
votes
2 answers

How to debug Python MapReduce programs written in mrjob from Eclipse

I am trying to debug MapReduce jobs written with Python's mrjob library using Eclipse under Ubuntu. Does anyone have an idea how this could be done?
pacodelumberg
  • 2,214
  • 4
  • 25
  • 32