Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which lets you buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob

331 questions

4 votes, 3 answers

Python Module Import Error "ImportError: No module named mrjob.job"

System: Mac OS X 10.6.5, Python 2.6. I try to run the Python script below:

    from mrjob.job import MRJob

    class MRWordCounter(MRJob):
        def mapper(self, key, line):
            for word in line.split():
                yield word, 1

        def reducer(self, word,…
worker1138 • 2,071 • 5 • 29 • 36

4 votes, 0 answers

mrjob with JSON data

A friend and I are working on a rather large JSON file. We want to perform MapReduce on parts of this file as quickly as possible. As it appears to be hard to feed a JSON file directly into an mrjob job, we attempted to write the…
Superdids • 77 • 1 • 7

4 votes, 2 answers

Getting error while running django_cron

When I try to run the cron job in Django using the command python manage.py runcrons, it shows the error below: $ python manage.py runcrons No handlers could be found for logger "django_cron" Does anyone have any idea about this…
Akshath Kumar • 489 • 2 • 6 • 15

4 votes, 1 answer

How to specifically determine input for each map step in MRJob?

I am working on a map-reduce job consisting of multiple steps. With mrjob, every step receives the previous step's output. The problem is that I don't want it to. What I want is to extract some information and use it in the second step against all input, and so on.…
Mehraban • 3,164 • 4 • 37 • 60

4 votes, 1 answer

Why do I get "WindowsError [Error5] Access is denied" when running a Python file using mrjob?

I'm trying to use mrjob in a Python file and run it from the command line, but I keep getting an error log saying: C:\Users\Ni\Desktop>python si601lab6_sol.py pg1268.txt no configs found; falling back on auto-configuration no configs found;…
Ni Yan • 165 • 1 • 4 • 9

4 votes, 1 answer

How can I use s3 object names as inputs to an MRJob mapper, but not the s3 objects themselves?

I'm missing something obvious about Yelp's mrjob library. Setting up an MRJob class is almost trivially easy. Running it over a file or stdin is, too. But how can I change the input to the job from a file, either local or in S3, to, say, keys…
Christopher • 42,720 • 11 • 81 • 99

4 votes, 1 answer

MRJob :- Display intermediate values in map reduce

How can I display intermediate values (i.e., print a variable or a list) on the terminal while running a MapReduce program using the Python MRJob library?
Read Q • 1,405 • 2 • 14 • 26

4 votes, 4 answers

How does mapreduce sort and shuffle work?

I am using Yelp's MRJob library to implement map-reduce functionality. I know that MapReduce has an internal sort and shuffle step that sorts the values on the basis of their keys. So if I have the following results after the map phase (1, 24)…
Read Q • 1,405 • 2 • 14 • 26
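
As a rough illustration of the sort-and-shuffle step this question asks about, the sketch below simulates it in plain Python, without mrjob or Hadoop. The helper name shuffle_and_sort and the sample pairs are hypothetical; only the pair (1, 24) comes from the excerpt. A real cluster performs this grouping across machines, and the order of values within a key is generally not guaranteed unless the job is configured for it.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(mapped_pairs):
    """Group mapper output by key, roughly as the shuffle phase does.

    Hypothetical helper for illustration; not part of mrjob or Hadoop.
    """
    # Sort by key only, so that pairs with equal keys become adjacent.
    ordered = sorted(mapped_pairs, key=itemgetter(0))
    # Collect the values of each run of equal keys into one list.
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


# Sample mapper output; only (1, 24) appears in the excerpt, the rest is made up.
mapped = [(1, 24), (4, 25), (3, 26), (1, 3), (3, 12)]

for key, values in shuffle_and_sort(mapped):
    # A reducer is called once per key with all of that key's values.
    print(key, sum(values))
```

Each reducer call therefore sees one key together with every value emitted for it, which is why a reducer can sum or count without tracking keys itself.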

4 votes, 2 answers

How can I allot more memory to a Python program? It's not consuming more than 64MB on 4GB RAM

I have a Python program running on some input data on 32-bit Ubuntu 12.04 with 4GB of RAM. The time and space complexity of the program are both O(n). When the input data is around 100 kb, it completes execution in about 4 seconds, with peak RAM consumption being…
user1403483

4 votes, 1 answer

Can mrjob tasks output sets?

I tried outputting a Python set from a mapper in mrjob. I changed the function signatures of my combiners and reducers accordingly. However, I get this error: Counters From Step 1 Unencodable output: TypeError: 172804 When I change the sets to…
dangerChihuahua007 • 20,299 • 35 • 117 • 206
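
One plausible cause of the error in this question, assuming the job uses a JSON-based protocol (mrjob's default, as far as I know), is that JSON has no set type. The sketch below reproduces the failure with the standard library's json module alone, without importing mrjob, and shows one possible workaround of emitting a sorted list instead. The helper name encode_value is hypothetical.

```python
import json


def encode_value(value):
    """Encode a value as JSON, turning sets into sorted lists first.

    Hypothetical helper for illustration; not part of mrjob.
    """
    if isinstance(value, (set, frozenset)):
        value = sorted(value)
    return json.dumps(value)


words = {"b", "a", "c"}

try:
    json.dumps(words)  # the json module cannot serialize a set
except TypeError as exc:
    print("unencodable:", exc)

# Converting to a list first gives JSON something it can represent.
print(encode_value(words))
```

Inside a job, the equivalent would be to yield a list from the mapper and rebuild the set in the reducer if set semantics are needed there.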

3 votes, 1 answer

Minimum AWS policy requirements to run an EMR job

I'd like to run an Elastic MapReduce job on data from the S3 bucket com.test.mybucket, using the MRJob Python framework. However, I have lots of other data in S3, and other EC2 instances, that I don't want to touch. What is the minimum possible set of…
Kevin Burke • 61,194 • 76 • 188 • 305

3 votes, 1 answer

MapReduce pairwise comparison of all lines in multiple files

I'm getting started with Python's mrjob to convert some of my long-running Python programs into Hadoop MapReduce jobs. I've gotten the simple word count examples to work, and I conceptually understand the 'text-classification' example. However,…
JudoWill • 4,741 • 2 • 36 • 48

3 votes, 0 answers

Problem when using SORT_VALUES in a MapReduce job using mrjob (key-values are not sorted in the reducer input)

I want to create a MapReduce program whose reducer receives key-value pairs sorted by value. I'm using mrjob, whose SORT_VALUES parameter seemed ideal for the task. After setting this parameter to True, the reducer input is not sorted, for…

3 votes, 1 answer

MRJob sort reducer output

Is there any way to sort the output of the reducer function using mrjob? I think that the input to the reducer function is sorted by key, and I tried to exploit this to sort the output using another reducer like the one below, where I know values have…
Dandelion • 744 • 2 • 13 • 34

3 votes, 1 answer

python mapreduce - Skipping the first line of the .csv in mapper

I am trying to do MapReduce in Python, and my CSV file looks like this:

trip_id taxi_id pickup_time dropoff_time ... total
0 20117 2455.0 2013-05-05 09:45:00 50.44
1 44691 1779.0 2013-06-24 11:30:00 66.78

and my…
TTaa • 331 • 5 • 12