Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that assists in the creation and running of Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
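
A minimal word-count job, as a sketch of what an mrjob script typically looks like (the class name, file name, and input paths below are illustrative, not from any question on this page):

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit each word in the input line with a count of 1.
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum all counts emitted for the same word.
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

Saved as word_count.py, a script like this can be run locally with python word_count.py input.txt, or on EMR with python word_count.py -r emr s3://bucket/path/, assuming AWS credentials have been configured in mrjob.conf.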
331 questions
0
votes
1 answer

using sqlite3dbm with mrjob for map reduce

I have a sqlite3dbm database which stores data in key-value pairs. I need to process it using mrjob. When I run my code as xyz.py my_db.db, the mapper function doesn't work properly. def mapper(k,val): for word in val: yield(word,k) I get null for k
user1525721
  • 336
  • 5
  • 12
0
votes
1 answer

"The location specified by MRJOB_CONF" in mrjob documentation

Which path is "The location specified by MRJOB_CONF" in the mrjob documentation? Link to mrjob doc: http://mrjob.readthedocs.org/en/latest/guides/configs-basics.html
user1403483
0
votes
1 answer

Some elementary doubts about running MapReduce programs using mrjob on Amazon EMR

I am new to mrjob and I am having problems getting the job to run on Amazon EMR. I will write them in sequential order. I can run an mrjob on my local machine. However, when I have mrjob.conf in /home/ankit/.mrjob.conf and in /etc/mrjob.conf, the job…
user1403483
0
votes
1 answer

Import module in MRJob on EMR

Simple question: I have a module headers.py which defines a couple of variables I need in my main MRJob script. I should be able to run the job with python MRMyJob -r emr --file=headers.py s3://input/data/path and then in my MRJob script (MRMyJob),…
Vyassa Baratham
  • 1,457
  • 12
  • 18
0
votes
2 answers

Error running python mrjob word count example

I'm trying to run the example word count map reduce task using mrjob. I get the following error: Traceback (most recent call last): File "mr.py", line 3, in from mrjob.job import MRJob File…
nickponline
  • 25,354
  • 32
  • 99
  • 167
0
votes
2 answers

hadoop with mrjob piping on shell

I have an issue regarding mrjob. I'm using a Hadoop cluster with 3 datanodes, one namenode, and one jobtracker. Starting from a nifty sample application, I wrote something like the following first_script.py: for i in range(1,2000000): …
Mad Joker
  • 1
  • 1
-1
votes
1 answer

Python with Hadoop project: how to build a reducer to concatenate pairs of values

I have a small MapReduce project, and since I am new to this I am running into a lot of difficulties, so I would appreciate the help. In this project, I have a file that contains the nation, year, and weight. I want to find, for each nation's year…
Frank shi
  • 25
  • 4
-1
votes
1 answer

I'm getting a list of lists in my reducer output rather than a paired value and I am unsure of what to change in my code

The code below gives me nearly the output I want, but not quite. def reducer(self, year, words): x = Counter(words) most_common = x.most_common(3) sorted(x, key=x.get, reverse=True) yield (year,…
CKZ
  • 37
  • 5
-1
votes
1 answer

How to sys.stderr.write into a json file in Python?

I am running a MapReduce job with the mrjob library and I want to record the execution time to a JSON file. I record the time with this code: from datetime import datetime import sys if __name__ == '__main__': start_time = datetime.now() …
huy
  • 1,648
  • 3
  • 14
  • 40
-1
votes
2 answers

Python command line loop

I'm running an mrjob Python script, and on the command line I can pass the number of cores for the system to use: python example_script.py --num-cores 5. I'm looking to run the script for n number of cores for a benchmarking performance test, i.e. I…
F.D
  • 767
  • 2
  • 10
  • 23
-1
votes
1 answer

How to use mrjob.cat to auto-decompress inputs?

I want to use MrJob to analyze a dataset without decompressing it on disk beforehand (it is 18 GB compressed but >3 TB uncompressed). How can I use mrjob.cat to auto-decompress the file and stream it to my mapper? There aren't any code samples.
crypdick
  • 16,152
  • 7
  • 51
  • 74
-1
votes
1 answer

How to integrate data with python code before running python program on command line

I have downloaded the MovieLens dataset from that hyperlink, ml-100k.zip (it is a movie and user information dataset and it is in the older datasets tab), and I have written the simple MapReduce code below: from mrjob.job import MrJob class…
pcpcne
  • 43
  • 2
  • 11
-1
votes
2 answers

Performing a mapreduce function in Python

I'm trying to learn a little bit of MapReduce in combination with Python. Now I have the following code running from a tutorial I'm doing. from mrjob.job import MRJob class SpendByCustomer(MRJob): def mapper(self, _, line): …
John Dwyer
  • 189
  • 2
  • 13
-1
votes
1 answer

MRJob using a different Python interpreter for local vs. hadoop

I'm using MRJob on machine A to launch MapReduce jobs on machines B_0 thru B_10. The job has dependencies that require it to be run not with the default /bin/python (i.e. the output of which python on machine A) but with /path/to/weird/python, which…
Eli Rose
  • 6,788
  • 8
  • 35
  • 55
-1
votes
3 answers

How can I run mrjob with no input file?

I have an mrjob program that just gets data from a SQL database, so I don't need to read a local file or any input file; however, mrjob forces me into 'reading from STDIN', so I just create an empty file as the input file. It's really ugly. Is there a way to run…