Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that assists the creation and running of Hadoop Streaming jobs

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
331 questions
2
votes
1 answer

Mrjob in Hadoop mode: Error launching job, bad input path: File does not exist

I'm trying to run the Mrjob example from the book Hadoop with Python on my laptop, in pseudo-distributed mode (the file salaries.csv can be found here). So I can start the namenode and the datanode: start-dfs.sh returns: Starting namenodes on…
user189035
  • 5,589
  • 13
  • 52
  • 112
2
votes
0 answers

mrjob join non unique key

Using mrjob, I want to map a key of table_1: a to values of x and y from table_2 and table_3, i.e. z, as shown in the output. I wrote some code (see "mrjob combiner not working python") which gives the output a1 x1-x2-y1 a2 y1. But how to inner join the…
piyush-balwani
  • 524
  • 3
  • 15
2
votes
3 answers

Json-Opening Yelp Data Challenge's data set

I am interested in data mining and I am writing my thesis about it. For my thesis I want to use the Yelp Data Challenge's data set; however, I cannot open it since it is in JSON format and almost 2 GB. Its website says that the dataset can…
Bengi Koseoglu
  • 159
  • 4
  • 10
2
votes
0 answers

MapReduce - Iterating over keys and values in a reducer

I am having trouble understanding how to iterate over values. I have a mapper which will pass in something like: (cat, *): 5 (cat, *): 5 (cat, dog): 1 (pigeon, dog): 1 (hello, world): 1 (cat, dog): 1 (pigeon, dog): 1 (hello, world): 1 I am trying…
trixie
  • 33
  • 6
2
votes
1 answer

Error installing mrjob on Mac (OS X 10.11.1)

Typing in Terminal pip install mrjob gives the error message: "NameError: name 'execfile' is not defined" and "Command "python setup.py egg_info" failed with error code 1 in /private..." Using sudo pip install mrjob also gives the same error…
2
votes
1 answer

Declare mrjob mapper without ignoring key

I want to declare a mapper function with mrjob. My mapper function needs to refer to some constants to do some calculations, so I decided to put these constants into the key in the mapper (is there any other way?). I read the mrjob tutorial on…
lenhhoxung
  • 2,530
  • 2
  • 30
  • 61
2
votes
2 answers

"Counters from Step 1: No Counters found" using Hadoop and mrjob

I have a python file to count bigrams using mrjob up on Hadoop (version 2.6.0), but I'm not getting the output that I'm hoping for and I'm having trouble deciphering the output in my terminal for where I'm going wrong. My code: regex_for_words =…
moskemerak
  • 99
  • 1
  • 8
2
votes
1 answer

Giving Common Crawl location as input to Amazon EMR using mrjob python

It has been only days since I started using mrjob, and I have tried certain low- and medium-level tasks. Now I am stuck at giving the Common Crawl [from now on referred to as CC] location as input to EMR using Python mrjob. My config file looks like this…
The6thSense
  • 8,103
  • 8
  • 31
  • 65
2
votes
1 answer

Why am I getting [Errno 7] Argument list too long and OSError: [Errno 24] Too many open files when using mrjob v0.4.4?

It seems like the nature of the MapReduce framework is to work with many files. So when I get errors that tell me I'm using too many files, I suspect I'm doing something wrong. If I run the job with the inline runner and three directories, it…
numbers are fun
  • 423
  • 1
  • 7
  • 12
2
votes
1 answer

Passing result of mrjob step to next step as parameter

I am writing a multi-step mrjob. The first step does some pre-processing and ends with the following reducer: def some_reducer(self, key, values): values = (int (value) for value in values) if key == 'iwantthiskey': //I want to pass…
Martin Boyanov
  • 416
  • 3
  • 13
2
votes
1 answer

How to set the number of parallel reducers on EMR?

I am running a job on EMR with mrjob; I am using AMI version 2.4.7 and Hadoop version 1.0.3. I want to specify the number of reducers for a job, because I want to provide a higher parallelism to the next one. Reading the answers to the other…
David Nemeskey
  • 640
  • 1
  • 5
  • 16
2
votes
1 answer

mapper_pre_filter in MRJob

I have been trying to tweak the mapper_pre_filter example given here. Now, instead of specifying the command directly in steps, I'm writing a method to return that command, like this: from mrjob.job import MRJob from mrjob.protocol import…
Saurabh Verma
  • 6,328
  • 12
  • 52
  • 84
2
votes
1 answer

How to set IAM role with MrJob 0.4.2 on EMR

I'm trying to set an IAM role to my EMR cluster with mrjob 0.4.2. I saw that there is a new option in 0.4.3 to do this, but it is still in development and I prefer to use the stable version instead. Any idea on how to do this? I have tried to create…
Beka
  • 725
  • 6
  • 22
2
votes
1 answer

Python hadoop mapreduce job using mrjob subprocess.CalledProcessError

I am trying to run the basic example from mrjob's website on my custom data. I have run Hadoop MapReduce successfully using streaming, and I have also successfully tried the script without Hadoop, but now I am trying to run it on Hadoop via mrjob by…
ziky90
  • 2,627
  • 4
  • 33
  • 47
2
votes
1 answer

Error launching job using mrjob on Hadoop

I am new to Hadoop and mrjob, and this book really helped me a lot to learn. I was trying to run mrSVM.py on Hadoop, as it works fine locally. But when I ran the following command: python mrSVM.py -r hadoop kickStart.txt it gave the following error: no…