Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which lets you buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob

331 questions

1 answer

Mrjob in Hadoop mode: Error launching job, bad input path: File does not exist

I'm trying to run the mrjob example from the book Hadoop with Python on my laptop, in pseudo-distributed mode. (The file salaries.csv can be found here.) So I can start the namenode and the datanode: start-dfs.sh returns: Starting namenodes on…

python ubuntu hadoop mrjob

asked Dec 24 '16 at 15:04

user189035

0 answers

mrjob join non unique key

Using mrjob, I want to map a key of table_1: a to values of x and y from table_2 and table_3, i.e. z and was shown in output. I wrote some code (mrjob combiner not working python) which gives output as a1 x1-x2-y1 a2 y1. But how to inner join the…

python join mapreduce mrjob

asked Dec 13 '16 at 15:44

piyush-balwani

3 answers

JSON: Opening Yelp Data Challenge's data set

I am interested in data mining and I am writing my thesis about it. For my thesis I want to use Yelp's data challenge data set; however, I cannot open it since it is in JSON format and almost 2 GB. Its website says that the dataset can…

json dataset yelp mrjob

asked Feb 23 '16 at 21:26

Bengi Koseoglu

0 answers

MapReduce - Iterating over keys and values in a reducer

I am having trouble understanding how to iterate over values. I have a mapper which will pass in something like: (cat, *): 5 (cat, *): 5 (cat, dog): 1 (pigeon, dog): 1 (hello, world): 1 (cat, dog): 1 (pigeon, dog): 1 (hello, world): 1 I am trying…

python hadoop mapreduce mrjob

asked Jan 09 '16 at 22:32

trixie
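
The excerpt above is cut off, so what the asker wanted to compute is not known. As a hedged illustration only, the sketch below imitates in plain Python how a reducer is handed one key together with an iterator over that key's values. It does not use mrjob itself; `reducer`, `run_reducer`, and the choice to sum the counts are assumptions made for this example, and the input pairs are copied from the excerpt.

```python
from itertools import groupby
from operator import itemgetter


def reducer(key, values):
    """Sum every count emitted for one key.

    `values` is an iterator, so it can only be walked once.
    Summing is an assumption; the original question is truncated.
    """
    yield key, sum(values)


def run_reducer(mapper_output):
    """Imitate the shuffle step: sort pairs by key, then reduce each group."""
    results = []
    ordered = sorted(mapper_output, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        counts = (count for _, count in group)  # lazy, like a real reducer input
        results.extend(reducer(key, counts))
    return results


# Mapper output copied from the question excerpt.
pairs = [
    (("cat", "*"), 5), (("cat", "*"), 5), (("cat", "dog"), 1),
    (("pigeon", "dog"), 1), (("hello", "world"), 1), (("cat", "dog"), 1),
    (("pigeon", "dog"), 1), (("hello", "world"), 1),
]

totals = dict(run_reducer(pairs))
for key, total in totals.items():
    print(key, total)
```

Because the values arrive as a one-pass iterator, a reducer that needs them twice would have to copy them into a list first.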

1 answer

Error installing mrjob on Mac (OS X 10.11.1)

Typing in Terminal pip install mrjob gives the error message: "NameError: name 'execfile' is not defined" and "Command "python setup.py egg_info" failed with error code 1 in /private..." Using sudo pip install mrjob also gives the same error…

python macos python-3.x osx-elcapitan mrjob

asked Nov 24 '15 at 15:43

Stuart Jeckel

1 answer

Declare mrjob mapper without ignoring key

I want to declare a mapper function with mrjob. Because my mapper function needs to refer to some constants in its calculations, I decided to put these constants into the key in the mapper (is there any other way?). I read the mrjob tutorial on…

python hadoop mapreduce mrjob

asked Nov 16 '15 at 22:38

lenhhoxung

2 answers

"Counters from Step 1: No Counters found" using Hadoop and mrjob

I have a python file to count bigrams using mrjob up on Hadoop (version 2.6.0), but I'm not getting the output that I'm hoping for and I'm having trouble deciphering the output in my terminal for where I'm going wrong. My code: regex_for_words =…

python python-2.7 hadoop mapreduce mrjob

asked Oct 25 '15 at 22:14

moskemerak

1 answer

Giving Common Crawl location as input to Amazon EMR using mrjob python

It has been only days since I started using mrjob and I have tried certain low and medium level tasks. Now I am stuck at giving the Common Crawl [from now on referred to as CC] location as input to EMR using Python mrjob. My config file looks like this…

python amazon-web-services emr mrjob common-crawl

asked Sep 27 '15 at 19:47

The6thSense

1 answer

Why am I getting [Errno 7] Argument list too long and OSError: [Errno 24] Too many open files when using mrjob v0.4.4?

It seems like the nature of the MapReduce framework is to work with many files. So when I get errors that tell me I'm using too many files, I suspect I'm doing something wrong. If I run the job with the inline runner and three directories, it…

python mrjob

asked Jun 04 '15 at 17:54

numbers are fun

1 answer

Passing result of mrjob step to next step as parameter

I am writing a multi-step mrjob. The first step does some pre-processing and ends with the following reducer: def some_reducer(self, key, values): values = (int (value) for value in values) if key == 'iwantthiskey': //I want to pass…

python mapreduce mrjob

asked May 09 '15 at 08:49

Martin Boyanov

1 answer

How to set the number of parallel reducers on EMR?

I am running a job on EMR with mrjob; I am using AMI version 2.4.7 and Hadoop version 1.0.3. I want to specify the number of reducers for a job, because I want to provide higher parallelism to the next one. Reading the answers to the other…

hadoop emr mrjob

asked Feb 26 '15 at 12:20

David Nemeskey

1 answer

mapper_pre_filter in MRJob

I have been trying to tweak the mapper_pre_filter example given here. Now, instead of specifying the command directly in steps, I'm writing a method to return that command, like this: from mrjob.job import MRJob from mrjob.protocol import…

python mapreduce mrjob

asked Feb 10 '15 at 09:29

Saurabh Verma

1 answer

How to set IAM role with MrJob 0.4.2 on EMR

I'm trying to set an IAM role on my EMR cluster with mrjob 0.4.2. I saw that there is a new option in 0.4.3 to do this, but it is still in development and I prefer to use the stable version instead. Any idea how to do this? I have tried to create…

python emr mrjob

asked Sep 01 '14 at 10:53

Beka

1 answer

Python hadoop mapreduce job using mrjob subprocess.CalledProcessError

I am trying to run the basic example from mrjob's website on my custom data. I have run Hadoop MapReduce successfully using streaming, and I have also successfully tried the script without Hadoop, but now I am trying to run it on Hadoop via mrjob by…

python hadoop mrjob

asked Aug 24 '14 at 20:09

ziky90

1 answer

Error launching job using mrjob on Hadoop

I am new to Hadoop and mrjob, and this book really helped me a lot to learn. I was trying to run mrSVM.py on Hadoop, as it works fine locally. But when I ran the following command: python mrSVM.py -r hadoop kickStart.txt, it gave the following error: no…

python python-2.7 hadoop mrjob

asked Aug 18 '14 at 08:07

Manvendra singh tomar
