Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which lets you buy time on a Hadoop cluster on an hourly basis. It also works with your own Hadoop cluster.

Mrjob can be installed with pip:

pip install mrjob

331 questions

1 answer

Iterative kmeans based on mapreduce and hadoop

I have written simple k-means clustering code for Hadoop (two separate programs: a mapper and a reducer). The code works on a small dataset of 2D points on my local box. It's written in Python, and I plan to use the Streaming API. After each run of…

python hadoop mrjob

asked Jun 14 '14 at 11:04

Amin Mohebi

2 answers

How to use avro files as input to a MRJob job?

I need to use Avro files as input to an mrjob Hadoop job. I can't find any documentation on how to do that unless I pass extra commands to the Hadoop Streaming jar. This will complicate development, though, because I've been using the inline runner to…

python hadoop hadoop-streaming mrjob

asked Mar 13 '14 at 10:15

jbrown

2 answers

s3distcp error "Argument '--arg' doesn't match"

I'm trying to use s3distcp for an EMR job and got this exception: Exception in thread "main" java.lang.RuntimeException: Argument --arg doesn't match. at emr.hbase.options.Options.parseArguments(Options.java:75) at…

hadoop mapreduce elastic-map-reduce emr mrjob

asked Nov 03 '13 at 01:15

Thi Duong Nguyen

1 answer

Hadoop removes MapReduce history when it is restarted

I am carrying out several Hadoop tests using the TestDFSIO and TeraSort benchmark tools. I am basically testing with different numbers of datanodes in order to assess the linearity of the processing capacity and datanode scalability. During the above…

hadoop mapreduce mrjob

asked Oct 28 '13 at 21:38

VikBar

2 answers

failed to use mapreduce in python

I am trying to learn MapReduce programming using Python mrjob. I am getting the following error: Traceback: dumping stdin to local file /tmp/pyes_mrjob.testuser.20131004.103251.998597/STDIN Making directory…

python-2.7 mapreduce hadoop-streaming mrjob

asked Oct 04 '13 at 10:55

user2695817

1 answer

processing LZO sequence files with mrjob

I'm writing a task with mrjob to compute various statistics using the Google Ngrams data: https://aws.amazon.com/datasets/8172056142375670 I developed & tested my script locally using an uncompressed subset of the data in tab-delimited text. Once I…

python hadoop lzo mrjob

asked Sep 18 '13 at 20:55

burr

1 answer

python mrjob - gaierror: [Errno -2] Name or service not known

I'm trying to access s3 files from the mrjob module. Here's the code that's failing: from mrjob.emr import S3Filesystem fs = S3Filesystem("", "",…

python amazon-s3 emr mrjob

asked Jul 03 '13 at 22:04

Matt

1 answer

mrjob: suppress key (or value) in reducer output

By default, mrJob stores the key and the value from output in key[tab]output format. This happens even if the key (or the value) is empty, null, or otherwise not interesting. Suppose my key, value pair is None, {"a":1, "b":1}. Then I get…

mrjob

asked Apr 25 '13 at 18:19

Abe

2 answers

MRJOB open JSON file - Python

I am trying to load a JSON file as part of the mapper function, but it returns "No such file in directory" although the file exists. I am already opening a file and parsing through its lines, but I want to compare some of its values to a second…

python mrjob

asked Apr 09 '13 at 12:58

Nicolas Hung

2 answers

Access distributed cache from MrJob

I am writing a Hadoop app using MrJob. I need to use the distributed cache to access some files. I know that there is a -files option in Hadoop Streaming, but I don't know how to access it in the program. Thanks for your help.

python hadoop mrjob

asked Apr 08 '13 at 12:49

user2257622

2 answers

Is there a way to specify the title of a job from mrjob in the Hadoop Administration web interface?

I have several different jobs started from the Python library mrjob, including jobs with multiple steps. How can I replace streamjob with a custom name? For example, wordcount_step_1, wordcount_step_2, etc.

python mapreduce hadoop-streaming mrjob

asked Mar 20 '13 at 22:36

gak

3 answers

MRjob: Can a reducer perform 2 operations?

I am trying to yield the probability of each key, value pair generated by the mapper. So, let's say the mapper yields: a, (r, 5) a, (e, 6) a, (w, 7) I need to add 5+6+7 = 18 and then find the probabilities 5/18, 6/18, 7/18 so the final output from the reducer…

python mapreduce mrjob

asked Feb 24 '13 at 11:06

Nicolas Hung
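
For the reducer question above, here is a minimal sketch in plain Python of one way to do both operations in a single reducer call: buffer the values, sum the counts, then yield each share of the total. It does not import mrjob; the function name `reducer_probabilities` is hypothetical, and the sketch assumes all values for one key fit in memory.

```python
def reducer_probabilities(key, values):
    """Yield (key, (label, probability)) for each (label, count) pair.

    Hypothetical helper written in the shape of a reducer: it takes one key
    and an iterable of (label, count) values.
    """
    # The values may be a one-pass iterator, so keep a copy before summing.
    pairs = list(values)
    total = sum(count for _, count in pairs)
    if total == 0:
        return  # nothing to normalize
    for label, count in pairs:
        yield key, (label, count / total)


if __name__ == "__main__":
    sample = iter([("r", 5), ("e", 6), ("w", 7)])
    for key, (label, prob) in reducer_probabilities("a", sample):
        print(key, label, prob)
```

With the sample values from the question, the three probabilities are 5/18, 6/18, and 7/18. If the values for a key could be too large to buffer, a two-step job (one step to compute totals, one to normalize) would be the alternative.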

1 answer

MRJob and mapreduce task partitioning over Hadoop

I am trying to perform a mapreduce job using the Python MRJob lib and am having some issues getting it to properly distribute across my Hadoop cluster. I believe I am simply missing a basic principle of mapreduce. My cluster is a small, one master…

hadoop mapreduce mrjob

asked Jan 02 '13 at 08:26

acnutch

1 answer

Unicode files with mrjob

I'm attempting to run a basic character count using mrjob. The file is a unicode UTF-8 text document that contains, among other symbols, Chinese characters. When I run the character count, I only get counts of symbols in the ASCII character set…

python unicode mrjob

asked Dec 05 '12 at 22:50

DevinRB

1 answer

How to steps differences reduce in Hadoop?

I am having trouble understanding Hadoop. I have two files, and first I did a join between them. One file is about countries and the other is about clients in each country. Example, clients.csv: Bertram…

python hadoop mapreduce mrjob

asked Nov 12 '22 at 18:07

DANIEL
