Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which lets you buy time on a Hadoop cluster on an hourly basis. It also works with your own Hadoop cluster.

Mrjob can be installed with pip:

pip install mrjob
331 questions
2
votes
1 answer

Iterative kmeans based on mapreduce and hadoop

I have written a simple k-means clustering code for Hadoop (two separate programs - mapper and reducer). The code is working over a small dataset of 2d points on my local box. It's written in Python and I plan to use Streaming API. After each run of…
Amin Mohebi
  • 194
  • 1
  • 2
  • 14
2
votes
2 answers

How to use avro files as input to a MRJob job?

I need to take avro files as input to a mrjob hadoop job. I can't find any documentation on how to do that unless I pass extra commands to the hadoop streaming jar. This will complicate development though because I've been using the inline runner to…
jbrown
  • 7,518
  • 16
  • 69
  • 117
2
votes
2 answers

s3distcp error "Argument '--arg' doesn't match"

I'm trying to use s3distcp for an EMR job and got this exception: Exception in thread "main" java.lang.RuntimeException: Argument --arg doesn't match. at emr.hbase.options.Options.parseArguments(Options.java:75) at…
Thi Duong Nguyen
  • 1,745
  • 2
  • 12
  • 18
2
votes
1 answer

Hadoop removes MapReduce history when it is restarted

I am carrying out several Hadoop tests using the TestDFSIO and TeraSort benchmark tools. I am basically testing with different numbers of datanodes in order to assess the linearity of the processing capacity and datanode scalability. During the above…
VikBar
  • 21
  • 2
2
votes
2 answers

failed to use mapreduce in python

I am trying to learn to write MapReduce programs using Python mrjob. I am getting the following error: Traceback: dumping stdin to local file /tmp/pyes_mrjob.testuser.20131004.103251.998597/STDIN Making directory…
user2695817
  • 121
  • 1
  • 7
2
votes
1 answer

processing LZO sequence files with mrjob

I'm writing a task with mrjob to compute various statistics using the Google Ngrams data: https://aws.amazon.com/datasets/8172056142375670 I developed & tested my script locally using an uncompressed subset of the data in tab-delimited text. Once I…
burr
  • 529
  • 5
  • 8
2
votes
1 answer

python mrjob - gaierror: [Errno -2] Name or service not known

I'm trying to access s3 files from the mrjob module. Here's the code that's failing: from mrjob.emr import S3Filesystem fs = S3Filesystem("", "",…
Matt
  • 100
  • 8
2
votes
1 answer

mrjob: suppress key (or value) in reducer output

By default, mrjob writes the key and the value from output in key[tab]value format. This happens even if the key (or the value) is empty, null, or otherwise not interesting. Suppose my key, value pair is None, {"a":1, "b":1}. Then I get…
Abe
  • 22,738
  • 26
  • 82
  • 111
2
votes
2 answers

MRJOB open JSON file - Python

I am trying to load a JSON file as part of the mapper function, but it returns "No such file in directory" although the file exists. I am already opening a file and parsing through its lines, but I want to compare some of its values to a second…
Nicolas Hung
  • 595
  • 1
  • 6
  • 15
2
votes
2 answers

Access distributed cache from MrJob

I am writing a Hadoop app using mrjob. I need to use the distributed cache to access some files. I know that there is a -files option in Hadoop Streaming, but I don't know how to access those files in the program. Thanks for your help.
2
votes
2 answers

Is there a way to specify the title of a job from mrjob in the Hadoop Administration web interface?

I have several different jobs started from the Python library mrjob, including jobs with multiple steps. How can I replace streamjob with a custom name? For example, wordcount_step_1, wordcount_step_2, etc.
gak
  • 32,061
  • 28
  • 119
  • 154
2
votes
3 answers

MRjob: Can a reducer perform 2 operations?

I am trying to yield the probability of each key, value pair generated by the mapper. So, let's say the mapper yields: a, (r, 5) a, (e, 6) a, (w, 7). I need to add 5+6+7 = 18 and then find the probabilities 5/18, 6/18, 7/18 so the final output from the reducer…
Nicolas Hung
  • 595
  • 1
  • 6
  • 15
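The arithmetic in the excerpt above (sum the counts for a key, then divide each count by that sum) can be sketched in plain Python. This is only a rough illustration under assumptions, not the asker's code or an accepted answer: `probability_reducer` is a hypothetical name, and it assumes the values for one key are few enough to hold in memory, since a streamed iterator of values can only be consumed once.

```python
def probability_reducer(key, values):
    """Hypothetical reducer body: values is an iterable of (label, count) pairs."""
    # Materialize the values first; a streamed iterator cannot be read twice.
    pairs = list(values)
    # First operation: total count for this key.
    total = sum(count for _, count in pairs)
    # Second operation: each pair's share of the total.
    for label, count in pairs:
        yield key, (label, count / total)


# Sample input taken from the excerpt: a, (r, 5) a, (e, 6) a, (w, 7)
results = list(probability_reducer("a", iter([("r", 5), ("e", 6), ("w", 7)])))
for key, (label, probability) in results:
    print(key, label, probability)
```

The same generator shape should fit a reducer method in a job class, though that wiring is left out here so the sketch runs without a cluster or any extra package.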
2
votes
1 answer

MRJob and mapreduce task partitioning over Hadoop

I am trying to perform a mapreduce job using the Python MRJob lib and am having some issues getting it to properly distribute across my Hadoop cluster. I believe I am simply missing a basic principle of mapreduce. My cluster is a small, one master…
acnutch
  • 104
  • 1
  • 9
2
votes
1 answer

Unicode files with mrjob

I'm attempting to run a basic character count using mrjob. The file is a unicode UTF-8 text document that contains, among other symbols, Chinese characters. When I run the character count, I only get counts of symbols in the ASCII character set…
DevinRB
  • 107
  • 2
  • 8
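For the character count described above, one possible sketch in plain Python decodes each raw line as UTF-8 before counting, so that multi-byte characters such as Chinese ones are counted as single characters. Whether missing decoding is the actual cause of the asker's problem is an assumption; `count_characters` is a hypothetical helper, not part of mrjob.

```python
from collections import Counter


def count_characters(raw_line: bytes) -> Counter:
    """Hypothetical helper: count non-whitespace characters in one UTF-8 line."""
    # Decode bytes to text first; counting raw bytes would split multi-byte characters.
    text = raw_line.decode("utf-8")
    return Counter(ch for ch in text if not ch.isspace())


# Illustrative input mixing ASCII and Chinese characters.
counts = count_characters("ab 中文中".encode("utf-8"))
for character, count in counts.items():
    print(character, count)
```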
1
vote
1 answer

How to steps differences reduce in Hadoop?

How to steps differences reduce in Hadoop? I have a problem understanding Hadoop. I have two files, and first I did a join between those files. One file is about countries and the other is about clients in each country. Example, clients.csv: Bertram…
DANIEL
  • 13
  • 5