Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that assists in the creation and running of Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob

331 questions

votes

1 answer

Executing mrjob bootstrap commands on head-node only

I have an mrjob configuration that includes loading a large file from S3 into HDFS. I would like to include these commands in the configuration file, but it seems that all bootstrap commands execute on all of the nodes in the cluster. This is…

configuration mrjob

asked May 14 '14 at 16:57

user3637654

votes

1 answer

Can I use MRJob to process big files in local mode?

I have a relatively big file, around 10 GB, to process. I suspect it won't fit into my laptop's RAM if MRJob decides to sort it in RAM or something similar. At the same time, I don't want to set up Hadoop or EMR - the job is not urgent and I can…

mrjob

asked May 05 '14 at 10:39

Spaceman

1,185
4
17
31

votes

1 answer

How can I get and process a new S3 file for every iteration of an mrjob mapper?

I have a log file of status_changes, each one of which has a driver_id, timestamp, and duration. Using driver_id and timestamp, I want to fetch the appropriate GPS log from S3. These GPS logs are stored in an S3 bucket in the form…

python emr mrjob

asked Apr 24 '14 at 18:57

numbers are fun

votes

2 answers

How to divide a file into chunks for multiprocessing

I have a file of around 1.5 GB and I want to divide it into chunks so that I can use multiprocessing to process each chunk using the pp (Parallel Python) module in Python. Until now I have used f.seek in Python, but it takes a lot of time, as it may be…

python algorithm file seek mrjob

asked Mar 03 '14 at 10:11

Aman Jagga
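
One way the chunking described in this question could be approached is to compute byte ranges that end on line boundaries, so each worker can seek straight to its own range. This is a minimal sketch in plain Python, not taken from the question or its answers; the helper names `chunk_offsets` and `read_chunk` are hypothetical, and the file is assumed to be newline-delimited.

```python
import os


def chunk_offsets(path, num_chunks):
    """Return (start, end) byte ranges that split the file on line boundaries."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    approx = max(1, size // num_chunks)  # target chunk size in bytes
    offsets = []
    start = 0
    with open(path, "rb") as f:
        while start < size:
            # Jump roughly one chunk ahead, then read to the end of that line
            # so that no line is split between two chunks.
            f.seek(min(start + approx, size))
            f.readline()
            end = min(f.tell(), size)
            offsets.append((start, end))
            start = end
    return offsets


def read_chunk(path, start, end):
    """Read the bytes of one chunk; each worker would call this for its range."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)
```

Only one seek and one `readline` are needed per chunk, so computing the ranges should stay cheap even for a large file. The number of ranges returned can be smaller than `num_chunks` when lines are long relative to the chunk size.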

votes

1 answer

MapReduce: How to keep track of states across multiple lines in the mapper (say for counting trigrams)?

I'm trying to write a MapReduce program for computing Trigrams using the mrjob framework in Python. So far, this is what I have: from mrjob.job import MRJob class MRTrigram(MRJob): def mapper(self, _, line): w = line.split() …

python mapreduce mrjob

asked Mar 03 '14 at 03:36

TCSGrad

11,898
14
49
70
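
The state-tracking idea in this question can be illustrated by remembering the last two words of the previous line, so trigrams that straddle a line break are still emitted. This is a plain-Python sketch, not the mrjob API; `TrigramMapper` and `count_trigrams` are hypothetical names. It also assumes all lines pass through a single mapper instance in order, which a real Hadoop job does not guarantee once the input is split across tasks.

```python
from collections import Counter


class TrigramMapper:
    """Hypothetical mapper-like class that carries state between lines."""

    def __init__(self):
        self.tail = []  # up to the last two words seen so far

    def mapper(self, _, line):
        # Prepend the remembered words so trigrams can span a line break.
        words = self.tail + line.split()
        for i in range(len(words) - 2):
            yield tuple(words[i:i + 3]), 1
        self.tail = words[-2:]


def count_trigrams(lines):
    """Drive the mapper over lines and sum the counts, like a simple reducer."""
    m = TrigramMapper()
    counts = Counter()
    for line in lines:
        for trigram, n in m.mapper(None, line):
            counts[trigram] += n
    return counts
```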

votes

1 answer

mrjob - automatic tar of source directory

I've created an Amazon EMR job using mrjob. My MapReduce job inherits from a common helper class that makes parsing the Apache log easier. The class I inherit from is shared among several MapReduce jobs, so this is my file structure:…

python amazon-web-services amazon-emr mrjob

asked Dec 18 '13 at 16:20

Swedish Zorro

votes

1 answer

mrJob python mapReduce word_count.py

I have just started using mrJob (MapReduce for Python) and am new to the MapReduce paradigm. I would like to know the following about the word_count.py tutorial that is present on the MRJob documentation site. The docs say that if we create a…

python mapreduce mapper word-count mrjob

asked Nov 14 '13 at 01:49

anonuser0428

11,789
22
63
86

votes

1 answer

python mrjob module not found on CDH virtual machine

I'm using Mrjob to run Python code in Hadoop. I'm using a CDH package with a virtual machine on a single-node cluster. My mrjob ran correctly when I tested the code locally, but when I ran it on the Hadoop cluster, it threw an error: No module named…

python hadoop virtual-machine mrjob

asked Oct 22 '13 at 17:41

sky jiao

votes

1 answer

How can I write an iteration in Python using mrjob mapper reducer, for which the counter is a part of the computation in the loop?

I have a program that iterates a mapper and a reducer n times consecutively. However, for each iteration, the mapper of each key-value pair computes a value that depends on n. from mrjob.job import mrjob class MRWord(mrjob): def…

python mapper mrjob reducers

asked Sep 28 '13 at 15:30

Pippi

2,451
8
39
59

votes

1 answer

Why does the main statement in a mrjob Python program accept only one line of code?

I want to know how long a mrjob program runs. However, I get an "unindent does not match any outer indentation level" error if I put in time.time() before and after MRWord.run(), and I couldn't find any documentation about this. What am I…

python indentation mrjob

asked Sep 24 '13 at 03:19

Pippi

2,451
8
39
59
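
The error quoted in this question is Python's general indentation error, which typically appears when the lines of a block are indented inconsistently (for example, tabs mixed with spaces). The sketch below shows several statements under one `if __name__ == "__main__":` block with uniform four-space indentation. `run_job` is a hypothetical stand-in for the question's `MRWord.run()`, and `timed` is a hypothetical helper, not part of mrjob.

```python
import time


def run_job():
    # Hypothetical stand-in for MRWord.run(); does a little work and returns.
    return sum(range(1000))


def timed(func):
    """Call func and return its result together with the elapsed seconds."""
    start = time.time()
    result = func()
    elapsed = time.time() - start
    return result, elapsed


if __name__ == "__main__":
    # Every statement in this block uses the same four-space indentation.
    result, elapsed = timed(run_job)
    print("job took %.3f seconds" % elapsed)
```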

votes

2 answers

How should data files be included to mrjob on EMR?

I am trying to run a mrjob on Amazon's EMR. I've tested the job locally using the inline runner, but it fails when running on Amazon. I've narrowed the failure down to my dependence on an external data file zip_codes.txt. If I run without that…

python mapreduce amazon-emr emr mrjob

asked Sep 24 '13 at 00:40

fixedpoint

1,575
1
17
24

votes

1 answer

How to create a hadoop runner?

I have the following simple mrjob script, which reads a large file line by line, performs an operation on each line and prints the output: #!/usr/bin/env python …

python hadoop mrjob

asked Aug 27 '13 at 01:30

Frank

64,140
93
237
324

votes

1 answer

How do I cancel mrJob once it's running? ^C doesn't work

Is there a simple way to make mrJob scripts interruptible? Pretty simple question, but it makes a big difference for debugging. I'm mainly interested in canceling python-only test jobs, because this is where most debugging happens. python…

mrjob

asked Apr 25 '13 at 18:07

Abe

22,738
26
82
111

votes

1 answer

MRJob - Python - How to return null when division is 0/value

How can I modify this code so when senti_avg is not divisible (0/value), reducer() outputs NULL or NONE instead of crashing? def reducer(self, bs_id, value): avg_data = list(value) senti_sum = sum([a[0] for a in avg_data]) word_sum =…

python mrjob

asked Apr 09 '13 at 14:07

Nicolas Hung
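
A guard on the denominator is one way the behaviour asked about here could be obtained. The sketch below is a plain function rather than an MRJob method. The question's code is truncated after `word_sum =`, so the second tuple field used for `word_sum` is an assumption, as is the choice to yield `None` when that sum is zero.

```python
def reducer(bs_id, values):
    """Yield (bs_id, average), or (bs_id, None) when the denominator is zero."""
    avg_data = list(values)
    senti_sum = sum(a[0] for a in avg_data)
    # Assumption: the second field of each tuple holds the word count.
    word_sum = sum(a[1] for a in avg_data)
    # Check the denominator first instead of letting ZeroDivisionError propagate.
    senti_avg = senti_sum / float(word_sum) if word_sum else None
    yield bs_id, senti_avg
```

Inside an MRJob class the same body would take `self` as its first parameter.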

votes

0 answers

EMR No output for a long time

I have a MapReduce job written in Python using the MRJob library. The job takes around 30 minutes to complete on my local machine. While running the same job on EMR, I am seeing no output for a long time (about 1 hour). I had to close down the job. Also the…

python hadoop mapreduce emr mrjob

asked Jan 18 '13 at 11:42

Read Q

1,405
2
14
26
