Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which lets you buy time on a Hadoop cluster on an hourly basis. It also works with your own Hadoop cluster.

Mrjob can be installed with pip:

pip install mrjob
331 questions
0
votes
1 answer

Executing mrjob bootstrap commands on head-node only

I have an mrjob configuration that includes loading a large file from S3 into HDFS. I would like to include these commands in the configuration file, but it seems that all bootstrap commands execute on all of the nodes in the cluster. This is…
0
votes
1 answer

Can I use MRJob to process big files in local mode?

I have a relatively big file to process, around 10GB. I suspect it won't fit into my laptop's RAM if MRJob decides to sort it in RAM or something similar. At the same time, I don't want to set up Hadoop or EMR - the job is not urgent and I can…
Spaceman
  • 1,185
  • 4
  • 17
  • 31
0
votes
1 answer

How can I get and process a new S3 file for every iteration of an mrjob mapper?

I have a log file of status_changes, each one of which has a driver_id, timestamp, and duration. Using driver_id and timestamp, I want to fetch the appropriate GPS log from S3. These GPS logs are stored in an S3 bucket in the form…
numbers are fun
  • 423
  • 1
  • 7
  • 12
0
votes
2 answers

How to divide a file into chunks for multiprocessing

I have a file of around 1.5 GB and I want to divide it into chunks so that I can process each chunk in parallel using the pp (Parallel Python) module. So far I have used f.seek in Python, but it takes a lot of time, as it may be…
Aman Jagga
  • 301
  • 5
  • 15
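
One way to chunk a file like this, shown only as a minimal sketch: compute byte ranges that end on line boundaries, so each worker can seek straight to its own range. The helper names `chunk_offsets` and `read_chunk` are hypothetical and not part of pp or mrjob; the sketch assumes newline-delimited records.

```python
import os


def chunk_offsets(path, n_chunks):
    """Return (start, end) byte ranges that split the file on line boundaries."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    step = max(1, size // n_chunks)  # approximate chunk size in bytes
    offsets = []
    with open(path, "rb") as f:
        start = 0
        while start < size:
            # Jump roughly one chunk ahead, then finish the line we landed in.
            f.seek(min(start + step, size))
            f.readline()
            end = min(f.tell(), size)
            offsets.append((start, end))
            start = end
    return offsets


def read_chunk(path, start, end):
    """Read one byte range; each worker process could call this on its own range."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)
```

Only one seek and one short read happen per chunk while computing the offsets, so the cost of planning the split does not grow with file size.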
0
votes
1 answer

MapReduce: How to keep track of states across multiple lines in the mapper (say for counting trigrams)?

I'm trying to write a MapReduce program for computing Trigrams using the mrjob framework in Python. So far, this is what I have: from mrjob.job import MRJob class MRTrigram(MRJob): def mapper(self, _, line): w = line.split() …
TCSGrad
  • 11,898
  • 14
  • 49
  • 70
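
A minimal sketch of the idea in plain Python, not the asker's code: keep a sliding window of the last three words so that trigrams can span line breaks. The function name `trigrams` is hypothetical. In a real Hadoop job the input is split across mappers, so state like this carries over only within a single mapper's share of the input.

```python
from collections import deque


def trigrams(lines):
    """Yield word trigrams, carrying the last two words across line breaks."""
    window = deque(maxlen=3)  # oldest word drops out automatically
    for line in lines:
        for word in line.split():
            window.append(word)
            if len(window) == 3:
                yield tuple(window)
```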
0
votes
1 answer

mrjob - automatic tar of source directory

I've created an Amazon EMR job using mrjob. My mapreduce job inherits from a common helper class that makes parsing the Apache log easier. The class I inherit from is shared amongst several mapreduce jobs, so this is my file structure:…
0
votes
1 answer

mrJob python mapReduce word_count.py

I have just started using mrJob (mapReduce for python) and am new to the MapReduce paradigm. I would like to know the following about the word_count.py tutorial that is present on the MRJob documentation site. The docs say that if we create a…
anonuser0428
  • 11,789
  • 22
  • 63
  • 86
0
votes
1 answer

python mrjob module not found on CDH virtual machine

I'm using Mrjob to run python code in Hadoop. I'm using a CDH package with a virtual machine on a single-node cluster. My mrjob ran correctly when I tested the code locally, but when I ran it on the Hadoop cluster, it threw an error: No module named…
sky jiao
  • 79
  • 6
0
votes
1 answer

How can I write an iteration in Python using mrjob mapper reducer, for which the counter is a part of the computation in the loop?

I have a program that iterates a mapper and a reducer n times consecutively. However, for each iteration, the mapper of each key-value pair computes a value that depends on n. from mrjob.job import mrjob class MRWord(mrjob): def…
Pippi
  • 2,451
  • 8
  • 39
  • 59
0
votes
1 answer

Why does the main statement in a mrjob Python program accept only one line of code?

I want to know how long an mrjob program runs. However, I get an "unindent does not match any outer indentation level" error if I put in time.time() before and after MRWord.run(), and I couldn't find any documentation about this. What am I…
Pippi
  • 2,451
  • 8
  • 39
  • 59
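
A minimal sketch of timing a call with time.time(), assuming the error comes from inconsistent indentation (for example mixed tabs and spaces) rather than from mrjob itself. `run_job` is a hypothetical stand-in for MRWord.run(), and `timed` is a hypothetical helper; every line in a block uses the same four-space indentation.

```python
import time


def run_job():
    """Hypothetical stand-in for MRWord.run()."""
    return sum(range(1000))


def timed(fn):
    """Call fn() and return (result, elapsed_seconds)."""
    start = time.time()
    result = fn()
    return result, time.time() - start


if __name__ == "__main__":
    # A main block can hold any number of statements, as long as
    # they all share the same indentation.
    _, elapsed = timed(run_job)
    print("elapsed: %.3f s" % elapsed)
```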
0
votes
2 answers

How should data files be included to mrjob on EMR?

I am trying to run a mrjob on Amazon's EMR. I've tested the job locally using the inline runner, but it fails when running on Amazon. I've narrowed the failure down to my dependence on an external data file zip_codes.txt. If I run without that…
fixedpoint
  • 1,575
  • 1
  • 17
  • 24
0
votes
1 answer

How to create a hadoop runner?

I have the following simple mrjob script, which reads a large file line by line, performs an operation on each line and prints the output: #!/usr/bin/env python …
Frank
  • 64,140
  • 93
  • 237
  • 324
0
votes
1 answer

How do I cancel mrJob once it's running? ^C doesn't work

Is there a simple way to make mrJob scripts interruptable? Pretty simple question, but it makes a big difference for debugging. I'm mainly interested in canceling python-only test jobs, because this is where most debugging happens. python…
Abe
  • 22,738
  • 26
  • 82
  • 111
0
votes
1 answer

MRJob - Python - How to return null when division is 0/value

How can I modify this code so when senti_avg is not divisible (0/value), reducer() outputs NULL or NONE instead of crashing? def reducer(self, bs_id, value): avg_data = list(value) senti_sum = sum([a[0] for a in avg_data]) word_sum =…
Nicolas Hung
  • 595
  • 1
  • 6
  • 15
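
A minimal sketch of guarding the division, independent of mrjob: return None when the divisor is zero instead of letting a ZeroDivisionError propagate. The name `safe_average` and its parameters are hypothetical, not taken from the asker's reducer.

```python
def safe_average(total, count):
    """Return total / count, or None when count is zero."""
    if count == 0:
        return None
    return total / count
```

A reducer could then yield the key together with `safe_average(...)`, so that a zero divisor produces a null value rather than a crash.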
0
votes
0 answers

EMR No output for a long time

I have a MapReduce job written in python using the MRJob library. The job takes around 30 mins to complete on my local machine. While running the same job on EMR, I am seeing no output for a long time (about 1 hour). I had to close down the job. Also the…
Read Q
  • 1,405
  • 2
  • 14
  • 26