Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that assists with creating and running Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
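
For readers new to the model, here is a minimal sketch of the map, sort, reduce flow that a Hadoop Streaming word count follows. It is plain Python and deliberately does not use mrjob's API; the names mapper, reducer and run_local are illustrative assumptions, and the word regex is borrowed from the wordcount question further down this page.

```python
import re
from itertools import groupby
from operator import itemgetter

# Same word pattern as the wordcount question listed on this page.
WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Sum all counts that were emitted for one word."""
    yield word, sum(counts)


def run_local(lines):
    """Simulate map -> sort (shuffle) -> reduce in a single process.

    Hypothetical helper for illustration only; a real job would be
    distributed across a cluster by Hadoop Streaming.
    """
    mapped = [pair for line in lines for pair in mapper(line)]
    # Sorting by key stands in for the shuffle phase: it brings
    # identical words next to each other so they can be grouped.
    mapped.sort(key=itemgetter(0))
    results = []
    for word, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(word, (count for _, count in group)))
    return results


if __name__ == "__main__":
    for word, total in run_local(["the quick brown fox", "the lazy dog"]):
        print(word, total)
```

The point of the sketch is only the shape of the computation: a mapper that emits key/value pairs, a grouping step by key, and a reducer that folds each group.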

331 questions

0 answers

How do you build oddjob on Ubuntu (classpath errors)?

I am running Ubuntu 14.04, and using mrjob to run some Hadoop tasks on Amazon Elastic MapReduce. I'd like to use oddjob with it. oddjob is a Java package. I have not used Java in a decade, so I'm getting stuck on what I think are relatively simple…

java ubuntu hadoop classpath mrjob

asked Nov 14 '14 at 12:20

user274045

1 answer

ImportError: No module named job

I'm trying to import mrjob so I can run a script. It was working fine about an hour ago, and then I changed some code around to try to make my job faster. When I run this import: from mrjob.job import MRJob I get this: Traceback (most recent call…

python mrjob

asked Nov 13 '14 at 04:48

user3609964

1 answer

How to run multiple mrjob tasks with different parameters

I have such a job: from mrjob.job import MRJob from mrjob.step import MRStep import urllib import re import httpagentparser UA_STRING = re.compile(MYSUPERCOMPLEXREGEX) class MRReferralAnalysis(MRJob): def mapper(self, _, line): for…

python mapreduce mrjob

asked Nov 03 '14 at 09:35

Stephan Kristyn

1 answer

Map Reduce that counts a parameter from a line and then counts a second parameter

Imagine I have a logfile full of lines of the form "a,b,c", where a, b and c are variables that can have any value. Values do recur, and those recurrences are what this analysis is about. First Step Map all 'c' URLs, where 'a' equals a specific…

python hadoop mapreduce mrjob

asked Oct 31 '14 at 15:25

Stephan Kristyn

1 answer

MRJob fails to start new jobs on EMR when using --pool-emr-job-flows

I am using MRJob to run an iterative Hadoop program on Amazon's EMR. Everything works fine (but slowly) when I'm not using the "--pool-emr-job-flows" option. When I use this option, I get: Traceback (most recent call last): File "ic_bfs_eval.py", line…

python hadoop mrjob

asked Oct 30 '14 at 13:28

JoelO

1 answer

MRJob fails with Hadoop error copyToLocal: [...] No such file or directory

MRJob fails with error. I'm running a simple Hadoop job using MRJob on an EMR cluster. The job starts normally but then Job launched 181.2s ago, status STARTING: Provisioning Amazon EC2 capacity Job launched 211.4s ago, status STARTING: Provisioning…

python amazon-s3 emr amazon-emr mrjob

asked Oct 27 '14 at 21:26

lechatpito

0 answers

Optimise RegEx Time complexity of mrJob Script containing regex

How could you optimise this MapReduce job (mrjob)? Using this script now, any idea how to optimise it? I am using a lookahead to search for the ur=www.domain.de and then mapping and counting the r2 occurrences. from mrjob.job import MRJob from mrjob.step…

regex hadoop mrjob

asked Oct 23 '14 at 12:28

Stephan Kristyn

1 answer

mrjob bad --steps error using make_runner on Hadoop cluster

I'm trying to run a simple wordcount example programmatically, but I can't make the code work on a Hadoop cluster. Job in test_job.py: from mrjob.job import MRJob import re WORD_RE = re.compile(r"[\w']+") class MRWordFreqCount(MRJob): def…

python hadoop mrjob

asked Oct 19 '14 at 11:01

Mehraban

2 answers

Cannot import module in mrjob

I've tried to change the wordcount example using mrjob. My project structure is:

├── input_text.txt
├── store_xml_dir
│   ├── xml_file.xml
│   └── xml_parse.py
└── wordcount.py

and the content of wordcount.py is: import os import sys cwdir =…

python mrjob

asked Oct 19 '14 at 10:09

Trigges

1 answer

Mrjob fails when running on Hadoop with the lxml library

I'm working on a project using Hadoop MapReduce. My project tree is shown below:

MyProject
├── parse_xml_file.py
├── store_xml_directory
│   └── my_xml_file.xml
├── requirements.txt
├── input_to_hadoop.txt
└── testMrjob.py

I've run…

lxml hadoop-streaming mrjob

asked Oct 18 '14 at 15:25

kha

0 answers

MRJob reducer gives no output on EMR but produces output when run on a local machine

When I execute a MapReduce job on a local setup I get the desired output from the reducer, while the same code on EMR does not produce any. I have a cluster setup of 1 master and 10 core nodes. This is the output. There is no error displayed. Map-Reduce…

hadoop emr mrjob

asked Oct 08 '14 at 12:54

MUKUND

1 answer

In Python MRJob, how to set the option for the temporary output directory

I am using MRJob to run a very simple word count as a standard Hadoop job: python word_count.py -r hadoop hdfs:///path-to-my-data This prints an error indicating that it cannot create the temporary directory for temporary output: STDERR: mkdir:…

hadoop hadoop-streaming mrjob

asked Sep 09 '14 at 20:40

Causality

1 answer

What is a specific syntax example to load S3 data to HDFS prior to running steps in MRJob?

When I run my MRJob script and use the CLI to spin up EMR clusters for the work, I am trying to figure out how to load the data from S3 onto HDFS in the clusters. I want to do this as part of the setup process. I've searched a number of places to…

java hdfs emr mrjob

asked Jul 11 '14 at 07:19

nyghtowl

2 answers

Invalid SSH key running mrjob script on EMR

I'm going through this guide on how to get mrjob working on EMR. I follow all the steps, but when I run the example script I get this error: matthew@WinterMute:~/work/projects/mrjob_examples$ python word_count.py -r emr moby.txt using configs in…

amazon-web-services ssh emr mrjob

asked Jun 03 '14 at 18:24

mdornfe1

1 answer

How does one read binary input files in mrjob?

The input to my MapReduce program is a set of binary files. I want to be able to read them through mrjob. After some research it seems I have to write a custom hadoop streaming jar. Is there a simpler way? Or is such a jar readily available? …

binaryfiles mrjob

asked May 18 '14 at 21:17

krishnapp
