Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which lets you buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
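
As a rough illustration of the map, sort, and reduce pattern that a Streaming-style word count follows, here is a plain-Python sketch. It deliberately does not use mrjob itself, and the names `mapper`, `reducer`, and `run_local` are made up for this example; a real mrjob job would put similar logic in methods of an `MRJob` subclass.

```python
import re
from itertools import groupby
from operator import itemgetter

# Same word pattern as the wordcount question further down the list.
WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Sum all the counts seen for a single word."""
    yield word, sum(counts)


def run_local(lines):
    """Hypothetical local driver: map, sort by key, then reduce."""
    pairs = [pair for line in lines for pair in mapper(line)]
    pairs.sort(key=itemgetter(0))  # stands in for the shuffle/sort phase
    results = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        for key, total in reducer(word, (count for _, count in group)):
            results[key] = total
    return results
```

Calling `run_local` on a list of strings returns a dictionary that maps each lowercased word to the number of times it appears.
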
331 questions
0 votes, 0 answers

How do you build oddjob on Ubuntu (classpath errors)?

I am running Ubuntu 14.04, and using mrjob to run some Hadoop tasks on Amazon Elastic MapReduce. I'd like to use oddjob with it. oddjob is a Java package. I have not used Java in a decade, so I'm getting stuck on what I think are relatively simple…
user274045
0 votes, 1 answer

ImportError: No module named job

I'm trying to import mrjob so I can run a script. It was working fine about an hour ago, and then I changed some code around to try to make my job faster. When I run this import: from mrjob.job import MRJob I get this: Traceback (most recent call…
user3609964
0 votes, 1 answer

How to run multiple mrjob tasks with different parameters

I have a job like this: from mrjob.job import MRJob from mrjob.step import MRStep import urllib import re import httpagentparser UA_STRING = re.compile(MYSUPERCOMPLEXREGEX) class MRReferralAnalysis(MRJob): def mapper(self, _, line): for…
Stephan Kristyn
0 votes, 1 answer

MapReduce that counts a parameter from a line and then counts a second parameter

Imagine I have a logfile full of lines of the form "a,b,c", where a, b, and c are variables that can have any value. Values do recur, and those recurrences are what this analysis is about. First step: map all 'c' URLs, where 'a' equals a specific…
Stephan Kristyn
0 votes, 1 answer

MRJob fails to start new jobs on EMR when using --pool-emr-job-flows

I am using MRJob to run an iterative Hadoop program on Amazon's EMR. Everything works fine (but slowly) when I'm not using the "--pool-emr-job-flows" option. When I use this option, I get: Traceback (most recent call last): File "ic_bfs_eval.py", line…
JoelO
0 votes, 1 answer

MRJob fails with Hadoop error copyToLocal: [...] No such file or directory

MRJob fails with an error. I'm running a simple Hadoop job using MRJob on an EMR cluster. The job starts normally but then Job launched 181.2s ago, status STARTING: Provisioning Amazon EC2 capacity Job launched 211.4s ago, status STARTING: Provisioning…
lechatpito
0 votes, 0 answers

Optimise regex time complexity of an mrjob script

How could you optimise this MapReduce job (mrjob)? I am using this script now; any idea how to optimise it? I am using a lookahead to search for the ur=www.domain.de and then mapping and counting the r2 occurrences. from mrjob.job import MRJob from mrjob.step…
Stephan Kristyn
0 votes, 1 answer

mrjob bad --steps error using make_runner on Hadoop cluster

I'm trying to run a simple wordcount example programmatically, but I can't make the code work on a Hadoop cluster. Job in test_job.py: from mrjob.job import MRJob import re WORD_RE = re.compile(r"[\w']+") class MRWordFreqCount(MRJob): def…
Mehraban
0 votes, 2 answers

Cannot import module in mrjob

I've tried to change the wordcount example using mrjob. My project structure is:
├── input_text.txt
├── store_xml_dir
│   ├── xml_file.xml
│   └── xml_parse.py
└── wordcount.py
and the content of wordcount.py is: import os import sys cwdir =…
Trigges
0 votes, 1 answer

Mrjob fails when running on Hadoop with the lxml library

I'm working on a project using Hadoop MapReduce. My project tree is shown below:
MyProject
├── parse_xml_file.py
├── store_xml_directory
│   └── my_xml_file.xml
├── requirements.txt
├── input_to_hadoop.txt
└── testMrjob.py
I've run…
kha
0 votes, 0 answers

MRJob reducer gives no output on EMR but provides output when run on a local machine

When I execute a MapReduce job on a local setup I get the desired output from the reducer, while the same code on EMR does not produce any. I have a cluster setup of 1 master and 10 core nodes. This is the output. There is no error displayed. Map-Reduce…
MUKUND
0 votes, 1 answer

In Python MRJob, how to set the option for the temporary output directory

I am using MRJob to run a very simple word count as a standard Hadoop job: python word_count.py -r hadoop hdfs:///path-to-my-data This prints an error indicating that it cannot create the temporary directory for temporary output: STDERR: mkdir:…
Causality
0 votes, 1 answer

What is a specific syntax example to load S3 data to HDFS prior to running steps in MRJob?

When I run my MRJob script and use the CLI to spin up EMR clusters for the work, I am trying to figure out how to load the data from S3 onto HDFS in the clusters. I want to do this as part of the setup process. I've searched a number of places to…
nyghtowl
0 votes, 2 answers

Invalid SSH key when running mrjob script on EMR

I'm going through this guide on how to get mrjob working on EMR. I follow all the steps, but when I run the example script I get this error: matthew@WinterMute:~/work/projects/mrjob_examples$ python word_count.py -r emr moby.txt using configs in…
mdornfe1
0 votes, 1 answer

How does one read binary input files in mrjob?

The input to my MapReduce program is a set of binary files. I want to be able to read them through mrjob. After some research it seems I have to write a custom Hadoop Streaming jar. Is there a simpler way? Or is such a jar readily available? …
krishnapp