Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that assists in creating and running Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
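
For readers new to the map/reduce model behind Hadoop Streaming jobs, here is a rough sketch of a word count written as plain Python generator functions. It deliberately does not import mrjob: `run_locally` is a hypothetical helper, invented here only to imitate the grouping step that Hadoop performs between the map and reduce phases, and the sample input lines are made up for illustration.

```python
from collections import defaultdict


def mapper(_, line):
    """Emit a (word, 1) pair for every whitespace-separated word in a line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Sum all the counts that were collected for a single word."""
    yield word, sum(counts)


def run_locally(lines):
    """Hypothetical stand-in for the shuffle/sort step between map and reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            grouped[key].append(value)

    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


# Made-up sample input, just to exercise the two functions.
word_counts = run_locally(["the cat sat", "The cat ran"])
print(word_counts)
```

In an actual mrjob script, functions of this shape would instead be methods (taking `self` as the first argument) on a subclass of `mrjob.job.MRJob`, and the script would end by calling that class's `run()` method.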
331 questions
0
votes
1 answer

Attach the same EBS snapshot to every EMR volume?

I want to work with an EBS snapshot in an EMR job. Because the mapper reads from the snapshot, I want the snapshot mounted on every node. Is there an easy way to do that other than logging in to each node? I guess I could make the first step of my…
vy32
  • 28,461
  • 37
  • 122
  • 246
0
votes
1 answer

python: ImportError: cannot import name MRjob

In the Canopy editor, while executing "from mrjob.job import MRjob", I am getting "ImportError: cannot import name MRjob". I am not sure what's wrong here. Can anybody please suggest a fix? Thanks in advance. Regards, DP
0
votes
2 answers

MapReduce: Finding the triangles in a network graph using Mrjob

I've an application where I have a graph and I need to count the number of triangles in the graph using MrJob (MapReduce in Python). However, I'm having some trouble wrapping my head around the mapping and the reducing steps needed. What is the…
chribsen
  • 6,232
  • 4
  • 26
  • 24
0
votes
1 answer

Explanation of this MRJob example

from mrjob.job import job class KittyJob(MRJob): OUTPUT_PROTOCOL = JSONValueProtocol def mapper_cmd(self): return "grep kitty" def reducer(self, key, values): yield None, sum(1 for _ in values) if __name__ ==…
Ankur Agarwal
  • 23,692
  • 41
  • 137
  • 208
0
votes
0 answers

Processing MongoDB in AWS EMR with Python

I'm trying to do a map reduce using mrjob and Python against a MongoDB database. The mongodb-hadoop connector has examples on how to use AWS EMR but not with mrjob, and I'm not quite getting all the bits together. Here is what I have already as far…
Photonica
  • 1
  • 2
0
votes
1 answer

run mrjob on Amazon EMR, t2.micro not supported

I tried to run an mrjob script on Amazon EMR. It worked well when I used a c1.medium instance; however, it produced an error when I changed the instance to t2.micro. The full error message is shown below. C:\Users\Administrator\MyIpython>python word_count.py…
neil ye
  • 5
  • 1
0
votes
0 answers

mrjob ssh configuration error

I am new to mrjob and am trying to run the basic word count script from the mrjob documentation. I could run it successfully on EMR with the SSH tunnel disabled (ssh_tunnel_to_job_tracker: false). However, when I changed the option to true and ran the script, I kept…
neil ye
  • 5
  • 1
0
votes
1 answer

mrjob error: DescribeJobFlows API is deprecated

I am using mrjob for the first time and trying to run the basic word count code on EMR. I followed every step in the mrjob documentation here, yet I still got an error.
neil ye
  • 5
  • 1
0
votes
1 answer

MRJob and python - .csv file output for Reducer?

I'm using the MRJob module for python 2.7. I have created a class that inherits from MRJob, and have correctly mapped everything using the inherited mapper function. Problem is, I would like to have the reducer function output a .csv file...here is…
0
votes
2 answers

how to get the average number of words in a text in mrjob?

I'm stuck on a simple problem in the mrjob MapReduce framework: I want to get the average number of words in a given paragraph, and I got this: class LineAverage(MRJob): def mapper(self, _, line): numwords = len(line.split()) yield "words",…
Dade
  • 33
  • 1
  • 8
0
votes
4 answers

Install BeautifulSoup, mrjob, pattern, and seaborn on python 2.7 on Jupyter

I am learning how to use the new Jupyter. I want to install the packages BeautifulSoup, mrjob, pattern, and seaborn on Python 2.7. I first tried to do so by running pip install BeautifulSoup mrjob pattern seaborn, which returns: SyntaxError: invalid…
enaJ
  • 1,565
  • 5
  • 16
  • 29
0
votes
1 answer

mrjob InstanceProfile is required for creating cluster

I'm trying to run an instance on Amazon EC2 using Python MRJob. Here is the simple Python script to find the most used word in a txt file: from mrjob.job import MRJob class MRWordFrequencyCount(MRJob): def mapper(self, _, line): yield…
hero
  • 11
  • 1
0
votes
3 answers

Installing Anaconda on Ubuntu 14.04 - installing mrjob

The installation went fine, except for the last three packages: mrjob, pattern, and seaborn. I was able to install these from a terminal; however, they installed into my old Python environment and not into the Anaconda environment. How can I install…
0
votes
1 answer

mrjob NoFIleFound Exception with cloudera cdh 5 cluster

I am getting this error while trying to run the mrjob example on the Hadoop cluster. I have set up my hadoop_home and I can also create a new dir on the HDFS file system. I can run Python map-reduce if I use Hadoop Streaming. It's only with mrjob I am…
user1525721
  • 336
  • 5
  • 12
0
votes
1 answer

MapReduce Job (written in python) run slow on EMR

I am trying to write a MapReduce job using python's MRJob package. The job processes ~36,000 files stored in S3. Each file is ~2MB. When I run the job locally (downloading the S3 bucket to my computer) it takes approximately 1 hour to run. However,…
DickJ
  • 313
  • 2
  • 9