Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that assists in creating and running Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
331 questions
0
votes
2 answers

How to execute a job on a server on Lambda without waiting for the response?

I am trying to spawn a mapreduce job using the mrjob library from AWS Lambda. The job takes longer than the 5 minute Lambda time limit, so I want to execute a remote job. Using the paramiko package, I ssh'd onto the server and ran a nohup command to…
user2820906
  • 195
  • 1
  • 15
0
votes
1 answer

How does this statement (yield "lines", 1) work in mrjob's official documentation?

I'm trying to understand the official mrjob example clearly: def mapper(self, _, line): yield "chars", len(line) yield "words", len(line.split()) yield "lines", 1 def reducer(self, key, values): yield key,…
reddevilzs
  • 18
  • 5
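For the `yield "lines", 1` question above, a minimal pure-Python sketch of the map, group, and reduce flow may help. It does not use mrjob itself: `run_job` is a hypothetical stand-in for the grouping the framework performs between mapper and reducer, and the summing reducer is an assumption, since the excerpt truncates the reducer body.

```python
from collections import defaultdict


def mapper(_, line):
    # Each yield emits one (key, value) pair; a single input line emits three.
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1  # one pair per input line, so summing these counts lines


def reducer(key, values):
    # Assumption: the reducer sums all values that share a key.
    yield key, sum(values)


def run_job(lines):
    # Hypothetical stand-in for the shuffle step: group mapper values by key.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            grouped[key].append(value)
    # Call the reducer once per key with all of that key's values.
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


totals = run_job(["hello world", "foo"])
print(totals)
```

Because every line contributes exactly one `("lines", 1)` pair, the reducer for the `"lines"` key receives one `1` per line, and their sum is the line count.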
0
votes
2 answers

MapReduce job container killed by Google Cloud Platform [Error code:143]

I tried to run a mapreduce job on a cluster in Google Cloud Platform using Python package mrjob as follows: python mr_script.py -r dataproc --cluster-id [CLUSTER-ID] [gs://DATAFILE_FOLDER] I can successfully run the very same script against the…
0
votes
1 answer

Parsing HTML .txt files in Hadoop via MapReduce using Python

I am very new to using the Hadoop platform and defining MapReduce functions, and I am having a difficult time trying to understand why this mapper is not working in my MapReduce script. I am trying to parse a collection of pages written as a string…
Wilson
  • 253
  • 2
  • 9
0
votes
1 answer

Amazon EMR: when attaching an EBS volume to an instance, how can I be sure that the volume is actually used?

In my mrjob.conf I configure the additional volume: Instances.InstanceGroups.member.2.EbsConfiguration.EbsBlockDeviceConfigs.member.1.VolumeSpecification.SizeInGB: 250 …
mirt
  • 1,453
  • 1
  • 17
  • 35
0
votes
1 answer

Python Error: No module named mrjob.job

I am running the following simple script from the book and getting the following error from mrjob.job import MRJob class MRWordCount(MRJob): def mapper(self, _, line): for word in line.split(): yield(word, 1) def reducer(self, word,…
muazfaiz
  • 4,611
  • 14
  • 50
  • 88
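For a "No module named mrjob.job" error like the one in this question, one common first check is whether the interpreter running the script can see the package at all. The sketch below uses only the standard library; `is_importable` is a hypothetical helper name, and it deliberately checks the top-level package rather than `mrjob.job`.

```python
import importlib.util
import sys


def is_importable(module_name):
    # find_spec returns None when a top-level module cannot be located.
    return importlib.util.find_spec(module_name) is not None


# Show which interpreter is running; pip may have installed into another one.
print("interpreter:", sys.executable)
print("mrjob importable:", is_importable("mrjob"))
```

If the result is False, the package is probably installed for a different Python than the one running the script. If it is True but the import still fails, a local file named mrjob.py shadowing the package is worth ruling out.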
0
votes
1 answer

Python: How can I index in MapReduce(MRJob)?

I want to index the reducer's results like this: 1 "EZmocAborM6z66rTzeZxzQ" 2 "FIk4lQQu1eTe2EpzQ4xhBA" 3 "myql3o3x22_ygECb8gVo7A" 4 "ojovtd9c8GIeDiB8e0mq2w" 5 "uVEoZmmL9yK0NMgadLL0CQ" My Python MRJob code: class MRUserDic(MRJob): …
user3595632
  • 5,380
  • 10
  • 55
  • 111
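The numbering asked for in this question can be sketched in plain Python. This is not the asker's MRJob code (which is truncated in the excerpt); `index_keys` is a hypothetical helper, and it assumes all keys are available in one place, which in a MapReduce setting would mean funnelling them through a single reducer.

```python
def index_keys(keys):
    # Deduplicate, sort, then number the keys starting from 1.
    return [(i, key) for i, key in enumerate(sorted(set(keys)), start=1)]


user_ids = [
    "myql3o3x22_ygECb8gVo7A",
    "EZmocAborM6z66rTzeZxzQ",
    "uVEoZmmL9yK0NMgadLL0CQ",
    "FIk4lQQu1eTe2EpzQ4xhBA",
    "ojovtd9c8GIeDiB8e0mq2w",
    "EZmocAborM6z66rTzeZxzQ",  # duplicate, dropped by set()
]

indexed = index_keys(user_ids)
for number, user_id in indexed:
    print(number, user_id)
```

A global sequence number cannot be assigned independently by parallel reducers, which is why the sketch assumes a single point that sees every key.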
0
votes
1 answer

How to run a MRJob in a local Hadoop Cluster with Hadoop Streaming?

I'm currently taking a Big Data Class, and one of my projects is to run my Mapper/Reducer on a Hadoop Cluster which is set up locally. I've been using Python along with the MRJob library for the class. Here is my current Python Code for the…
J.Halon
  • 57
  • 1
  • 9
0
votes
1 answer

Hadoop Error: Error launching job , bad input path : File does not exist.Streaming Command Failed

I am running an MRJob on Hadoop cluster & I am getting the following error: No configs found; falling back on auto-configuration Looking for hadoop binary in $PATH... Found hadoop binary: /usr/local/hadoop/bin/hadoop Using Hadoop version…
bhoots21304
  • 47
  • 11
0
votes
0 answers

Hadoop Error "Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1"

mapper.py is working fine. I ran mapper.py on my cluster and stored its output in part-0.txt. Exactly like a word-count job, I am trying to count the occurrences of every distinct key stored in the part-0.txt file. I tried copy-pasting the code from…
bhoots21304
  • 47
  • 11
0
votes
2 answers

Running MapReduce from Jupyter Notebook

I am trying to run MapReduce from Jupyter Notebook on a dataset in the u.data file, but I keep receiving an error message that says "TypeError: 'str' object doesn't support item deletion". How can I make the code run successfully? The u.data…
Fxs7576
  • 1,259
  • 4
  • 23
  • 31
0
votes
1 answer

mrjob does not work on Amazon EMR 5.x, but does run on EMR 4.8.3

I'm using mrjob on Amazon EMR. It works without flaw on EMR 4.8.3, but when I run it on EMR 5.x (any of them), something goes bonkers in the hadoop streaming API and I just get a lot of errors. My mrjob program is a very simple program that does…
vy32
  • 28,461
  • 37
  • 122
  • 246
0
votes
1 answer

Top N Record MapReduce on Python

I am new to MapReduce and I have a very simple question. I solved the WordCount problem and now want to change it to find the top N records in a text. Although I can sort all the words in the text, I cannot take the last N values. First, I read text and send…
ugur
  • 400
  • 6
  • 20
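The top-N step described in this question can be sketched without a full sort. The example below is plain Python rather than an MRJob class; `top_n_words` is a hypothetical helper, and lowercasing plus whitespace splitting is an assumed tokenization.

```python
import heapq
from collections import Counter


def top_n_words(lines, n):
    # Count words across all lines (the word-count part of the job).
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    # Keep only the n largest counts; ties are broken by the word itself.
    return heapq.nlargest(n, counts.items(), key=lambda item: (item[1], item[0]))


top = top_n_words(["a b a", "b a c"], 2)
print(top)
```

In a MapReduce job the same idea would typically run in a final step that sees every (word, count) pair, since the top N is a global property of the output.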
0
votes
0 answers

Company name matching Common Crawl using mrjob

I have a list of company names and details such as phone number, address, email, etc. I want to get their company_url. We thought of using the Google API to make requests, but it turns out to be costly. After searching I found Common_Crawl which was somewhat close…
0
votes
2 answers

Read the text from a file and sort according to numbers

I have a text file, say: cat 2 dog 4 bird 20 animal 3 I want to read this file and sort like this (according to numbers): cat 2 animal 3 dog 4 bird 20 Code tried so far: def txtsort(self, _, line): words = [] …
bharath
  • 11
  • 2
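The sort requested in this question comes down to comparing the numbers as integers rather than as strings. A minimal sketch, assuming each line holds one name and one number separated by whitespace (`sort_by_number` is a hypothetical helper, not the asker's `txtsort`):

```python
def sort_by_number(lines):
    pairs = []
    for line in lines:
        name, number = line.split()
        # Convert to int so that 20 sorts after 3 (as strings, "20" < "3").
        pairs.append((name, int(number)))
    return sorted(pairs, key=lambda pair: pair[1])


data = ["cat 2", "dog 4", "bird 20", "animal 3"]
ordered = sort_by_number(data)
for name, number in ordered:
    print(name, number)
```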