Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which lets you buy time on a Hadoop cluster by the hour. It also works with your own Hadoop cluster.

Mrjob can be installed with pip:

pip install mrjob
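
For orientation, a job is a single Python class with mapper, combiner and reducer methods. The sketch below loosely follows the quickstart word-count example from the mrjob documentation; the file and input names are arbitrary.

from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):
    # count characters, words and lines in the input

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def combiner(self, key, values):
        yield key, sum(values)

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()

A local test run looks like python mr_word_count.py input.txt; swapping in -r hadoop or -r emr sends the same code to a cluster.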
331 questions
0
votes
0 answers

How to solve the error "Object of type function is not JSON serializable"

I have a mapper and reducer function as below. from mrjob.job import MRJob from mrjob.step import MRStep class SortNumMoviesDesc(MRJob): def steps(self): return [MRStep(mapper=self.mapper_retrieve_counts, reducer =…
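
A common cause of this error is yielding something mrjob cannot JSON-encode between steps, such as a method or generator object. The sketch below reuses the method names from the excerpt but otherwise rests on assumptions (tab-separated ratings data with the movie id in the second field); every yielded value stays a plain int, string or tuple.

from mrjob.job import MRJob
from mrjob.step import MRStep

class SortNumMoviesDesc(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_retrieve_counts,
                   reducer=self.reducer_sum_counts),
            MRStep(reducer=self.reducer_sort_counts),
        ]

    def mapper_retrieve_counts(self, _, line):
        # assumed tab-separated ratings data with the movie id in field 1
        fields = line.split('\t')
        yield fields[1], 1

    def reducer_sum_counts(self, movie_id, counts):
        # yield plain, JSON-serializable values; yielding a function or
        # generator object is what triggers the error in the title
        yield None, (sum(counts), movie_id)

    def reducer_sort_counts(self, _, count_movie_pairs):
        for count, movie_id in sorted(count_movie_pairs, reverse=True):
            yield movie_id, count

if __name__ == '__main__':
    SortNumMoviesDesc.run()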
0
votes
0 answers

Python MRJob Script Sorting Results - Top Ten Words Syllable Count

I am trying to make a job that takes in a text file, only processes words that are not in the STOPWORDS set, counts the number of syllables in each word, then returns the top 10 words with the most syllables, sorting the results. I believe…
Tony M
  • 13
  • 4
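
A minimal sketch of the counting part, assuming a crude syllable estimate (one syllable per run of vowels) and a small stand-in STOPWORDS set; ranking the top ten then needs a second step that funnels all pairs to one reducer, as sketched under a later question below.

import re
from mrjob.job import MRJob

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}  # assumed subset
WORD_RE = re.compile(r"[a-z']+")
VOWEL_GROUPS_RE = re.compile(r"[aeiouy]+")

class MRWordSyllables(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line.lower()):
            if word not in STOPWORDS:
                # crude estimate: one syllable per run of vowels
                yield word, len(VOWEL_GROUPS_RE.findall(word))

    def reducer(self, word, syllable_counts):
        # the estimate is the same for every occurrence, so keep one copy
        yield word, max(syllable_counts)

if __name__ == '__main__':
    MRWordSyllables.run()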
0
votes
0 answers

Python Hadoop mrjob: subprocess.CalledProcessError: Command returned non-zero exit status 1

I've recently been using the mrjob package on Python 3.7. I started Hadoop and created a wordaccount.py file, which calculates the frequency of each word in a .txt file. When I tried to run the file with python3 wordaccount.py -r hadoop…
yamato
  • 85
  • 1
  • 13
0
votes
1 answer

How to get the longest word with MRJob

I'm trying to find the longest word in the text file, from letter a to z. from mrjob.job import MRJob import re WORD_RE = re.compile(r"[\w']+") class MRWordFreqCount(MRJob): def mapper(self, _, line): for word in…
Phat Phat
  • 1
  • 1
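
One way to reduce this, reusing WORD_RE from the excerpt: emit every word under a single key and keep only the longest, breaking length ties alphabetically (the a-to-z part). A sketch:

import re
from mrjob.job import MRJob

WORD_RE = re.compile(r"[\w']+")

class MRLongestWord(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield None, word.lower()

    def combiner(self, _, words):
        # keep only the local winner so less data crosses the network
        yield None, min(words, key=lambda w: (-len(w), w))

    def reducer(self, _, words):
        # longest word overall, ties broken alphabetically
        yield "longest word", min(words, key=lambda w: (-len(w), w))

if __name__ == '__main__':
    MRLongestWord.run()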
0
votes
0 answers

How to run mrjob with HDFS on Ubuntu?

I am setting up Hadoop 3.3.1 on Ubuntu. I can run a jar file with HDFS normally (using Eclipse, adding the additional Hadoop jar libraries, then exporting), and mrjob runs fine locally, but when I run mrjob with HDFS errors appear. > python mrjob1.py -r hadoop…
robocon20x
  • 175
  • 8
0
votes
1 answer

Calculate the median of a list of values in parallel using Hadoop MapReduce

I'm new to Hadoop and mrjob. I have a text file with data "id groupId value" on each line. I am trying to calculate the median of all values in the text file using Hadoop MapReduce, but I'm stuck when it comes to calculating only the median…
AdamA
  • 25
  • 6
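
A hedged sketch of one approach, assuming the "id groupId value" line format from the question: group the values by groupId and let each reducer sort its group and pick the middle element. This only works while a group's values fit in one reducer's memory; a true large-scale median needs a histogram or sampling approach.

from mrjob.job import MRJob

class MRMedianPerGroup(MRJob):
    def mapper(self, _, line):
        # assumed input format from the question: "id groupId value"
        _id, group_id, value = line.split()
        yield group_id, float(value)

    def reducer(self, group_id, values):
        vals = sorted(values)
        n = len(vals)
        if n % 2:
            median = vals[n // 2]
        else:
            median = (vals[n // 2 - 1] + vals[n // 2]) / 2
        yield group_id, median

if __name__ == '__main__':
    MRMedianPerGroup.run()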
0
votes
1 answer

Finding Top Ten Word Syllable Count

I am trying to make a job that takes in a text file, then counts the number of syllables in each word, then ultimately returns the top 10 words with the most syllables. I'm able to get all of the word/syllable pairs sorted in descending order,…
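
A sketch of the ranking part, assuming the same vowel-run syllable estimate as above: a second MRStep funnels every (syllables, word) pair to a single reducer, which keeps the ten largest with heapq.

import heapq
import re
from mrjob.job import MRJob
from mrjob.step import MRStep

WORD_RE = re.compile(r"[a-z']+")
VOWEL_GROUPS_RE = re.compile(r"[aeiouy]+")

class MRTopTenSyllableWords(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_count_syllables,
                   reducer=self.reducer_one_per_word),
            MRStep(reducer=self.reducer_top_ten),
        ]

    def mapper_count_syllables(self, _, line):
        for word in WORD_RE.findall(line.lower()):
            yield word, len(VOWEL_GROUPS_RE.findall(word))

    def reducer_one_per_word(self, word, counts):
        # funnel every (syllables, word) pair to a single key so one
        # reducer in the next step can rank them all
        yield None, (max(counts), word)

    def reducer_top_ten(self, _, syllable_word_pairs):
        for syllables, word in heapq.nlargest(10, syllable_word_pairs):
            yield word, syllables

if __name__ == '__main__':
    MRTopTenSyllableWords.run()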
0
votes
1 answer

How to write an MRJob Python program for matrix addition

I have been trying to make a simple matrix addition program with the MRJob library. With a separate mapper and reducer it works fine locally and on a Hadoop cluster; now I am trying to create this program in a single…
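
A single-class sketch, assuming each input line has the form "matrix_name i j value" (for example "A 0 1 3.5"): the mapper keys each entry by its coordinates and the reducer sums the entries from both matrices at that coordinate.

from mrjob.job import MRJob

class MRMatrixAdd(MRJob):
    def mapper(self, _, line):
        # assumed input format: "matrix_name i j value", e.g. "A 0 1 3.5"
        name, i, j, value = line.split()
        yield (int(i), int(j)), float(value)

    def reducer(self, coord, values):
        # each (i, j) cell receives one value from A and one from B
        yield coord, sum(values)

if __name__ == '__main__':
    MRMatrixAdd.run()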
0
votes
0 answers

My mapper function doesn't unpack all the values in Python

I have a file with lines like this: Name_of_country,somedata,max or min indicator,degree,other data. So it goes like this: France,xxx,TMAX,30,.... Germany,xxx,TMIN,40,.... France,xxx,TMIN,10,..... Now I tried this code I have written…
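
One way to make the unpacking robust, assuming the France,xxx,TMAX,30,... layout from the question: use Python 3 extended unpacking so trailing extra fields land in a catch-all variable instead of breaking the assignment.

from mrjob.job import MRJob

class MRMaxMinDegrees(MRJob):
    def mapper(self, _, line):
        # assumed format: country,somedata,TMAX|TMIN,degree,<anything else>
        # *rest absorbs trailing fields so the unpacking never fails
        country, _somedata, kind, degree, *rest = line.split(',')
        if kind in ('TMAX', 'TMIN'):
            yield (country, kind), int(degree)

    def reducer(self, key, degrees):
        country, kind = key
        pick = max if kind == 'TMAX' else min
        yield country, (kind, pick(degrees))

if __name__ == '__main__':
    MRMaxMinDegrees.run()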
0
votes
1 answer

Counting relative frequency with pairs and stripes MapReduce

I am new to Python and I want to use the MrJob package to count the relative frequency of word pairs. I wrote the code below but it doesn't produce correct output. Can you please help me with my mistakes? f(B|A) = count(A, B) / count(A) = count(A, B) / Σ_B' count(A, B') import re from collections…
Learner
  • 39
  • 6
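
A sketch of the stripes formulation of this computation, assuming adjacent words count as a pair: each mapper emits a small dictionary of neighbor counts per word, and the reducer merges the dictionaries and divides by the word's total to get f(neighbor | word).

import re
from collections import Counter
from mrjob.job import MRJob

WORD_RE = re.compile(r"[\w']+")

class MRRelativeFrequencyStripes(MRJob):
    def mapper(self, _, line):
        words = WORD_RE.findall(line.lower())
        for i, word in enumerate(words[:-1]):
            # neighbor here is just the next word; a wider window is a
            # straightforward extension
            yield word, {words[i + 1]: 1}

    def combiner(self, word, stripes):
        yield word, self._merge(stripes)

    def reducer(self, word, stripes):
        merged = self._merge(stripes)
        total = sum(merged.values())
        for neighbor, count in merged.items():
            # f(neighbor | word) = count(word, neighbor) / count(word)
            yield (word, neighbor), count / total

    @staticmethod
    def _merge(stripes):
        merged = Counter()
        for stripe in stripes:
            merged.update(stripe)
        return dict(merged)

if __name__ == '__main__':
    MRRelativeFrequencyStripes.run()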
0
votes
1 answer

How to count the same item with multiple parameters in mrjob in Python?

I'm trying to write a map-reduce function in Python. I have a file that contains product information, and I want to count the number of products that belong to the same category and have the same version, like this:
user17488887
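
A minimal sketch, assuming comma-separated product records with the category in the second field and the version in the third: key each record by the (category, version) pair and count.

from mrjob.job import MRJob

class MRCountByCategoryVersion(MRJob):
    def mapper(self, _, line):
        # assumed comma-separated product records: name,category,version,...
        fields = line.split(',')
        category, version = fields[1], fields[2]
        yield (category, version), 1

    def reducer(self, category_version, counts):
        yield category_version, sum(counts)

if __name__ == '__main__':
    MRCountByCategoryVersion.run()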
0
votes
1 answer

My code is outputting a tuple of values and I would like it to be individual pairs; I need help understanding how to modify it

def mapper(self, _, line): stop_words = set(["to", "a", "an", "the", "for", "in", "on", "of", "at", "over", "with", "after", "and", "from", "new", "us", "by", "as", "man", "up", "says", "in", "out", "is", "be", "are", "not", "pm", "am", "off",…
CKZ
  • 37
  • 5
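
If the goal is one key/value record per word rather than one big tuple, the fix is usually to yield inside the loop. A sketch with a trimmed version of the stop-word set from the excerpt:

from mrjob.job import MRJob

STOP_WORDS = {"to", "a", "an", "the", "for", "in", "on", "of", "at",
              "over", "with", "after", "and", "from"}  # trimmed from the question

class MRWordPairs(MRJob):
    def mapper(self, _, line):
        for word in line.lower().split():
            if word not in STOP_WORDS:
                # yield a separate (word, 1) pair per word instead of
                # accumulating everything into one tuple
                yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordPairs.run()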
0
votes
1 answer

Write a job that counts the frequencies of words' first letters in a file, so if there are three words starting with "c" the answer would be "c 3"

I have the code below and get the word count, but I don't understand how to get the first-letter frequency of all the words. If there are three words starting with C in the file I would expect the outcome to be "C 3". I don't need to…
CKZ
  • 37
  • 5
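
A sketch of the first-letter variant of word count: key on the first letter instead of the whole word, so the reducer's sum is the number of words starting with that letter.

import re
from mrjob.job import MRJob

WORD_RE = re.compile(r"[A-Za-z']+")

class MRFirstLetterCount(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            # key on the first letter instead of the whole word
            yield word[0].lower(), 1

    def combiner(self, letter, counts):
        yield letter, sum(counts)

    def reducer(self, letter, counts):
        yield letter, sum(counts)

if __name__ == '__main__':
    MRFirstLetterCount.run()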
0
votes
1 answer

Cannot run MapReduce job on AWS EMR Spark application

I am trying to run this example from mrjob about running a word count MapReduce job on AWS EMR. This is the word count code example from mrjob: from mrjob.job import MRJob class MRWordFrequencyCount(MRJob): def mapper(self, _, line): …
huy
  • 1,648
  • 3
  • 14
  • 40
0
votes
1 answer

How to import other Python modules and packages

I have the following project structure, work_directory: merge.py, a_package (i.e. a Python file merge.py and a directory a_package under the directory "work_directory"). I wrote a MapReduce job using MRJob in merge.py, in which I need to…
luw
  • 207
  • 3
  • 14
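
A hedged sketch of one way to make a_package importable on the cluster, assuming a recent mrjob release where the DIRS class attribute uploads a local directory into each task's working directory; some_module and key_for are hypothetical names standing in for whatever merge.py actually needs.

# merge.py -- a sketch; assumes the DIRS attribute of recent mrjob
# releases, which copies a local directory next to each running task
from mrjob.job import MRJob

class MRMerge(MRJob):
    DIRS = ['a_package']  # ship the package alongside the job

    def mapper(self, _, line):
        # import inside the task, once the uploaded copy is in place
        from a_package import some_module  # hypothetical module name
        yield some_module.key_for(line), line  # hypothetical helper

    def reducer(self, key, lines):
        yield key, list(lines)

if __name__ == '__main__':
    MRMerge.run()

Run as python merge.py -r hadoop input.txt (or -r emr); locally the package is already importable from the working directory.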