Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
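
For readers new to the tag, here is a minimal sketch of the map-then-reduce shape that a word-count job follows. It is plain Python and deliberately does not import mrjob: in an actual mrjob job the two functions would be methods named mapper and reducer on a subclass of MRJob, and the framework would do the grouping. The run_local helper and the WORD_RE pattern are hypothetical names used only for illustration.

```python
import re
from collections import defaultdict

# Assumption: a "word" is a run of letters, digits, underscores or apostrophes.
WORD_RE = re.compile(r"[\w']+")


def mapper(_, line):
    """Emit (word, 1) for every word found on one input line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Sum all the counts that were collected for one word."""
    yield word, sum(counts)


def run_local(lines):
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    print(run_local(["the cat sat", "The dog sat down"]))
```

Keeping the mapper and reducer as small generator functions makes the logic easy to test locally before running it on a cluster.
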
331 questions
1
vote
1 answer

Python - Finding Top Ten Words Syllable Count

I am trying to make a job that takes in a text file, then counts the number of syllables in each word, then ultimately returns the top 10 words with the most syllables. I believe I have most of it down, but I am getting an error: File…
Tony M
  • 13
  • 4
1
vote
0 answers

MapReduce job that calculates chi-square values in Python

I am writing a MapReduce job in Python using MrJob. A sample of my dataset, which is in JSON: {"reviewerID": "A2VNYWOPJ13AFP", "asin": "0981850006", "reviewerName": "Amazon Customer \"carringt0n\"", "helpful": [6, 7], "reviewText": "This was a gift…
Ario
  • 11
  • 1
1
vote
0 answers

How to input multiple files with MRJob

I am learning Hadoop and want to use two different files in my script, but I don't know the terminal command that does this. To read one file I use: python script.py hdfs://dataset/u.data -r hadoop I want to read the file u.item too, which is in the…
1
vote
0 answers

Master tasks on Core Nodes using AWS EMR Hadoop

Using EMR 6.X series, how does one ensure that master tasks run on Core nodes? Reading this page it looks like all it takes are two parameters: yarn.node-labels.enabled: true yarn.node-labels.am.default-node-label-expression: 'CORE' However…
Stephen
  • 107
  • 9
1
vote
1 answer

How to use multistep mrjob with json file

I'm trying to use Hadoop to get some statistics from a JSON file, like the average number of stars for a category or the language with the most reviews. To do this I am using mrjob, and I found this code: import re from mrjob.job import MRJob from mrjob.protocol…
PabloGS
  • 91
  • 9
1
vote
1 answer

Python mrjob - Finding 10 longest words, but mrjob returns duplicate words

I am using Python mrjob to find the 10 longest words in a text file. I have obtained a result, but it contains duplicate words. How do I obtain only unique words (i.e., remove duplicate words)? %%file most_chars.py from mrjob.job import…
Nekojell
  • 35
  • 4
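
One possible way to think about the duplicate-word problem above, sketched in plain Python rather than mrjob and not taken from the posted answer: collapse the words into a set first, then pick the longest entries. In a MapReduce job the same effect usually comes from using the word itself as the key, so that each distinct word reaches the reduce stage once. The function name, the word pattern, and the alphabetical tie-break are assumptions made for illustration.

```python
import heapq
import re

# Assumption: words consist of ASCII letters and apostrophes only.
WORD_RE = re.compile(r"[A-Za-z']+")


def longest_unique_words(lines, n=10):
    """Return the n longest distinct words, longest first."""
    unique = set()
    for line in lines:
        # Lower-casing makes "Banana" and "banana" count as one word.
        unique.update(word.lower() for word in WORD_RE.findall(line))
    # Sort key is (length, word), so ties on length are broken alphabetically.
    return heapq.nlargest(n, unique, key=lambda word: (len(word), word))


if __name__ == "__main__":
    sample = ["apple banana apple", "Banana cherry kiwi"]
    print(longest_unique_words(sample, n=2))
```
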
1
vote
0 answers

Is there a way to pass command line arguments to mrjob?

Is there a way to pass command line arguments to mrjob, for example if you have a json file and you want to find how many keys have a certain, repeated value? In particular I have several json objects that have a location key and an item value so I…
JHBucy
  • 11
  • 1
1
vote
1 answer

How to do a Reduce Side Join as a Map Reduce Job with mrjob in Python

I have 2 datasets which I am trying to combine, namely the transactions dataset and the contract dataset, where I want to use address resp. to_address as the join attribute and the value attribute for the value. contract dataset fields: address,…
Hassan
  • 39
  • 6
1
vote
0 answers

How do you extract the line index in MrJob using MapReduce methods?

How do you extract the line index of any given line in MrJob? index_words = ["before", "remove"] class MRWordInvertedIndex(MRJob): # how to make the key(index) the line index of the corresponding value(line) in the input text file? def…
1
vote
0 answers

Use Pandas dataframe in mrJob

I have some Python code and I need to use mrjob to make my script faster. How do I make the script below use mrjob? The script below works fine for a small file, but when I run it on a large file it takes forever. So I am planning to use mrjob which…
st_bones
  • 119
  • 1
  • 3
  • 12
1
vote
0 answers

Python dictionary weird behavior in mrjob

I'm writing code that reads two input files and calculates some statistics, like average rating by country. I'm using the mrjob library, because the idea is that I'm able to run this on Hadoop. Below are samples from those input files. Users…
jiipeezz
  • 235
  • 4
  • 10
1
vote
0 answers

Using MapReducer MRJob and my mapper function gives me an indexerror: list index out of range

I am new to MapReduce MRJob (and also to Python, to be honest). I am trying to use MRJob to count the number of combinations of pairs of letters in different columns, from "A" to "E", that I have in a text file, i.e. "A", "A" = 10 occurrences, "A",…
1
vote
0 answers

Morphological analysis of words with MRJob and Pymorphy2

Can anyone help with MRJob and Pymorphy2? I am new to Python and Hadoop. I sort of understand how to perform text tokenisation, but I cannot understand how to morphologically disassemble the resulting tokens using Pymorphy2. Maybe I am doing…
GreatGohan
  • 63
  • 4
1
vote
1 answer

Mrjob step is failing. How do I debug?

I am trying to run a sample mrjob job in an EMR cluster. I created the EMR cluster manually in the AWS dashboard and started mrjob as follows: python keywords.py -r emr s3://commoncrawl/crawl-data/CC-MAIN-2018-34/wet.paths.gz --cluster-id j-22GFG1FUGS12L Job is…
Javith
  • 41
  • 4
1
vote
0 answers

How to process images in Hadoop using python?

My objective is to apply the map-reduce framework to cluster images using Hadoop. For map-reduce I am using the Python programming language and the MRJOB package. But I am not able to work out the logic of how to process the images. Like i have the…
Alay Majmudar
  • 60
  • 1
  • 9