Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that assists with creating and running Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
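
For readers unfamiliar with what a Hadoop Streaming job looks like, here is a minimal sketch of the word-count flow in plain Python. It deliberately does not use mrjob's API: the function names `mapper`, `reducer`, and `run_locally` are illustrative assumptions, and the sort step only imitates what Hadoop does between the map and reduce phases. With mrjob, the same two steps are typically written as methods on an `MRJob` subclass.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit one tab-separated 'word<TAB>1' record per word, as a streaming mapper would print."""
    for word in line.split():
        yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts of consecutive records that share the same key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(int(count) for _, count in group)


def run_locally(lines):
    """Imitate map -> sort -> reduce in one process (hypothetical helper, not part of mrjob)."""
    mapped = [record for line in lines for record in mapper(line)]
    return dict(reducer(sorted(mapped)))


if __name__ == "__main__":
    print(run_locally(["the cat sat", "The cat ran"]))
```

The reducer relies on its input being sorted by key, which is why `run_locally` sorts the mapper output before reducing.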
331 questions
0
votes
1 answer

How to iterate through lines in MRJob

I have a text document full of lines of tweets that I need to run a MapReduce job on. I am using Python and MRJob to do so with the following code: from mrjob.job import MRJob import re import datetime class exerciseOne(MRJob): def…
faboys
  • 57
  • 1
  • 1
  • 8
0
votes
0 answers

How to access hdfs files directly in python?

I am working with the Hadoop and Spark frameworks to cluster images, using Python as my programming language. For the map-reduce framework, the MRJob package is used. My question is how to access the HDFS files directly in Python. For example…
Alay Majmudar
  • 60
  • 1
  • 9
0
votes
1 answer

mrjob returned non-zero exit status 256

I'm new to map reduce and I'm trying to run a map reduce job using the mrjob package for Python. However, I encountered this error: ERROR:mrjob.launch:Step 1 of 1 failed: Command '['/usr/bin/hadoop', 'jar',…
kkesley
  • 3,258
  • 1
  • 28
  • 55
0
votes
1 answer

mrjob add_file_arg() csv file

I'm having trouble understanding how to use the add_file_arg() for mrjob. I'm trying to pass a csv to my mapper with a person's attributes and find the attributes for each person in my mapper. This is my code thus far: class MRPeopleScores(MRJob): …
person10559
  • 57
  • 1
  • 11
0
votes
1 answer

Creating new SparkContext for each SparkStep in MRJob/ pySpark

I am new to pySpark and I'm trying to implement a multi-step EMR/Spark job using MRJob. Do I need to create a new SparkContext for each SparkStep, or can I share the same SparkContext for all SparkSteps? I tried to look up the MRJob manual but…
vkc
  • 556
  • 2
  • 8
  • 18
0
votes
1 answer

How to prematurely terminate MrJob reducer?

I want to use MapReduce to filter a huge dataset for rare entities satisfying some criteria. I could speed this up a lot by terminating reducers once they violate the criteria, since they will be computing on entities that I'm not interested in. To…
crypdick
  • 16,152
  • 7
  • 51
  • 74
0
votes
1 answer

Not a valid jar when I was running an example of Hadoop

I have recently started learning Hadoop, using a sandbox on VirtualBox. I downloaded a Python script that uses the mrjob framework and ran the following command: python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar…
Jacob
  • 1
  • 1
0
votes
0 answers

Recreate Python dictionary results in MapReduce?

Can't get my head around why standard Python code produces an unexpected result when translated to MapReduce using mrjob. Example data from a .txt file: 1 12 1 14 1 15 1 16 1 18 1 12 2 11 2 11 2 13 3 12 3 15 3 11 3 10 This code creates…
RDJ
  • 4,052
  • 9
  • 36
  • 54
0
votes
1 answer

Failed package installation in Python

I am trying to install the Mrjob package for Python and I get the following error: AJs-MacBook-Pro-13:~ aj$ conda install -c asmeurer mrjob Fetching package metadata ............. Solving package specifications: . UnsatisfiableError: The following…
aj31
  • 157
  • 1
  • 15
0
votes
1 answer

How to process rows from SQL query with MRJob

I am having a hard time figuring out how MRJob works. I am trying to run an SQL query and yield its rows, but the documentation does not explain this in detail. My code so far: # To be able to give db file as option. def…
B1nd0
  • 110
  • 1
  • 8
0
votes
1 answer

mapreduce for word frequency in Python

I want my Python program to output a list of the top ten most frequently used words and their associated word counts. I have to use mrjob (MapReduce) to create this program. I wrote a program that finds the frequency of the words and outputs them in…
Anna
  • 11
  • 2
  • 5
0
votes
0 answers

Run several jobs in a single file with mrjob

I have different jobs in separate .py files. These jobs perform different operations on a CSV file. Can I chain all of these jobs together in one file and save the final output to a CSV file?
hesse
  • 3
  • 6
0
votes
0 answers

Using MRJob to count bigrams: a type error occurs

I am new to writing map-reduce programs with mrjob. I need to use mrjob to count bigrams. Here is my code: import mrjob from mrjob.job import MRJob import re from itertools import islice, izip import itertools WORD_RE =…
0
votes
1 answer

Java error: org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedPartitioner not org.apache.hadoop.mapred.Partitioner

Exception in thread "main" java.lang.RuntimeException: class org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedPartitioner not org.apache.hadoop.mapred.Partitioner at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:2273) at…
Raj
  • 368
  • 1
  • 5
  • 17
0
votes
1 answer

-partitioner : class not found : org.apache.Hadoop.mapred.lib.KeyFieldBasedPartitioner

I am writing an MRJob job and want to partition my reducer output by key. I am using these options and get the following error. How do I use KeyFieldBasedPartitioner? Do I need to download something for this? The MRJob job is written in Python. Step 1 of 1…
Raj
  • 368
  • 1
  • 5
  • 17