Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that assists in creating and running Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
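
For readers new to Hadoop Streaming, here is a minimal sketch of the contract that mrjob wraps: a mapper turns input lines into tab-separated key/value lines, the framework sorts them, and a reducer sums each run of identical keys. It deliberately does not use mrjob's API. The names `map_lines`, `reduce_lines`, and `run_local` are hypothetical, and `sorted()` stands in for Hadoop's shuffle-and-sort step.

```python
import itertools
import re

# Assumption for illustration: a "word" is a run of letters, digits, or apostrophes.
WORD_RE = re.compile(r"[\w']+")


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in WORD_RE.findall(line.lower()):
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each run of identical keys.

    Relies on the input being sorted by key, as Hadoop Streaming guarantees
    for the lines delivered to a single reducer.
    """
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Run map, sort, and reduce in one process (no Hadoop involved)."""
    return list(reduce_lines(sorted(map_lines(lines))))
```

With mrjob, the same logic would typically be written as `mapper` and `reducer` methods on an `MRJob` subclass, and the library takes care of the stdin/stdout plumbing.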
331 questions
1
vote
1 answer

MRJob same key gets sent to different reducers

So I have Hadoop 2.7.1 installed on a 3-machine cluster. I'm trying to run an inverted index MapReduce job using MRJob and Hadoop Streaming. Here's my configuration: MRJob.SORT_VALUES = True def steps(self): JOBCONF_STEP1 = { …
Jack
  • 486
  • 2
  • 5
  • 19
1
vote
1 answer

Error when running mapreduce function

I have the following dataframe:

userID  movieID  rating  timestamp
1       1        9       12
1       2        10      13

I called this dataframe mapper1.txt and stored it in the same dir as this python file: from mrjob.job import MRJob class MRRatingCounter(MRJob): …
Frits Verstraten
  • 2,049
  • 7
  • 22
  • 41
1
vote
1 answer

Bootstrapping dependencies on Amazon EMR with python Mrjob

I am trying to run a map reduce job on Amazon EMR with python Mrjob and I have some trouble installing dependencies. My mrjob code: from mrjob.job import MRJob import re from normalize import * from compute_features import * #Some code The…
1
vote
1 answer

Python mrjob mapreduce how to preprocess the input file

I am trying to pre-process an XML file to extract certain nodes before putting it into mapreduce. I have the following code: from mrjob.compat import jobconf_from_env from mrjob.job import MRJob from mrjob.util import cmd_line, bash_wrap class…
DigitalPig
  • 83
  • 6
1
vote
1 answer

TotalOrderPartitioner and mrjob

How does one specify the TotalOrderPartitioner when using mrjob? Is this the default, or must it be specified explicitly? I've seen inconsistent behavior on different data sets.
vy32
  • 28,461
  • 37
  • 122
  • 246
1
vote
1 answer

How to populate a postgresql database with Mrjob and Hadoop

I would like to populate a PostgreSQL database by using a mapper with MrJob and Hadoop 2.7.1. I am currently using the following code: # -*- coding: utf-8 -*- #Script for storing the sparse data into a database by using Hadoop import psycopg2 import…
Nacho
  • 792
  • 1
  • 5
  • 23
1
vote
1 answer

Running Map/Reduce python programs from Sublime Text 2

I just started a tutorial series on map reduce and Hadoop. The setup instructions call for using an IDE called Canopy with MRjob. I have installed both, and everything works. But... If Canopy is just a Python IDE, couldn't I use anything in its…
StillLearningToCode
  • 2,271
  • 4
  • 27
  • 46
1
vote
1 answer

Read multiple HDFS files or S3 files with mrjob?

I have a large amount of data stored in an HDFS system (or, alternatively, in Amazon S3). I want to process it using mrjob. Unfortunately, when I run mrjob and give the HDFS file name or the containing directory name, I get an error. For example, here…
vy32
  • 28,461
  • 37
  • 122
  • 246
1
vote
1 answer

Regular expressions in python map reduce: Counting words with «ñ» and accented vowels

I use a regular expression to handle accented vowels and «ñ» in Spanish texts in the following way: WORD_REGEXP = re.compile(r"[a-zA-Záéíóúñ]+") Although it works fine with any string, when I execute the map reduce program, it doesn't…
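
The excerpt is cut off before the failure is described, so the following is only a sketch of how the quoted pattern behaves, not a diagnosis. It assumes Python 3 and UTF-8 encoded input. The helper names `words` and `words_from_bytes` are hypothetical. One thing visible from the pattern itself is that it lists only lowercase accented letters, so uppercase forms such as «Ñ» are not matched.

```python
import re

# Pattern copied from the question: ASCII letters plus lowercase á é í ó ú ñ.
WORD_REGEXP = re.compile(r"[a-zA-Záéíóúñ]+")


def words(text):
    """Return every run of characters in `text` that the pattern accepts."""
    return WORD_REGEXP.findall(text)


def words_from_bytes(raw, encoding="utf-8"):
    """Decode raw bytes first, so the pattern sees characters rather than bytes.

    Assumption: the input really is encoded as `encoding` (UTF-8 by default).
    """
    return words(raw.decode(encoding))
```

If the input reaches the regular expression as undecoded bytes, or in a different encoding than expected, the accented characters will not match the `str` pattern, which is why the second helper decodes explicitly.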
1
vote
0 answers

How to read images from Hadoop sequence file using opencv and MrJob?

I created a sequence file from a tar file full of images with tar-to-seq.jar. Now I want to create images out of bytes from that sequence file and analyze them. I'm using opencv 3.0.0 and mrjob version 0.5. I'm having trouble reading the image using…
Milos Miletic
  • 500
  • 6
  • 19
1
vote
2 answers

hadoop python job on snappy files produces 0 size output

When I run wordcount.py (python mrjob http://mrjob.readthedocs.org/en/latest/guides/quickstart.html#writing-your-first-job) using hadoop streaming on a text file it gives me the output, but when the same is run against .snappy files I got zero size…
user3369417
  • 358
  • 1
  • 4
  • 17
1
vote
0 answers

MrJob's MRJob.set_up_logging() function deprecated?

I'm writing a wrapper to run MrJob jobs with, and it's working quite well but I'd like to be able to deliver the Python stack trace from the job if it throws an exception. Originally, when something went wrong (i.e. an assert in the job code fails)…
Eli Rose
  • 6,788
  • 8
  • 35
  • 55
1
vote
1 answer

MrJob spends a lot of time Copying local files into hdfs

The problem I'm encountering is this: Having already put my input.txt (50MBytes) file into HDFS, I'm running python ./test.py hdfs:///user/myself/input.txt -r hadoop --hadoop-bin /usr/bin/hadoop It seems that MrJob spends a lot of time copying…
Nikos
  • 95
  • 1
  • 9
1
vote
0 answers

Running MapReduce job on hadoop remote cluster

I would like to know if there is a method for running a MapReduce job on a remote Hadoop cluster. At my university there is a cluster which has Hadoop installed, so I have been learning MapReduce for distributing Machine Learning jobs. However, I…
Nacho
  • 792
  • 1
  • 5
  • 23
1
vote
1 answer

socket.gaierror when trying to run emr using python mrjob

I am currently trying to learn mrjob and how to use it on AWS EMR, so please forgive me if I am asking an already-asked question (I searched many places but did not find the answer), and sorry if it is a silly question. This is my python script: from…
The6thSense
  • 8,103
  • 8
  • 31
  • 65