Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
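
As a rough illustration of the kind of job this tag is about, here is a minimal word-count sketch in plain Python. It does not import mrjob; it only emulates the map, shuffle/sort, and reduce phases that a Hadoop Streaming job goes through. The names `mapper`, `reducer`, and `run_local` are illustrative assumptions. In an actual mrjob job, `mapper` and `reducer` would be generator methods on a subclass of `mrjob.job.MRJob`, and the framework would handle the sorting and grouping.

```python
from itertools import groupby
from operator import itemgetter


def mapper(_, line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(key, values):
    """Reduce phase: sum all counts collected for a single key."""
    yield key, sum(values)


def run_local(lines):
    """Emulate map -> shuffle/sort -> reduce in a single process.

    This helper is hypothetical; it stands in for what Hadoop Streaming
    (or a job runner) would do between the two phases.
    """
    # Map: apply the mapper to every line and collect all key/value pairs.
    mapped = [pair for line in lines for pair in mapper(None, line)]

    # Shuffle/sort: bring identical keys next to each other.
    mapped.sort(key=itemgetter(0))

    # Reduce: feed each key and its grouped values to the reducer.
    results = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        for out_key, out_value in reducer(key, (value for _, value in group)):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    for word, count in sorted(run_local(["the quick brown fox", "the lazy dog"]).items()):
        print(word, count)
```

Keeping the mapper and reducer as generators that yield key/value pairs mirrors the shape that mrjob expects, so a sketch like this can be a convenient way to check the logic before running it on a cluster.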
331 questions
0
votes
0 answers

mrjob fails to mkdir Hadoop directory

This is my first time using mrjob, and I encounter the following problem when running my Python script with it: No configs found; falling back on auto-configuration Looking for hadoop binary in…
Joey
0
votes
1 answer

Connecting to Hive in MRJob

I need to process an input file and, for each record, check whether certain fields in it match the fields stored in a Hadoop cluster. We are thinking of using MRJob to process the input file and use…
0
votes
1 answer

MrJob on Hadoop can't import libraries

I am using CDH 5.7.2 and MrJob to submit a MapReduce job. In local mode everything works fine, but when I use -r hadoop it gives me the following error: Task Id : attempt_1471071791922_0005_m_000001_2, Status : FAILED Error:…
Vadym B.
0
votes
0 answers

mrjob limit filesize for output files

Does anyone know how to limit the maximum size of the S3 output files (part-r-00000, part-r-00001, etc.) from mrjob? In case it makes a difference, I'm compressing the output using the following in my .mrjob.conf file: jobconf: mapred.output.compress:…
Digan
0
votes
1 answer

Object is not recognized while performing a mapreduce job

I'm trying to run a simple MapReduce job and have the following datasets: bike.txt 1 Bike 1 2 Bike 2 3 Bike 4 4 Bike 4 5 Bike 4 bikenames.txt 1,Aap 2,Noot 3,Greet 4,Mies 5,Gazelle My aim is to write a MapReduce job that outputs the name of…
Frits Verstraten
0
votes
1 answer

Error while running MRJOB on AWS

I put the mrjob.conf file in the /home directory and tried to run the job from the command line, and I am getting this error: File "/Users/bimalthapa/anaconda/lib/python2.7/site-packages/mrjob-0.4.6-py2.7.egg/mrjob/conf.py", line 283, in conf_object_at_path …
dev
0
votes
0 answers

'ImportError: No module named' even after installation

In Jupyter Notebook, I try to import the 'mrjob' Python 2.7 package after making sure that it has been installed (pip2.7 list), but I receive an error. Does anyone know what I am missing? from mrjob.job import…
Fxs7576
0
votes
1 answer

Submit jobs to EMR cluster using MRJob

MRJob waits until each job completes before giving back control to the user. I broke down a large EMR step into smaller ones and would like to submit them all in one shot. The docs talk about programmatically submitting tasks, but the sample code…
Pykler
0
votes
1 answer

Identifying false alerts using Python MapReduce

Can someone help me with the following problem? I am trying to analyze a security log to find false alerts. The false alerts are those containing "TXT was not created" and the true alerts are those containing "txt was not created". How can I extract the particular…
Shiv
0
votes
2 answers

MRJob determining if running inline, local, emr or hadoop

I am building on some old code from a few years back that uses the Common Crawl dataset with EMR via MRJob. The code uses the following inside the MRJob subclass's mapper function to determine whether it is running locally or on EMR: self.options.runner ==…
Pykler
0
votes
1 answer

ImportError: No module named step

I am writing a MapReduce job in Python with the mrjob library. I installed the mrjob package, but when I run from mrjob.step import MRStep, I get this error: from mrjob.step import MRStep ImportError: No module named step Can anyone help me? Thanks so much
0
votes
0 answers

How to limit the number of processes in a local MRJob task

I am running a MapReduce job on an 8-core machine using MRJob. I wrote it using the Python API, and I run it as $ python main.py -r local files/input* There are ~750 input files in that folder, and when I run it that way, I believe mrjob launches…
user1496984
0
votes
1 answer

psycopg2.ProgrammingError: relation * already exists while populating a database via MRjob

I'm trying to populate a PostgreSQL database using MRjob. Some days ago someone here kindly suggested that I divide the mapper into steps. I tried, but I get an error: python db_store_hadoop.py -r local --dbname=en_ws xSparse.txt no configs found;…
Nacho
0
votes
1 answer

STDERR output from Hadoop: does this indicate an issue?

I'm using Mrjob-Hadoop with Python 2.7 on Ubuntu 14.04, and I got the following screen output: no configs found; falling back on auto-configuration no configs found; falling back on auto-configuration creating tmp directory…
Nacho
0
votes
1 answer

Scaling a Python mrjob program on Apache Hadoop

I am trying to run a simple MapReduce program on HDInsight through Azure. My program is written in Python and simply counts how many rows of numbers (time series) meet certain criteria. The final results are just counts for each category. My code…
klib