Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that assists the creation and running of Hadoop Streaming jobs

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
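
For orientation, here is a rough plain-Python sketch of the map, shuffle, and reduce phases that a Hadoop Streaming word-count job goes through. It deliberately does not use mrjob's API: the names `mapper`, `reducer`, and `run_job` are illustrative assumptions, and everything runs in a single local process. An mrjob job expresses the same two steps as `mapper` and `reducer` methods on a subclass of `MRJob`.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for each whitespace-separated word."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce phase: sum all counts that were grouped under one word."""
    yield word, sum(counts)


def run_job(lines):
    """Hypothetical local driver: map every line, group by key, then reduce."""
    # Shuffle step: collect every value emitted for the same key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)

    # Each key is reduced independently, which is what lets a real
    # cluster spread the keys across many reducers.
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    print(run_job(["the quick brown fox", "the lazy dog"]))
```

On a real cluster the grouping step is performed by Hadoop between the two phases; the sketch only mimics it with a dictionary.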
331 questions
1 vote, 4 answers

Is there a good library that helps chain MapReduce jobs using Hadoop Streaming and Python?

This question answers part of my question but not completely. How do I run a script that manages this, is it from my local filesystem? Where exactly do things like MrJob or Dumbo come into picture? Are there any more alternatives? I am trying to run…
1 vote, 2 answers

MRJob MR assign to Dictionary instead of Yield?

I'm new to MRJob and MR and I was wondering in the traditional word count python example for MRJob MR: from mrjob.job import MRJob class MRWordCounter(MRJob): def mapper(self, key, line): for word in line.split(): yield…
Michael
1 vote, 1 answer

Why can't I index into the values list of reduce?

I am using in-mapper combining in a Map Reduce job via the Python mrjob module. Because I wrote a mapper_final function that emits a single pair, I am sure that only a single key-value pair is emitted to my reducers. However, my reduce function is…
dangerChihuahua007
1 vote, 2 answers

How do all the reducers come up with a single answer?

I am beginning to learn MapReduce with the mrjob python package. mrjob documentation lists the following snippet as an example MapReduce script. """The classic MapReduce job: count the frequency of words. """ from mrjob.job import MRJob import…
dangerChihuahua007
1 vote, 3 answers

Write some data (lines) from my mappers to separate directories depending on some logic in my mapper code

I am using mrjob for my EMR needs. How do I write some data (lines) from my mappers to "separate directories" depending on some logic in my mapper code that I can: tar gzip and upload to separate S3 buckets (depending on the directory name) after…
newToFlume
1 vote, 2 answers

Is there a way to determine the filename passed to a map job in Hadoop/Dumbo/Mrjob?

All, I am working on creating an interface for dealing with some massive data and generating arff files for doing some machine learning stuff with. I can currently collect the features- but I have no way of associating them with the files they were…
sampwing
0 votes, 1 answer

How to receive a list of dictionaries as an argument for a MRJob job?

I understand how to programmatically receive the output, as well as how to run a MRJob job. This is clearly explained here. However I'm struggling to understand how to pass a list of dictionaries or any variables from another file into a MrJob job.…
Kayer
0 votes, 0 answers

splitting comma separated data in python

SOLVED: solution at the end of the question. I'm writing MapReduce code using mrjob in Python and I have a CSV dataset. Following are a few rows from the dataset. Column headings: Year Length Title Genre Actor Actress Director …
hadi khan
0 votes, 0 answers

How do I sort the output of this MapReduce MRJob task

I have trouble sorting the output of this map reduce task. It has to be sorted in the order of words then years. I have tried the following code but it does not return sorted output. from mrjob.job import MRJob class Job(MRJob): def…
Grit 1000
0 votes, 0 answers

MRJob program not showing any output

I have implemented a Python program using mrjob to capture network packets and then plot a graph. from mrjob.job import MRJob import socket import struct import sys import time import matplotlib.pyplot as plt import pyshark class…
0 votes, 0 answers

mrjob configure_args() error: unrecognized arguments

I can't figure out what the error is in my case when creating an argument via add_file_arg() for mrjob. I'm trying to pass names from csv to my mapper and find attributes for each name in the mapper. This is my code so far: from mrjob.job import…
Berenika
0 votes, 0 answers

Does backtrader or backtesting.py work with mapreduce and/or mrjob?

Would it be possible to backtest using either backtesting.py or backtrader doing mapreduce with the mrjob library or another? Unsure if backtrader or backtesting.py works with mapreduce/mrjob or if we will have to write some extra code to use…
Andy
0 votes, 1 answer

Conversion from String to Integer is not working while using MRJob

I'm writing a simple program which uses the mrjob library to map and reduce rows from a csv file. One of the columns from a row is a yearID. This column is by default read in as a Str. I need to convert it to an Int so that I can compare it. For…
0 votes, 0 answers

TypeError: cannot unpack non-iterable float object - MapReduce - mrjob

I'm testing a simple example to learn about MapReduce and mrjob. The goal is to sum up the logarithm of all the numbers and divide the count of all numbers by this summation. The code is pretty easy and straightforward: # mrMedian.py from mrjob.job…
Shahriar.M
0 votes, 1 answer

Run Python mrjob on a Hadoop cluster running in Kubernetes

I'm exploring the Python package mrjob to run MapReduce jobs in Python. I've tried running it in the local environment and it works perfectly. I have Hadoop 3.3 running on a Kubernetes (GKE) cluster. So I also managed to run mrjob successfully in the…
Thisara Watawana