I am trying to run a simple MapReduce program on HDInsight through Azure. My program is written in Python and simply counts how many rows of numbers (time series) meet certain criteria. The final result is just a count for each category. My code is shown below.
from mrjob.job import MRJob
import numpy as np
import time

class MRTimeSeriesFrequencyCount(MRJob):

    def mapper(self, _, line):
        # each input line is one comma-separated time series
        series = [float(i) for i in line.split(',')]
        diff = list(np.diff(series))
        avg = sum(diff) / len(diff)
        std = np.std(diff)
        # slope of a linear fit over the whole series
        fit = np.polyfit(list(range(len(series))), series, deg=1)
        # emit a 0/1 flag per category; the reducer sums them into counts
        yield "Down", 1 if (series[len(series) - 1] - series[0]) < 0 else 0
        yield "Up", 1 if (series[len(series) - 1] - series[0]) > 0 else 0
        yield "Reverse", 1 if (fit[0] * (series[len(series) - 1] - series[0])) < 0 else 0
        yield "Volatile", 1 if std / avg > 0.33 else 0

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    start_time = time.time()
    MRTimeSeriesFrequencyCount.run()
    print("--- %s seconds ---" % (time.time() - start_time))
I am new to MapReduce and Hadoop. When I scale up the number of rows, which are stored in a CSV file, my laptop (an HP EliteBook 8570w) still performs faster than running the code on Hadoop (456 seconds vs. 628.29 seconds for 1 million rows). The cluster has 4 worker nodes with 4 cores each and 2 head nodes with 4 cores each. Shouldn't it perform faster? Is there some other bottleneck, such as reading in the data? Is mrjob running it on only one node? Thanks in advance for the help.
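To see whether the work is actually being spread across nodes, mrjob's runner exposes the Hadoop counters after a run. A minimal sketch follows, again assuming the job class is saved as mr_timeseries.py; the counter group and name strings ("Job Counters", "Launched map tasks") vary with the Hadoop version, so treat them as illustrative.

# counters_check.py -- sketch for inspecting how many map tasks Hadoop launched
from mr_timeseries import MRTimeSeriesFrequencyCount  # illustrative module name

if __name__ == '__main__':
    job = MRTimeSeriesFrequencyCount(
        args=['-r', 'hadoop', 'hdfs:///example/data/timeseries.csv'])
    with job.make_runner() as runner:
        runner.run()
        # runner.counters() returns one dict of counter groups per step
        for step_num, counters in enumerate(runner.counters()):
            job_counters = counters.get('Job Counters', {})
            print('step %d launched map tasks: %s'
                  % (step_num, job_counters.get('Launched map tasks')))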