
I am trying to run a simple mapreduce program on HDInsight through Azure. My program is written in Python and simply counts how many rows of numbers (time series) meet certain criteria. The final result is just a count for each category. My code is shown below.

from mrjob.job import MRJob
import numpy as np
import time

class MRTimeSeriesFrequencyCount(MRJob):

    def mapper(self, _, line):
        # Each input line is one comma-separated time series.
        series = [float(i) for i in line.split(',')]
        diff = list(np.diff(series))
        avg = sum(diff) / len(diff)
        std = np.std(diff)
        # Linear fit over the series; fit[0] is the slope.
        fit = np.polyfit(list(range(len(series))), series, deg=1)
        # Emit a 0/1 flag per category; the reducer sums the flags.
        yield "Down", 1 if (series[-1] - series[0]) < 0 else 0
        yield "Up", 1 if (series[-1] - series[0]) > 0 else 0
        yield "Reverse", 1 if (fit[0] * (series[-1] - series[0])) < 0 else 0
        yield "Volatile", 1 if std / avg > 0.33 else 0

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    start_time = time.time()
    MRTimeSeriesFrequencyCount.run()
    print("--- %s seconds ---" % (time.time() - start_time))

I am new to mapreduce and Hadoop. When I scale up the number of rows, which are stored in a CSV, my laptop (an HP EliteBook 8570w) still performs faster than running the code on Hadoop (456 seconds vs. 628.29 seconds for 1 million rows). The cluster has 4 worker nodes with 4 cores each and 2 head nodes with 4 cores each. Shouldn't it be faster? Is there some other bottleneck, such as reading in the data? Is mrjob running it on only one node? Thanks in advance for the help.

klib

1 Answer


As far as I know, Hadoop needs some time to prepare the startup of an M/R job and to stage the data on HDFS. So for a small data set, you can't get faster performance on a Hadoop cluster than on a single local machine.

You have 1 million rows of data. Assuming each row is about 1 KB, the whole data set is roughly 1 GB. That is a small data set for Hadoop, so the time saved by parallel processing is not enough to make up for the startup latency before the job actually runs on Hadoop.
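A quick back-of-the-envelope check of that estimate (the 1 KB-per-row figure is an assumption, not something measured from the actual CSV):

rows = 1000000
bytes_per_row = 1024                    # assumed average row size: 1 KB
total_gb = rows * bytes_per_row / 1024.0 ** 3
print(total_gb)                         # ~0.95, i.e. roughly 1 GB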

For reference, there is an SO thread (Why submitting job to mapreduce takes so much time in General?) whose accepted answer explains the latency behind your issue.
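As an aside on the "is mrjob running it on only one node?" part of the question: mrjob's default runner is inline, which executes the whole job in a single local Python process, so nothing reaches the cluster unless you pass -r hadoop. A minimal sketch of launching the job programmatically (the module name and HDFS path below are illustrative, not from the original post):

# a minimal sketch; mr_timeseries and the HDFS path are illustrative names
from mr_timeseries import MRTimeSeriesFrequencyCount

# mrjob defaults to the inline runner (one local Python process);
# '-r hadoop' submits the job to the Hadoop cluster instead
job = MRTimeSeriesFrequencyCount(args=['-r', 'hadoop', 'hdfs:///example/input.csv'])
with job.make_runner() as runner:
    runner.run()

The same choice can be made from the command line, e.g. python mr_timeseries.py -r hadoop hdfs:///example/input.csv versus plain python mr_timeseries.py for a local single-process run.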

Peter Pan