I am trying to run a simple MapReduce program on HDInsight through Azure. My program is written in Python and simply counts how many rows of numbers (time series) meet certain criteria. The final result is just a count for each category. My code is shown below.
from mrjob.job import MRJob
import numpy as np
import time

class MRTimeSeriesFrequencyCount(MRJob):

    def mapper(self, _, line):
        # each input line is one comma-separated time series
        series = [float(i) for i in line.split(',')]
        diff = list(np.diff(series))
        avg = sum(diff) / len(diff)
        std = np.std(diff)
        # slope of a linear fit over the whole series
        fit = np.polyfit(list(range(len(series))), series, deg=1)
        # emit a 0/1 flag per category; the reducer sums them into counts
        yield "Down", 1 if (series[len(series) - 1] - series[0]) < 0 else 0
        yield "Up", 1 if (series[len(series) - 1] - series[0]) > 0 else 0
        yield "Reverse", 1 if (fit[0] * (series[len(series) - 1] - series[0])) < 0 else 0
        yield "Volatile", 1 if std / avg > 0.33 else 0

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    start_time = time.time()
    MRTimeSeriesFrequencyCount.run()
    print("--- %s seconds ---" % (time.time() - start_time))
I am new to MapReduce and Hadoop. When I scale up the number of rows, which are stored in a CSV file, my laptop (an HP EliteBook 8570w) still performs faster than running the code on Hadoop (456 seconds vs. 628.29 seconds for 1 million rows). The cluster has 4 worker nodes with 4 cores each and 2 head nodes with 4 cores each. Shouldn't it perform faster? Is there some other bottleneck, such as reading in the data? Is mrjob running it on only one node? Thanks in advance for the help.
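To see whether the work is actually being spread across nodes, mrjob's runner exposes the Hadoop counters after a run. A minimal sketch follows, again assuming the job class is saved as mr_timeseries.py; the counter group and name strings ("Job Counters", "Launched map tasks") vary with the Hadoop version, so treat them as illustrative.

# counters_check.py -- sketch for inspecting how many map tasks Hadoop launched
from mr_timeseries import MRTimeSeriesFrequencyCount  # illustrative module name

if __name__ == '__main__':
    job = MRTimeSeriesFrequencyCount(
        args=['-r', 'hadoop', 'hdfs:///example/data/timeseries.csv'])
    with job.make_runner() as runner:
        runner.run()
        # runner.counters() returns one dict of counter groups per step
        for step_num, counters in enumerate(runner.counters()):
            job_counters = counters.get('Job Counters', {})
            print('step %d launched map tasks: %s'
                  % (step_num, job_counters.get('Launched map tasks')))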