I am trying to use the Million Song Dataset available on AWS to find the correlation between the loudness of a track and its popularity. I followed a basic tutorial (http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/) to get the data for each track, and have built my project using MRJob and Python. Now I am stuck on how to compute the correlation between these two variables across all tracks using a mapper and a reducer. This is my code so far:
from mrjob.job import MRJob
import track

YIELD_ALL = True

class MRDensity(MRJob):

    def mapper(self, _, line):
        t = track.load_track(line)
        if t:
            if t['tempo'] > 0:
                loudness = t['loudness']
                #print loudness
                hotness = t['song_hotttnesss']
                xy = loudness * hotness
                x2 = loudness * loudness
                y2 = hotness * hotness
                counter = counter + 1
                yield (counter, (loudness, hotness, xy, x2, y2))

    def reducer(self, key, val):
        sumx2 = 0
        sumy2 = 0
        sumxy = 0
        sumh = 0
        suml = 0
        for l, h, xy, x2, y2 in val:
            suml = suml + l
            sumh += h
            sumxy += xy
            sumx2 += x2
            sumy2 += y2
        yield key, suml

if __name__ == '__main__':
    MRDensity.run()
This code is not working as intended. It yields one loudness value per line instead of a single aggregated result:
1 -10.142
1 -10.212
1 -11.137
1 -11.197
1 -13.496
1 -15.568
1 -15.607
1 -17.302
1 -22.262
1 -3.383
1 -3.809
1 -5.816
1 -5.902
1 -6.671
1 -7.24
1 -7.591
1 -8.729
1 -9.689
1 -9.738
1 -9.863
I need help writing the rest of the code to calculate the correlation between the loudness and hotness variables for the MSD dataset. Thanks!
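For reference, here is a minimal sketch of how the sums accumulated in the reducer could be combined into Pearson's correlation coefficient. It is plain Python rather than an MRJob job, so it can be run locally. The function names (`map_track`, `combine`, `pearson_from_sums`, `correlation`), the constant key `"corr"`, and the toy input pairs are all hypothetical and not part of MRJob or the dataset. It assumes every record is emitted under the same key so that one reducer sees all partial terms, that a count of 1 is emitted with each record, and that tracks with missing values have already been filtered out.

```python
import math

def map_track(loudness, hotness):
    """Mapper step: emit the per-track terms under one constant key
    (hypothetical key "corr") so that a single reducer receives them all."""
    return ("corr", (1, loudness, hotness,
                     loudness * hotness,
                     loudness * loudness,
                     hotness * hotness))

def combine(values):
    """Reducer step, part 1: element-wise sum of
    (n, sum_x, sum_y, sum_xy, sum_x2, sum_y2) tuples."""
    totals = [0, 0.0, 0.0, 0.0, 0.0, 0.0]
    for value in values:
        for i, term in enumerate(value):
            totals[i] += term
    return tuple(totals)

def pearson_from_sums(n, sx, sy, sxy, sx2, sy2):
    """Reducer step, part 2: Pearson's r from the accumulated sums,
    r = (n*Sxy - Sx*Sy) / sqrt((n*Sx2 - Sx^2) * (n*Sy2 - Sy^2))."""
    var_x = n * sx2 - sx * sx
    var_y = n * sy2 - sy * sy
    if var_x <= 0 or var_y <= 0:
        return None  # undefined when either variable has no variance
    return (n * sxy - sx * sy) / math.sqrt(var_x * var_y)

def correlation(pairs):
    """Run the mapper over (loudness, hotness) pairs and reduce the results."""
    mapped = [map_track(x, y)[1] for x, y in pairs]
    return pearson_from_sums(*combine(mapped))

if __name__ == "__main__":
    # Toy values for illustration only, not taken from the dataset.
    print(correlation([(-10.0, 0.2), (-8.0, 0.4), (-6.0, 0.6)]))
```

In an MRJob job, the mapper would `yield` the key and tuple instead of returning them, and the reducer would apply the two reducer steps to the values it receives for that key. Because the tuples add element-wise, the same summation could presumably also serve as a combiner to reduce the data sent to the reducer.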