How to calculate correlation between two variables in Python using MapReduce

Question

I am trying to use the Million Song Dataset available on AWS to find the correlation between the loudness of a track and its popularity. I followed a basic tutorial (http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/) to get the data for each track, and have built my project using MRJob and Python. Now I am lost on how to find the correlation between the two variables using a mapper and a reducer. This is my code so far:

from mrjob.job import MRJob
import track
YIELD_ALL = True

class MRDensity(MRJob):

    def mapper(self, _, line):
    t = track.load_track(line)
    if t:
        if t['tempo'] > 0:

           loudness = t['loudness']
            #print loudness
           hotness = t['song_hotttnesss']
           xy = loudness * hotness
           x2 = loudness * loudness
           y2 = hotness * hotness
           counter = counter + 1
           yield (counter, (loudness, hotness, xy,x2,y2))

def reducer(self, key, val):
    sumx2 = 0
    sumy2 = 0
    sumxy = 0
    sumh = 0
    suml = 0

    for l, h, xy, x2, y2 in val:
        suml = suml + l
        sumh += h
        sumxy += xy
        sumx2 += x2
        sumy2 += y2
        yield key, suml

if __name__ == '__main__':
    MRDensity.run()

This code is not really working, since it's yielding this:

1   -10.142
1   -10.212
1   -11.137
1   -11.197
1   -13.496
1   -15.568
1   -15.607
1   -17.302
1   -22.262
1   -3.383
1   -3.809
1   -5.816
1   -5.902
1   -6.671
1   -7.24
1   -7.591
1   -8.729
1   -9.689
1   -9.738
1   -9.863

I need help with writing the rest of the code to calculate the correlation between the loudness and hotness variables for the MSD dataset. Thanks!

score 1 · Answer 1 · answered Mar 21 '13 at 00:17

You're actually pretty close. But first, the indentation of your code sample is totally wrong, which makes it harder to help you. Second, you didn't explain what it is about the output that you think is wrong.

From your code, I'm assuming you're trying to compute a linear regression for hotness vs. loudness.

To do that, you want to sum a number of values over all the tracks in the database. So forget the counter variable in your mapper: you want to output one record at the end, so your mapper and reducer should both emit a single, constant key. Just use True or something. (Additionally, a variable like that won't work if you run this code on Elastic MapReduce or even in multiple local processes, because each process gets its own copy.)

Then in your reducer, you should be doing yield key, (suml, sumh, sumxy, sumx2, sumy2) once, after the loop has finished.
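A minimal sketch of that mapper/reducer shape, written as plain generator functions so it runs without mrjob or the tutorial's track module. The dicts passed to mapper stand in for whatever track.load_track(line) returns (an assumption), and the leading count n is an addition that is not in the tuple above: a correlation needs the number of records as well as the sums.

```python
def mapper(_, t):
    # `t` stands in for the dict returned by track.load_track(line) (assumed shape).
    if t and t["tempo"] > 0:
        x = t["loudness"]
        y = t["song_hotttnesss"]
        # Constant key: every record is routed to the same reducer call.
        yield True, (1, x, y, x * y, x * x, y * y)


def reducer(key, values):
    n = 0
    suml = sumh = sumxy = sumx2 = sumy2 = 0.0
    for count, l, h, xy, x2, y2 in values:
        n += count
        suml += l
        sumh += h
        sumxy += xy
        sumx2 += x2
        sumy2 += y2
    # Yield once, after the loop, so the job emits a single record.
    yield key, (n, suml, sumh, sumxy, sumx2, sumy2)


# Local dry run on made-up records (not real dataset values).
sample = [
    {"tempo": 120, "loudness": -10.0, "song_hotttnesss": 0.5},
    {"tempo": 0, "loudness": -3.0, "song_hotttnesss": 0.9},  # skipped: tempo is 0
    {"tempo": 90, "loudness": -5.0, "song_hotttnesss": 0.25},
]
mapped = [kv for t in sample for kv in mapper(None, t)]
result = list(reducer(True, (v for _, v in mapped)))
```

In an MRJob subclass these would presumably become methods that take self as the first argument, with the mapper calling track.load_track(line) before doing anything else.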

The final output from your MapReduce job will be a single line, something like this:

true    [-205.354, NaN, NaN, 2530.9249500000005, NaN]

Oops, NaNs aren't good. That happens because some of the tracks in the Million Song Dataset do not have a valid hotness. So you will need to use math.isnan in your mapper and only yield a record if the hotness is valid.
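One way that check might look, as a sketch. has_valid_hotness is a hypothetical helper name, and the None guard is a defensive assumption about what track.load_track can return; math.isnan itself is in the standard library.

```python
import math


def has_valid_hotness(t):
    """Return True if the track dict carries a usable hotness value."""
    h = t.get("song_hotttnesss")
    # math.isnan(None) would raise TypeError, so rule out None first.
    return h is not None and not math.isnan(h)


def mapper(_, t):
    # `t` stands in for the dict returned by track.load_track(line) (assumed shape).
    if t and t["tempo"] > 0 and has_valid_hotness(t):
        x, y = t["loudness"], t["song_hotttnesss"]
        yield True, (x, y, x * y, x * x, y * y)


# A single NaN poisons every sum it touches, which is why the record
# has to be dropped before it reaches the reducer.
poisoned = 0.5 + float("nan")
```

Filtering in the mapper rather than the reducer also means the invalid records are never shuffled across the network.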

OK, now you'll get a final output like this:

true    [-50.804, 2.072952243828, -20.793643182685596, 538.98803, 0.9498767028116709]

You can use those values to compute the linear regression (for example, see the code at http://code.activestate.com/recipes/578129-simple-linear-regression/).
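For the correlation the question asks about, the same sums plug into the Pearson formula. A sketch, assuming the record count n is also available: it is not part of the five-value output above, so the reducer would need to accumulate it too. correlation_and_fit is a hypothetical name, with x as loudness and y as hotness.

```python
import math


def correlation_and_fit(n, sumx, sumy, sumxy, sumx2, sumy2):
    """Return (r, slope, intercept) for y against x from running sums."""
    cov = n * sumxy - sumx * sumy   # n^2 times the covariance of x and y
    varx = n * sumx2 - sumx * sumx  # n^2 times the variance of x
    vary = n * sumy2 - sumy * sumy  # n^2 times the variance of y
    if varx <= 0 or vary <= 0:
        raise ValueError("need at least two distinct x values and y values")
    r = cov / math.sqrt(varx * vary)      # Pearson correlation coefficient
    slope = cov / varx                    # least-squares slope of y on x
    intercept = (sumy - slope * sumx) / n
    return r, slope, intercept


# Toy check on x = [1, 2, 3], y = [2, 4, 6], which lie exactly on y = 2x.
r, slope, intercept = correlation_and_fit(3, 6, 12, 28, 14, 56)
```

This one-pass form is convenient for MapReduce because the sums from different reducers or combiners can simply be added together, but it can lose precision when the values are large relative to their spread.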

score -1 · Answer 2 · answered Mar 06 '13 at 05:49


Try declaring counter at the top (globally):

from mrjob.job import MRJob
import track
YIELD_ALL = True
counter=0

And I really don't understand your logic in the reducer function.


Atanu


Because when using MRJob, your code may be running in multiple processes on multiple machines. A counter variable won't be kept in sync across different processes. – John Wiseman Apr 25 '15 at 16:30
