
I am new to NumPy. I have referred to the following SO question: Why NumPy instead of Python lists?

The final comment in the above question seems to indicate that numpy is probably slower on a particular dataset.

I am working on a 1650*1650*1650 data set. These are essentially similarity values for each movie in the MovieLens data set along with the movie id.

My options are to use either a 3D numpy array or a nested dictionary. On a reduced 100*100*100 data set, the run times were not too different.
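For context, this is roughly what I mean by the two options, shown at the reduced size (just a minimal sketch; the shape, the keys and the stored value are illustrative):

import numpy as np

n = 100  # reduced data set size used for the comparison

# Option 1: a dense 3D numpy array indexed by integer positions
sim_array = np.zeros((n, n, n), dtype=np.float64)
sim_array[0, 1, 2] = 0.87  # illustrative similarity value

# Option 2: a nested dictionary keyed by ids
sim_dict = {}
sim_dict.setdefault(1, {}).setdefault(2, {})[3] = 0.87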

Please find the IPython code snippet below:

# Build the full movie-to-movie similarity structure (sim_matrix_panel is a pandas Panel)
for id1 in range(1, count + 1):
    data1 = df[df.movie_id == id1].set_index('user_id')[cols]
    for id2 in range(1, count + 1):
        if id1 != id2:
            data2 = df[df.movie_id == id2].set_index('user_id')[cols]
            sim = calculatePearsonCorrUnified(data1, data2)
        else:
            sim = 1
        sim_matrix_panel[id1]['Sim'][id2] = sim



import pdb
from math import sqrt
import numpy as np


def calculatePearsonCorrUnified(df1, df2):
    sim_score = 0
    common_movies_or_users = []

    # Collect the ids that appear in both indexes
    for temp_id in df1.index:
        if temp_id in df2.index:
            common_movies_or_users.append(temp_id)
    # pdb.set_trace()
    n = len(common_movies_or_users)
    # print('No. of common movies: ' + str(n))
    if n == 0:
        return sim_score

    # Ratings corresponding to user_1 / movie_1, present in the common list
    rating1 = df1.loc[df1.index.isin(common_movies_or_users)]['rating'].values
    # Ratings corresponding to user_2 / movie_2, present in the common list
    rating2 = df2.loc[df2.index.isin(common_movies_or_users)]['rating'].values

    sum1 = sum(rating1)
    sum2 = sum(rating2)

    # Sum up the squares
    sum1Sq = sum(np.square(rating1))
    sum2Sq = sum(np.square(rating2))

    # Sum up the products
    pSum = sum(np.multiply(rating1, rating2))

    # Calculate the Pearson score
    num = pSum - (sum1 * sum2 / n)
    den = sqrt(float(sum1Sq - pow(sum1, 2) / n) * float(sum2Sq - pow(sum2, 2) / n))
    if den == 0:
        return 0
    sim_score = num / den

    return sim_score
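In case it makes the question clearer, this is a shorter, vectorised way that I believe computes the same Pearson score on the overlapping ratings (a sketch, assuming 'rating' is one of the columns in cols and that user ids are unique within each movie's data frame):

import numpy as np

def pearson_via_numpy(df1, df2):
    # Users present in both indexes
    common = df1.index.intersection(df2.index)
    if len(common) == 0:
        return 0
    r1 = df1.loc[common, 'rating'].values
    r2 = df2.loc[common, 'rating'].values
    # Guard against a zero denominator (constant ratings on either side)
    if np.std(r1) == 0 or np.std(r2) == 0:
        return 0
    return np.corrcoef(r1, r2)[0, 1]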

What would be the best way to accurately time the runtime with each of these options?
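This is roughly what I have in mind, using the standard timeit module (a self-contained sketch at the reduced size; the fill loops are just stand-ins for the real similarity computation):

import timeit
import numpy as np

n = 100  # reduced size used for the comparison

def fill_array():
    # Stand-in for filling the numpy-array-based structure
    sim = np.zeros((n, n), dtype=np.float64)
    for i in range(n):
        for j in range(n):
            sim[i, j] = 1.0
    return sim

def fill_dict():
    # Stand-in for filling the nested-dictionary-based structure
    sim = {}
    for i in range(n):
        sim[i] = {}
        for j in range(n):
            sim[i][j] = 1.0
    return sim

print('numpy array:', timeit.timeit(fill_array, number=10))
print('nested dict:', timeit.timeit(fill_dict, number=10))

In the notebook I could presumably wrap each variant in its own cell and use %%timeit instead, but I am not sure that is the most precise approach for a run this long.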

Any pointers would be greatly appreciated.

  • It is impossible that nested lists perform as well as numpy arrays under any conditions; your timing must be wrong. Are you referring to the last answer that is downvoted in that question? Your code obviously uses pandas, which is another layer on top of numpy; there are no nested dictionaries there. Please provide a [minimal working example](https://stackoverflow.com/help/mcve), as it is not clear what you are asking. Short answer: use numpy (or pandas). – rth Jun 09 '15 at 11:57
  • @rth I wouldn't say any conditions. There is an overhead to using numpy, which means for (very) small data sets numpy will take longer. For instance, where `l` is a list of 10 integers and `a` is an array of those same integers, then `a.sum()` is 10 times slower than `sum(l)`. – Dunes Jun 09 '15 at 12:59
  • @Dunes Yes, good point, thanks for pointing it out. I shouldn't have been that categorical in my comment. Still, that should be true for arrays with more than about 1000 elements, and even more so for multi-dimensional arrays, which is the case here. – rth Jun 09 '15 at 13:27
  • @rth - Please add your suggestion to use numpy as an answer so that I can accept it. My issue was 3D data and I ended up using a pandas Panel. IPython's %%time indicates that the code takes 144 minutes to run. Is this expected? I have updated the code in the question to essentially cover the entire code being run. – neeraj baji Jun 11 '15 at 14:03

0 Answers