
My pandas data frame looks something like this:

Movieid  review  movieRating  wordEmbeddingVector
1        "text"  4            [100-dimensional vector]

I am trying to run a doc2vec implementation. I want to group by movie id, take the sum of the vectors in wordEmbeddingVector, and calculate the cosine similarity between the summed vector and an input vector. I tried:

movie_groupby = movie_data.groupby('movie_id').agg(lambda v: cosineSimilarity(np.sum(movie_data['textvec']), inputvector))

But it seemed to run for ages, and I thought I might be doing something wrong, so I removed the similarity function and just did the group by and sum. That does not finish either (over an hour and counting). Am I doing something wrong, or is it actually just that slow? I have 135,392 rows in my data frame, so it's not massive.

movie_groupby = movie_data.groupby('movie_id').agg(lambda v : np.sum(movie_data['textvec']))

Much appreciated!

Roshini

1 Answer


There is a bug in your code. Inside your lambda function you sum across the entire dataframe instead of just the group. Note also that `agg` passes each column to the function separately as a Series, whereas `apply` passes the whole group as a dataframe, which is what you want here. This should fix things:

movie_groupby = movie_data.groupby('movie_id').apply(lambda g: np.sum(g['textvec']))

Note: I replaced hotel_data with movie_data, but that must have been just a typo.
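For completeness, here is a minimal runnable sketch of the whole pipeline, including the cosine-similarity step. The `cosine_similarity` helper, the toy 3-dimensional vectors, and `inputvector` below are stand-ins for your own 100-dimensional data:

```python
import numpy as np
import pandas as pd

def cosine_similarity(a, b):
    # Cosine similarity between two 1-D vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy data: 3-dimensional "embeddings" instead of 100-dimensional ones
movie_data = pd.DataFrame({
    'movie_id': [1, 1, 2],
    'textvec': [np.array([1.0, 0.0, 0.0]),
                np.array([0.0, 1.0, 0.0]),
                np.array([0.0, 0.0, 2.0])],
})
inputvector = np.array([1.0, 1.0, 0.0])

# Sum the embedding vectors within each movie group...
summed = movie_data.groupby('movie_id')['textvec'].apply(lambda v: np.sum(v))

# ...then compare each summed vector to the input vector
similarities = summed.apply(lambda vec: cosine_similarity(vec, inputvector))
print(similarities)
```

Selecting the single column (`['textvec']`) before the `apply` also avoids passing the unused text and rating columns through the groupby, which helps with speed.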

IanS
  • aah I understand... because v is a row, but like a mini dataframe... yes, hotel is a typo. I was trying to follow a blog from TripAdvisor! Thanks a lot :) – Roshini Jun 03 '16 at 09:51
  • Yes, the function is applied to each group, and each group is a dataframe. Once you understand that you understand `groupby` :) – IanS Jun 03 '16 at 10:11
  • yeah, I think I was more confused about what the lambda was selecting... as in what v was, whether it was the group, or just the vector, or a dataframe... But I understand now! – Roshini Jun 03 '16 at 15:44