
I have a CSV file containing user ratings for about 56,124 items (columns) by about 3,000 users (rows). Each rating is an integer less than 128. I have this function:

import numpy as np
import pandas as pd
from scipy import sparse

def sparse_to_npz(file, npz):
  print("Reading " + file + " ...")
  data_items = pd.read_csv(file)

  # Create a new dataframe without the user ids.
  data_items = data_items.drop(columns='u')

  # As a first step we normalize the user vectors to unit vectors.

  # magnitude = sqrt(x^2 + y^2 + z^2 + ...)
  magnitude = np.sqrt(np.square(data_items).sum(axis=1))

  # unitvector = (x / magnitude, y / magnitude, z / magnitude, ...)
  data_items = data_items.divide(magnitude, axis='index')
  del magnitude

  print("Saving to " + npz)
  data_sparse = sparse.csr_matrix(data_items)
  #np.save("columns", data_items.columns.values)  # if used, must run before data_items is deleted
  del data_items
  sparse.save_npz(npz, data_sparse)

which is passed two file names: the input CSV file (sparse, one row per user with ratings for all items) and the output .npz file, which should save memory. After the file is read with pandas and stored in data_items, I calculate the magnitude of each row, divide data_items by it, and finally save the result as .npz. The problem is that I get a memory error at the magnitude step, np.sqrt(np.square(...)), on a machine with 12 GB of RAM. How can I make it work?
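For scale: 3,000 × 56,124 float64 values come to about 1.3 GB, and np.square and the divide each materialize another full-size temporary, so a handful of copies plus pandas overhead can plausibly exhaust 12 GB. Below is a minimal sketch of one way to shrink those intermediates, assuming the ratings fit in int8 (they are integers below 128) and that float32 precision is acceptable; sparse_to_npz_small is an illustrative name, not part of the original code:

import numpy as np
import pandas as pd
from scipy import sparse

def sparse_to_npz_small(file, npz):
  # Skip the user-id column at parse time and read every rating as int8;
  # ratings are integers below 128, so int8 is wide enough.
  data_items = pd.read_csv(file, usecols=lambda c: c != 'u', dtype=np.int8)

  # Do the arithmetic in float32 instead of the default float64,
  # halving the size of every temporary array.
  values = data_items.to_numpy(dtype=np.float32)
  del data_items

  magnitude = np.sqrt(np.square(values).sum(axis=1))
  magnitude[magnitude == 0] = 1.0   # avoid 0/0 for users with no ratings
  values /= magnitude[:, None]      # in-place divide, no extra full copy

  sparse.save_npz(npz, sparse.csr_matrix(values))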

FindOutIslamNow
  • Load in batches and do 100 users at a time. There's a description in an answer here https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas – kevinkayaks May 25 '19 at 14:20
  • The problem is that this is used for cosine similarity, which needs ALL the data at a single time to be calculated – FindOutIslamNow May 25 '19 at 19:17
  • That comment isn't clear to me. There's no cosine similarity in your question. It appears on closer inspection you can batch compute the L2 norm on the columns (a sketch of this batched approach follows below). If you post example data with expected output you'll get some traction – kevinkayaks May 25 '19 at 20:43
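
A minimal sketch of the batched approach suggested in the comments above, assuming each user's unit vector depends only on that user's own row (it does: the magnitude sums only that row's ratings); sparse_to_npz_chunked and chunk_rows are illustrative names:

import numpy as np
import pandas as pd
from scipy import sparse

def sparse_to_npz_chunked(file, npz, chunk_rows=100):
  blocks = []
  # Normalization is row-local, so 100 users can be processed at a time
  # without ever holding the full dense matrix in memory.
  for chunk in pd.read_csv(file, chunksize=chunk_rows):
    chunk = chunk.drop(columns='u').astype(np.float32)
    magnitude = np.sqrt(np.square(chunk).sum(axis=1))
    chunk = chunk.divide(magnitude, axis='index')
    blocks.append(sparse.csr_matrix(chunk))
  sparse.save_npz(npz, sparse.vstack(blocks, format='csr'))

Any later cosine-similarity step may still need all rows at once, but the normalization itself does not.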

0 Answers