
I have a DataFrame in Spark that is a list of (user, item, rating) triples:

user item rating
ust001 ipx001   5
ust002 ipx04   2
ust001 itx001   4
ust002 iox04   5

Assume I have n users and m items; then I can construct a matrix A of size n×m.

My goal is to use this matrix to compute the item-item similarity B = A^T * A and save it as a scipy sparse matrix, B.npz.
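For intuition, on a tiny dense example (toy numbers of my own, not the real data), entry (i, j) of A^T * A is the dot product of item i's and item j's rating columns:

```python
import numpy as np

# toy rating matrix: 3 users x 2 items (hypothetical values)
A = np.array([[5, 0],
              [0, 2],
              [4, 5]])

# item-item similarity: entry (i, j) is the dot product of
# item i's and item j's rating columns
B = A.T @ A
# e.g. entry (0, 1) = 5*0 + 0*2 + 4*5 = 20
print(B)
```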

Here is what I do in Python:

import numpy as np
import pandas as pd
import pickle

df = pd.read_parquet('user_item.parquet')

# map strings to indices
user2num = {}
item2num = {}
UID = 0
IID = 0
# remap indices back to strings
num2user = {}
num2item = {}

# loop over all elements and map each string to an index
for i in range(len(df['user'])):
    if df['user'][i] not in user2num:
        user2num[df['user'][i]] = UID
        num2user[UID] = df['user'][i]
        UID += 1
        
    if df['item'][i] not in item2num:
        item2num[df['item'][i]] = IID
        num2item[IID] = df['item'][i]
        IID += 1

# save the string-index mappings
with open('num2item.pickle', 'wb') as handle:
    pickle.dump(num2item, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('item2num.pickle', 'wb') as handle:
    pickle.dump(item2num, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('num2user.pickle', 'wb') as handle:
    pickle.dump(num2user, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('user2num.pickle', 'wb') as handle:
    pickle.dump(user2num, handle, protocol=pickle.HIGHEST_PROTOCOL)



df["user"] = df["user"].map(user2num)
df["item"] = df["item"].map(item2num)

df.to_parquet('ID_user-item.parquet')
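As an aside, the hand-written mapping loop above can be replaced by `pd.factorize`, which returns integer codes plus the unique values in code order. A minimal sketch, assuming the same column names (the toy frame here stands in for the real parquet data):

```python
import pandas as pd

# toy frame standing in for the real parquet data (hypothetical values)
df = pd.DataFrame({
    "user": ["ust001", "ust002", "ust001", "ust002"],
    "item": ["ipx001", "ipx04", "itx001", "iox04"],
    "rating": [5, 2, 4, 5],
})

# factorize replaces both the string->index loop and the reverse maps
user_codes, user_uniques = pd.factorize(df["user"])
item_codes, item_uniques = pd.factorize(df["item"])

df["user"] = user_codes
df["item"] = item_codes

# rebuild the four dictionaries from the unique arrays
user2num = {u: i for i, u in enumerate(user_uniques)}
num2user = dict(enumerate(user_uniques))
item2num = {it: i for i, it in enumerate(item_uniques)}
num2item = dict(enumerate(item_uniques))
```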

Then I have another script to compute the matrix:

# another file to compute item-item similarity
import numpy as np
import pandas as pd
import pickle
from scipy.sparse import csr_matrix
from scipy import sparse
import scipy

df = pd.read_parquet('ID_user-item.parquet')

with open('num2item.pickle', 'rb') as handle:
    item_id = pickle.load(handle)

with open('num2user.pickle', 'rb') as handle:
    user_id = pickle.load(handle)


row = df['user'].values
col = df['item'].values
data = df['rating'].values

A = csr_matrix((data,(row, col)), shape=(len(user_id), len(item_id)))
B = csr_matrix((data,(col, row)), shape=(len(item_id), len(user_id)))
C = B.dot(A)

scipy.sparse.save_npz('item-item.npz', C)
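Note that constructing B with a second `csr_matrix` call is the same as transposing A, so the product equals `A.T @ A`. A small check on toy triplets (hypothetical values, not the real data):

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy (user, item, rating) triplets: 2 users, 3 items (hypothetical values)
row = np.array([0, 0, 1])
col = np.array([0, 2, 1])
data = np.array([5, 4, 2])

A = csr_matrix((data, (row, col)), shape=(2, 3))
B = csr_matrix((data, (col, row)), shape=(3, 2))  # same entries as A.T
C = B.dot(A)

# the two constructions agree entry-for-entry
assert (C != A.T @ A).nnz == 0
print(C.toarray())
```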
# based on num2item, I can remap the indices back to strings to retrieve the item-item similarity.

The above is okay for a small dataset. However, if I have 500 GB of user-item-rating data, Python always runs out of memory.

My question is:

How can I obtain this item-item.npz using Spark, following the same logic?

asked by jason
  • https://georgheiler.com/2021/08/06/scalable-sparse-matrix-multiplication/ perhaps? – Georg Heiler Jun 19 '22 at 04:15
  • Does this answer your question? [Matrix Multiplication in Apache Spark](https://stackoverflow.com/questions/33558755/matrix-multiplication-in-apache-spark) – Matt Andruff Jun 20 '22 at 12:41
  • Thanks, I think it is close. For the first step, I need to convert the dataframe to a matrix, and then do the matrix multiplication. I am still thinking about how to do the first step. – jason Jun 20 '22 at 17:09

0 Answers