I have a dataframe in Spark that is a list of (user, item, rating):
user item rating
ust001 ipx001 5
ust002 ipx04 2
ust001 itx001 4
ust002 iox04 5
Assume I have n users and m items, so I can construct a matrix A of size n x m.
My goal is to use this matrix to compute item-item similarity B = A^T * A, and save it as a scipy sparse matrix B.npz.
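As a tiny sanity check of that formula (using just the four ratings above; this is an illustrative dense version, not the real pipeline):

```python
import numpy as np

# toy 2 x 4 matrix A from the table above
# rows: ust001, ust002; columns: ipx001, ipx04, itx001, iox04
A = np.array([[5, 0, 4, 0],
              [0, 2, 0, 5]])

# B[i, j] is the dot product of item columns i and j over all users
B = A.T @ A
print(B.shape)  # (4, 4)
```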
Here is what I do in Python:
import pandas as pd
import pickle

df = pd.read_parquet('user_item.parquet')

# mapping string to index
user2num = {}
item2num = {}
UID = 0
IID = 0
# remapping index back to string
num2user = {}
num2item = {}
# loop over all elements and map each string to an index
for i in range(len(df['user'])):
    if df['user'][i] not in user2num:
        user2num[df['user'][i]] = UID
        num2user[UID] = df['user'][i]
        UID += 1
    if df['item'][i] not in item2num:
        item2num[df['item'][i]] = IID
        num2item[IID] = df['item'][i]
        IID += 1

# save the string-index mappings
with open('num2item.pickle', 'wb') as handle:
    pickle.dump(num2item, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('item2num.pickle', 'wb') as handle:
    pickle.dump(item2num, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('num2user.pickle', 'wb') as handle:
    pickle.dump(num2user, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('user2num.pickle', 'wb') as handle:
    pickle.dump(user2num, handle, protocol=pickle.HIGHEST_PROTOCOL)

df["user"] = df["user"].map(user2num)
df["item"] = df["item"].map(item2num)
df.to_parquet('ID_user-item.parquet')
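As a side note, the string-to-index loop can also be expressed with `pd.factorize`, which assigns integer codes in order of first appearance; a minimal sketch on the toy rows above:

```python
import pandas as pd

# same four rows as the sample table
df = pd.DataFrame({'user': ['ust001', 'ust002', 'ust001', 'ust002'],
                   'item': ['ipx001', 'ipx04', 'itx001', 'iox04'],
                   'rating': [5, 2, 4, 5]})

# factorize returns (integer codes, unique values in order of appearance)
codes, uniques = pd.factorize(df['user'])
df['user'] = codes
num2user = dict(enumerate(uniques))          # index -> string
user2num = {v: k for k, v in num2user.items()}  # string -> index
print(df['user'].tolist())  # [0, 1, 0, 1]
```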
Then I have another script to compute the matrix:
# another file to compute item-item similarity
import pandas as pd
import pickle
from scipy.sparse import csr_matrix
from scipy import sparse

df = pd.read_parquet('ID_user-item.parquet')
with open('num2item.pickle', 'rb') as handle:
    item_id = pickle.load(handle)
with open('num2user.pickle', 'rb') as handle:
    user_id = pickle.load(handle)

row = df['user'].values
col = df['item'].values
data = df['rating'].values
A = csr_matrix((data, (row, col)), shape=(len(user_id), len(item_id)))  # n x m
B = csr_matrix((data, (col, row)), shape=(len(item_id), len(user_id)))  # A^T, m x n
C = B.dot(A)  # item-item similarity, m x m
sparse.save_npz('item-item.npz', C)
# Based on num2item, I can remap the indices back to strings to retrieve the item-item similarity.
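The retrieval step mentioned in that comment can be sketched like this (the mapping and matrix are rebuilt in memory here for illustration, instead of being loaded from the pickle/npz files):

```python
import numpy as np
from scipy.sparse import csr_matrix

# stand-ins for the loaded item2num mapping and similarity matrix C
item2num = {'ipx001': 0, 'ipx04': 1, 'itx001': 2, 'iox04': 3}
A = csr_matrix(np.array([[5, 0, 4, 0],
                         [0, 2, 0, 5]]))
C = (A.T @ A).tocsr()  # item-item similarity

def similarity(item_a, item_b):
    # map item strings to matrix indices and read the similarity entry
    return C[item2num[item_a], item2num[item_b]]

print(similarity('ipx001', 'itx001'))  # 5*4 = 20, both rated by ust001
```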
The above is okay for a small dataset. However, if I have 500 GB of user-item-rating data, Python always runs out of memory.
My question is: how can I obtain this item-item.npz by using Spark, using the same logic?