-1

My dataframe urm has a shape of (96438, 3):

   user_id  anime_id  user_rating
0        1        20     7.808497
1        3        20     8.000000
2        5        20     6.000000
3        6        20     7.808497
4       10        20     7.808497

I'm trying to build an item-user-rating matrix:

X = urm[["user_id", "anime_id"]].values
y = urm["user_rating"].values
n_u = len(urm["user_id"].unique())
n_m = len(urm["anime_id"].unique())

R = np.zeros((n_u, n_m))
for idx, row in enumerate(X):
    R[row[0]-1, row[1]-1] = y[idx]

If the code succeeded, the matrix would look like this (I filled NaN with 0):

Matrix of item_rating-user

with user_id as the index, anime_id as the columns, and the rating as the value (I got this matrix from pivot_table).

This approach works in some tutorials, but here I get an error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-278-0e06bd0f3133> in <module>()
     15 R = np.zeros((n_u, n_m))
     16 for idx, row in enumerate(X):
---> 17     R[row[0]-1, row[1]-1] = y[idx]

IndexError: index 5276 is out of bounds for axis 1 with size 5143
user9176398
  • Please provide a [mcve]. In this case, the error does not match your data. In addition, show us what you *expect* from the output of your logic. – jpp Jul 04 '18 at 11:01

2 Answers

2

I tried the second suggestion of dennlinger and it worked for me. This was the code I wrote:

import numpy as np
import pandas as pd

def id_to_index(df):
    """
    Map user and item IDs to the lowest consecutive integers, starting at 0.
    :param df: pandas DataFrame with columns user, item, rating
    :return: the same DataFrame with the extra columns index_item and index_user
    """
    # One row per distinct ID, in order of first appearance
    df_item_index = pd.DataFrame(df["item"].unique(), columns=["item"])
    df_item_index["new_index"] = np.arange(len(df_item_index))
    df_user_index = pd.DataFrame(df["user"].unique(), columns=["user"])
    df_user_index["new_index"] = np.arange(len(df_user_index))

    # Look up the new index for every row (this modifies df in place)
    df["index_item"] = df["item"].map(df_item_index.set_index("item")["new_index"]).fillna(0)
    df["index_user"] = df["user"].map(df_user_index.set_index("user")["new_index"]).fillna(0)

    return df
1

I am assuming you have non-consecutive user IDs (or movie IDs), which means that there exist indices that either have

  • no rating, or
  • no movie

In your case, you are setting up your matrix dimensions on the assumption that the IDs are consecutive (since you size each dimension by the number of unique values). Any ID larger than that count then points outside the matrix.
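To make the mismatch concrete, here is a minimal sketch. The three-row `toy` frame is made up for illustration and is not your data; it only assumes the same column names as urm.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a gap in anime_id (not the real urm)
toy = pd.DataFrame({
    "user_id": [1, 3, 5],
    "anime_id": [20, 20, 5276],
    "user_rating": [7.8, 8.0, 6.0],
})

n_m = toy["anime_id"].nunique()   # number of distinct anime IDs
max_id = toy["anime_id"].max()    # largest ID that will be used as an index

# Matrix sized by unique counts, as in the question
R = np.zeros((toy["user_id"].nunique(), n_m))

try:
    R[0, max_id - 1] = 6.0        # same "id - 1" indexing as in the question
    failed = False
except IndexError:
    failed = True                 # the column index is far beyond n_m
```

The unique count is much smaller than the largest ID here, so the assignment fails in the same way as in your traceback.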

In that case, you have two options:

  • You can define your matrix to be of size urm["user_id"].max() by urm["anime_id"].max()
  • Create a dictionary that maps your values to the lowest consecutive values.
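A rough sketch of the first option, using a small made-up frame in place of urm (the `toy` data is an assumption, as is the fact that IDs start at 1, which is what the `- 1` in your loop implies):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for urm, with gaps in both ID columns
toy = pd.DataFrame({
    "user_id": [1, 3, 5],
    "anime_id": [2, 2, 7],
    "user_rating": [7.5, 8.0, 6.0],
})

# Size the matrix by the largest ID instead of the number of unique IDs
n_u = toy["user_id"].max()
n_m = toy["anime_id"].max()
R = np.zeros((n_u, n_m))

# Vectorised version of the loop: IDs are 1-based, matrix indices are 0-based
rows = toy["user_id"].to_numpy() - 1
cols = toy["anime_id"].to_numpy() - 1
R[rows, cols] = toy["user_rating"].to_numpy()
```

Rows and columns for IDs that never occur simply stay at zero.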

The disadvantage of the first approach is obviously that it requires you to store a bigger matrix. To avoid that, you can use scipy.sparse to build the matrix directly from the data format you already have (commonly referred to as the coordinate, or COO, matrix format).
Potentially, you can do something like this:

from scipy import sparse
# coo_matrix expects the data as (values, (row_indices, column_indices))
mat = sparse.coo_matrix((urm["user_rating"], (urm["user_id"], urm["anime_id"])))
# if you want it as a dense matrix
dense_mat = mat.todense()

You can then also work your way to the second suggestion, as I have previously asked here
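For the second suggestion, one possible sketch uses `pd.factorize`, which returns consecutive integer codes together with the original values. This is a shortcut I am swapping in for an explicit dictionary, and the `toy` frame is again made-up data, not yours:

```python
import pandas as pd
from scipy import sparse

# Hypothetical stand-in for urm with non-consecutive IDs
toy = pd.DataFrame({
    "user_id": [1, 3, 5],
    "anime_id": [20, 20, 5276],
    "user_rating": [7.5, 8.0, 6.0],
})

# Codes are 0..n-1 in order of first appearance; the second value
# holds the original IDs, so codes can be translated back later
user_idx, user_ids = pd.factorize(toy["user_id"])
item_idx, item_ids = pd.factorize(toy["anime_id"])

# Compact sparse matrix: one row per distinct user, one column per distinct anime
mat = sparse.coo_matrix(
    (toy["user_rating"].to_numpy(), (user_idx, item_idx)),
    shape=(len(user_ids), len(item_ids)),
)
dense = mat.toarray()
```

Here `item_ids[j]` gives back the anime_id that column `j` stands for, so nothing is lost by compacting the indices.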

dennlinger