Using NumPy to convert user/item ratings into 2-D array

Question

I am performing some classification using user/item/rating data. My issue is how to convert these 3 columns into a matrix with users as rows, items as columns, and the ratings populating the cells.

User  Item  ItemRating
1     23    3
2     204   4
1     492   2
3     23    4

and so on. I tried using DataFrame but was getting NULL errors.

How is it stored now? Is that a text file, or some kind of numpy or pandas object? — askewchan, Nov 20 '13 at 19:02
And i can only use NumPy to perform this operation reading from a text file. There is no header information in the text file. — user2822055, Nov 20 '13 at 20:06
You can load it then using `arr = np.genfromtxt(filename, dtype=int)`. If the first row does say `"User Item ItemRating"`, then use `arr = np.genfromtxt(filename, skip_header=1, dtype=int)` — askewchan, Nov 20 '13 at 20:07
If you want to build the entire thing using numpy only (no pandas dependency) see: http://stackoverflow.com/q/17028329/1730674 — askewchan, Nov 20 '13 at 20:08
great! thanks again for the knowledge. very much appreciated. — user2822055, Nov 20 '13 at 20:48
Hi. I am getting an error when I use this as my document read command: `arr = np.genfromtxt("u_clean.data", skip_header=1)`. The error is `ValueError: Some errors were detected ! Line #3 (got 3 columns instead of 1)`, etc., for every odd-numbered line. — user2822055, Nov 21 '13 at 03:00
Your file must have the same number of values per line (to fill a uniform array). It seems like one of your lines has only `1` value in it, but then a later line (#3) has three values in it. — askewchan, Nov 21 '13 at 03:32
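
The column-count error discussed in the comments above can be tracked down before calling `np.genfromtxt`. The following is a minimal sketch, not part of the original thread: the helper name `find_ragged_lines` and the sample text are hypothetical, and it assumes whitespace-separated fields with three values expected per line.

```python
from io import StringIO


def find_ragged_lines(handle, expected=3):
    """Return (line_number, field_count) for non-blank lines whose
    whitespace-separated field count differs from `expected`."""
    bad = []
    for lineno, line in enumerate(handle, start=1):
        fields = line.split()
        # Blank lines are skipped; only malformed data lines are reported.
        if fields and len(fields) != expected:
            bad.append((lineno, len(fields)))
    return bad


# Hypothetical sample: the second line has only one value.
sample = StringIO("1 23 3\n2\n1 492 2\n")
print(find_ragged_lines(sample))  # [(2, 1)]
```

With a real file, the same helper could be called as `find_ragged_lines(open("u_clean.data"))` to list the offending lines before loading.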

score 19 · Accepted Answer · edited May 23 '17 at 12:16

This is a pivot, if I understand your idea correctly. With pandas it looks as follows.

Load data:

import pandas as pd
# fname is the path to the whitespace-separated text file (no header row)
df = pd.read_csv(fname, sep=r'\s+', header=None)
df.columns = ['User','Item','ItemRating']

Pivot it:

>>> df
   User  Item  ItemRating
0     1    23           3
1     2   204           4
2     1   492           2
3     3    23           4
>>> df.pivot(index='User', columns='Item', values='ItemRating')
Item  23   204  492
User
1       3  NaN    2
2     NaN    4  NaN
3       4  NaN  NaN
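
If a plain NumPy array is needed afterwards, the NaN entries in the pivoted frame (likely the source of the "NULL errors" mentioned in the question) can be filled first. This is a sketch under assumptions: the sample data is inlined through `StringIO`, and filling with 0 presumes that 0 is never a real rating.

```python
from io import StringIO

import pandas as pd

# Sample data from the question, inlined instead of read from a file.
raw = StringIO("1 23 3\n2 204 4\n1 492 2\n3 23 4\n")
df = pd.read_csv(raw, sep=r"\s+", header=None,
                 names=["User", "Item", "ItemRating"])

wide = df.pivot(index="User", columns="Item", values="ItemRating")

# User/item pairs without a rating are NaN after the pivot.
# Assumption: 0 is not a valid rating, so it can stand in for "missing".
ratings = wide.fillna(0).astype(int).to_numpy()
print(ratings)
```

Note that `pivot` raises a `ValueError` when the same user/item pair occurs more than once; in that case `pivot_table` with an aggregation function such as `aggfunc="mean"` is one possible alternative.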

For a NumPy example, let's emulate the file with StringIO:

import numpy as np
from io import StringIO  # on Python 2: from StringIO import StringIO
data ="""1     23    3
2     204   4
1     492   2
3     23    4"""

and load it:

>>> arr = np.genfromtxt(StringIO(data), dtype=int)
>>> arr
array([[  1,  23,   3],
       [  2, 204,   4],
       [  1, 492,   2],
       [  3,  23,   4]])

The pivot is based on this answer:

# Map each distinct user/item id to its row/column position
rows, row_pos = np.unique(arr[:, 0], return_inverse=True)
cols, col_pos = np.unique(arr[:, 1], return_inverse=True)
# Start from zeros, then scatter the ratings into place
pivot_table = np.zeros((len(rows), len(cols)), dtype=arr.dtype)
pivot_table[row_pos, col_pos] = arr[:, 2]

and the result:

>>> pivot_table
array([[ 3,  0,  2],
       [ 0,  4,  0],
       [ 4,  0,  0]])

Note that the results differ: in the second approach, user/item pairs without a rating are set to zero rather than NaN.

Select one that suits you better ;)
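
For a NumPy-only version that keeps missing ratings distinguishable from a rating of zero, the steps above could be wrapped roughly as follows. This is a sketch, not the answer's original code: the function name `ratings_matrix` and its `fill` parameter are hypothetical, and it assumes three whitespace-separated integer columns with no header.

```python
from io import StringIO

import numpy as np


def ratings_matrix(source, fill=np.nan):
    """Load user/item/rating triples and pivot them into a 2-D float array.

    `source` is a filename or file-like object. Cells without a rating
    are set to `fill` (NaN by default).
    """
    arr = np.atleast_2d(np.genfromtxt(source, dtype=int))
    # Distinct ids, plus the row/column position of every input line.
    users, row_pos = np.unique(arr[:, 0], return_inverse=True)
    items, col_pos = np.unique(arr[:, 1], return_inverse=True)
    # A float table is used so that NaN can mark missing entries.
    table = np.full((len(users), len(items)), fill, dtype=float)
    table[row_pos, col_pos] = arr[:, 2]
    return users, items, table


sample = StringIO("1 23 3\n2 204 4\n1 492 2\n3 23 4\n")
users, items, table = ratings_matrix(sample)
print(users)  # row labels
print(items)  # column labels
print(table)
```

Returning `users` and `items` alongside the table keeps the mapping from array positions back to the original ids, which the bare array does not carry. If a user/item pair appears more than once, the last occurrence assigned wins in practice, so duplicates may need to be aggregated beforehand.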

answered Nov 20 '13 at 19:12

alko


Heh, glad you were able to parse that :) – askewchan Nov 20 '13 at 19:15
@askewchan If you mean parsing raw data to dataframe, I do it each second pandas question, so just need to consult my own previous answers :) – alko Nov 20 '13 at 19:17
I meant understanding the question :P – askewchan Nov 20 '13 at 19:21
thanks and sorry if there was confusion in my problem statement. i am just learning python and programming concurrently and have little experience in stating the problem statement correctly. Do you need to bring in any additional add-in-s like panda, etc... – user2822055 Nov 20 '13 at 19:53
If I have a text file, can I just read in the data by np.genfromtxt()? Then perform the pivot? – user2822055 Nov 20 '13 at 19:54
thank you so much alko. unfortunately i need to use only NumPy commands to perform this task since I have to perform some classification afterwards using previously learned numpy functions. – user2822055 Nov 20 '13 at 20:09
@user2822055 see for numpy and difference in result. Btw, you can get numpy underlying array for a dataframe with `df.values` – alko Nov 20 '13 at 20:14
@user2822055 It's not that you asked poorly, it's that I wasn't familiar with a pivot table until alko used the word and I was able to look it up :P – askewchan Nov 20 '13 at 20:57
