I am performing some classification using user/item/rating data. My issue is: how do I convert these 3 columns into a matrix with users as rows, items as columns, and the ratings populating the matrix?

User  Item  ItemRating
1     23    3
2     204   4
1     492   2
3     23    4

And so on. I tried using a DataFrame but was getting NULL errors.

askewchan
user2822055
  • How is it stored now? Is that a text file, or some kind of numpy or pandas object? – askewchan Nov 20 '13 at 19:02
  • And i can only use NumPy to perform this operation reading from a text file. There is no header information in the text file. – user2822055 Nov 20 '13 at 20:06
  • You can then load it using `arr = np.genfromtxt(filename, dtype=int)`. If the first row does say `"User Item ItemRating"`, then use `arr = np.genfromtxt(filename, skip_header=1, dtype=int)` – askewchan Nov 20 '13 at 20:07
  • If you want to build the entire thing using numpy only (no pandas dependency) see: http://stackoverflow.com/q/17028329/1730674 – askewchan Nov 20 '13 at 20:08
  • great! thanks again for the knowledge. very much appreciated. – user2822055 Nov 20 '13 at 20:48
  • Hi. I am getting an error when I use this as my document read command: `arr = np.genfromtxt("u_clean.data", skip_header=1)`. The error is: `ValueError: Some errors were detected ! Line #3 (got 3 columns instead of 1)`, etc., for every odd-numbered line. – user2822055 Nov 21 '13 at 03:00
  • Your file must have the same number of values per line (to fill a uniform array). It seems like one of your lines has only `1` value in it, but then a later line (#3) has three values in it. – askewchan Nov 21 '13 at 03:32
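
To track down which lines break the "same number of values per line" rule from the comment above, a small check like the following may help. This is only a sketch: `find_ragged_lines` is a hypothetical helper (not part of NumPy), it assumes whitespace-separated fields, and the sample text stands in for the real file.

```python
from io import StringIO


def find_ragged_lines(lines, expected=3):
    """Return (line_number, field_count) for every line that does not
    contain exactly `expected` whitespace-separated fields."""
    bad = []
    for number, line in enumerate(lines, start=1):
        count = len(line.split())
        if count != expected:
            bad.append((number, count))
    return bad


# Hypothetical sample with one blank line and one stray single-value line.
sample = StringIO("1 23 3\n\n2 204 4\nheader\n1 492 2\n")
problems = find_ragged_lines(sample)
print(problems)
```

For a real file, the same function can be given an open file handle, for example `with open("u_clean.data") as fh: find_ragged_lines(fh)`. Note that it reports blank lines as well.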

1 Answer

19

If I understand you correctly, this is a pivot. With pandas it works as follows.

Load data:

import pandas as pd
df = pd.read_csv(fname, sep=r'\s+', header=None)  # raw string: whitespace-delimited columns
df.columns = ['User','Item','ItemRating']

Pivot it:

>>> df
   User  Item  ItemRating
0     1    23           3
1     2   204           4
2     1   492           2
3     3    23           4
>>> df.pivot(index='User', columns='Item', values='ItemRating')
Item  23   204  492
User
1       3  NaN    2
2     NaN    4  NaN
3       4  NaN  NaN
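
If a plain array with zeros instead of NaN is wanted from the pandas route, something along these lines might work. It is a self-contained sketch: the variable names `raw`, `ratings`, `dense`, and `matrix` are made up for illustration, `StringIO` stands in for the real file, and it assumes each (User, Item) pair appears at most once.

```python
from io import StringIO

import pandas as pd

# Sample data from the question; StringIO stands in for the real file.
raw = """1     23    3
2     204   4
1     492   2
3     23    4"""

df = pd.read_csv(StringIO(raw), sep=r"\s+", header=None,
                 names=["User", "Item", "ItemRating"])

# pivot() raises ValueError if a (User, Item) pair occurs more than once.
ratings = df.pivot(index="User", columns="Item", values="ItemRating")

# Replace the NaN placeholders with 0 and return to integer ratings.
dense = ratings.fillna(0).astype(int)

# Plain NumPy array for downstream code.
matrix = dense.to_numpy()
print(matrix)
```

Filling with 0 makes "not rated" indistinguishable from a rating of 0, so this only makes sense if 0 is not a valid rating in the data.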

For a NumPy example, let's emulate the file with StringIO:

import numpy as np
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
data = """1     23    3
2     204   4
1     492   2
3     23    4"""

and load it:

>>> arr = np.genfromtxt(StringIO(data), dtype=int)
>>> arr
array([[  1,  23,  3],
       [  2, 204,  4],
       [  1, 492,  2],
       [  3,  23,  4]])

The pivot is based on this answer:

rows, row_pos = np.unique(arr[:, 0], return_inverse=True)
cols, col_pos = np.unique(arr[:, 1], return_inverse=True)
pivot_table = np.zeros((len(rows), len(cols)), dtype=arr.dtype)
pivot_table[row_pos, col_pos] = arr[:, 2]

and the result:

>>> pivot_table
array([[ 3,  0,  2],
       [ 0,  4,  0],
       [ 4,  0,  0]])

Note that the results differ: in the pandas approach missing ratings are NaN, while in the NumPy approach they are set to zero.
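
If the NumPy version should mark missing ratings the way pandas does, one option is to fill the table with NaN instead of zeros. This is a sketch under the assumption that a float array is acceptable; `pivot_nan` and `rated` are illustrative names, and the array is typed in directly rather than loaded from a file.

```python
import numpy as np

# Same sample data as above, as (User, Item, ItemRating) rows.
arr = np.array([[1, 23, 3],
                [2, 204, 4],
                [1, 492, 2],
                [3, 23, 4]])

# Unique users/items, plus each row's position in the sorted unique values.
rows, row_pos = np.unique(arr[:, 0], return_inverse=True)
cols, col_pos = np.unique(arr[:, 1], return_inverse=True)

# A float dtype is needed because NaN has no integer representation.
pivot_nan = np.full((len(rows), len(cols)), np.nan)
pivot_nan[row_pos, col_pos] = arr[:, 2]

# Boolean mask of the cells that actually received a rating.
rated = ~np.isnan(pivot_nan)
print(pivot_nan)
```

The `rated` mask can then be used to restrict later computations to real ratings, for example `pivot_nan[rated].mean()`.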

Select one that suits you better ;)

Community
alko
  • Heh, glad you were able to parse that :) – askewchan Nov 20 '13 at 19:15
  • @askewchan If you mean parsing raw data to dataframe, I do it each second pandas question, so just need to consult my own previous answers :) – alko Nov 20 '13 at 19:17
  • I meant understanding the question :P – askewchan Nov 20 '13 at 19:21
  • Thanks, and sorry if there was confusion in my problem statement. I am learning Python and programming concurrently and have little experience in stating the problem correctly. Do you need to bring in any additional add-ins like pandas, etc... – user2822055 Nov 20 '13 at 19:53
  • If I have a text file, can I just read in the data with np.genfromtxt() and then perform the pivot? – user2822055 Nov 20 '13 at 19:54
  • thank you so much alko. unfortunately i need to use only NumPy commands to perform this task since I have to perform some classification afterwards using previously learned numpy functions. – user2822055 Nov 20 '13 at 20:09
  • @user2822055 See the answer for a NumPy version and the difference in results. Btw, you can get the underlying NumPy array of a dataframe with `df.values` – alko Nov 20 '13 at 20:14
  • @user2822055 It's not that you asked poorly, it's that I wasn't familiar with a pivot table until alko used the word and I was able to look it up :P – askewchan Nov 20 '13 at 20:57