13

I am struggling to figure out how to develop a square matrix given a format like

a a 0
a b 3
a c 4
a d 12
b a 3 
b b 0
b c 2
...

To something like:

  a b c d e
a 0 3 4 12 ... 
b 3 0 2 7 ... 
c 4 3 0 .. .
d 12 ...  
e . .. 

in pandas. I developed a method which I thinks works but takes forever to run because it has to iterate through each column and row for every value starting from the beginning each time using for loops. I feel like I'm definitely reinventing the wheel here. This also isnt realistic for my dataset given how many columns and rows there are. Is there something similar to R's cast function in python which can do this significantly faster?

WolVes
  • 1,286
  • 2
  • 19
  • 39

1 Answers1

20

You could use df.pivot:

import pandas as pd

df = pd.DataFrame([['a', 'a', 0],
                   ['a', 'b', 3],
                   ['a', 'c', 4],
                   ['a', 'd', 12],
                   ['b', 'a', 3],
                   ['b', 'b', 0],
                   ['b', 'c', 2]], columns=['X','Y','Z'])

print(df.pivot(index='X', columns='Y', values='Z'))

yields

Y    a    b    c     d
X                     
a  0.0  3.0  4.0  12.0
b  3.0  0.0  2.0   NaN

Here, index='X' tells df.pivot to use the column labeled 'X' as the index, and columns='Y' tells it to use the column labeled 'Y' as the column index.

See the docs for more on pivot and other reshaping methods.


Alternatively, you could use pd.crosstab:

print(pd.crosstab(index=df.iloc[:,0], columns=df.iloc[:,1], 
                  values=df.iloc[:,2], aggfunc='sum'))

Unlike df.pivot which expects each (a1, a2) pair to be unique, pd.crosstab (with agfunc='sum') will aggregate duplicate pairs by summing the associated values. Although there are no duplicate pairs in your posted example, specifying how duplicates are supposed to be aggregated is required when the values parameter is used.

Also, whereas df.pivot is passed column labels, pd.crosstab is passed array-likes (such as whole columns of df). df.iloc[:, i] is the ith column of df.

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • Given the orignal DF, the answer could be confusing to those not yet comfortable with pandas syntax. The parameters given inside the df.pivot are the names of the original df columns, not the indicies. So be sure to read it as "The column named '0' as the index" and not "The first (0th) column is the index" – Jeff Ellen Jul 27 '18 at 01:59
  • @JeffEllen: Good point. I've made an edit to hopefully make this clearer. – unutbu Jul 27 '18 at 02:21