5

I am trying to do a cross tab based on one column where a third column matches. Take the example data:

df = pd.DataFrame({'demographic' : ['A', 'B', 'B', 'A', 'C', 'C'],
                'id_match' : ['101', '101', '201', '201', '26', '26'],
                'time' : ['10', '10', '16', '16', '1', '1']})

where id_match matches i want to find the resulting sum of of the time for the cross tab of the demographic column. The output would look like this:

  A  B  C
A 0  52 0
B 52 0  0
C 0  0  2

Hopefully that makes sense, comment if not. Thanks J

cs95
  • 379,657
  • 97
  • 704
  • 746
JDraper
  • 349
  • 2
  • 11
  • 2
    Can you please re-check the output? Are those values correct? – cs95 Dec 17 '18 at 17:27
  • Looks ok to me, 52 is from the first four lines of the table: 10 + 10 + 16 + 16, where the id matches twice in the four rows and both matching instances correspond to A / B. 2 is from the id_match of 26, summing the time gets me 2 – JDraper Dec 17 '18 at 17:32
  • Hmm, okay... I thought it would be 26. Maybe I was wrong. – cs95 Dec 17 '18 at 17:40
  • This is what I came up with: https://pastebin.com/raw/h9ztsz8L Maybe you can modify it to fit your need. – cs95 Dec 17 '18 at 18:25
  • @coldspeed thanks for your help, that worked. If you post your comment as an answer i will accept – JDraper Dec 18 '18 at 09:58

1 Answers1

1

You can solve this using merge and crosstab:

u = df.reset_index()
v = u.merge(u, on='id_match').query('index_x != index_y')
r = pd.crosstab(v.demographic_x, 
                v.demographic_y, 
                v.time_x.astype(int) + v.time_y.astype(int), 
                aggfunc='sum')

print(r)
demographic_y     A     B    C
demographic_x                 
A               NaN  52.0  NaN
B              52.0   NaN  NaN
C               NaN   NaN  4.0

If you need the NaNs filled in with zeros, you can use fillna:

r.fillna(0, downcast='infer')

demographic_y   A   B  C
demographic_x           
A               0  52  0
B              52   0  0
C               0   0  4
cs95
  • 379,657
  • 97
  • 704
  • 746
  • I think my original output was correct. It just depends on what the user wants. In the code you provided me I added an additional step to sum the time_x and time_y to give me to the total time for each id_match. Thanks for your help! J – JDraper Dec 18 '18 at 10:20
  • @JDraper I see. That makes more sense. Let me edit in a sec. – cs95 Dec 18 '18 at 11:11