0

I have my data in melted Pandas dataframe (code for data below):

student course order
Jerry A 1
Jerry B 2
Jerry C NaN
Jessy C 1
Jessy A 2
Jessy B 3
Raphael A 1
Raphael C 2
Raphael C 3
Raphael B 4
Sally A 1
Sally B 2
Sally C NaN

A Sankey requires a format like this:

course1 course2 course3 course4 count
A B 2
A C C B 1
C A B 1

I can't wrap my head around how to create columns for each level of order and to populate that with the values of course while also creating the count column that counts the number of students with that same sequence.

If I try df.groupby('order')['course'].count() then it returns the groups as rows, not columns like I need.

order
1.0    2682
2.0     578
3.0     197
4.0      89
5.0      27
6.0       8
7.0       1
Name: course, dtype: int64

It also doesn't create the sets of sequences that will need to populate the final table.

Can someone please help me reformat my long table into one with all of the counts of the sequences of the courses?

Any help is much appreciated.

Toy data:

student = ['Jerry','Jerry','Jerry','Jessy','Jessy','Jessy','Raphael','Raphael','Raphael','Raphael','Sally','Sally','Sally']
course = ['A','B','C','C','A','B','A','C','C','B','A','B','C']
order = [1,2,np.NaN,1,2,3,1,2,3,4,1,2,np.NaN]
df = pd.DataFrame({'student':student, 'course':course,'order':order})
NLR
  • 1,714
  • 2
  • 11
  • 21
  • 1
    To make sure that I'm tracking, I think of a sankey as source, target, value sets, similar to the plotly setup https://plotly.com/python/sankey-diagram/ . So do you want the population total in each course over 4 periods of time? – Docuemada Feb 19 '21 at 22:46
  • @Docuemada: that's a good point. You are exactly right: getting the data into source-target-count is what plotly needs. The Medium article (https://medium.com/kenlok/how-to-create-sankey-diagrams-from-dataframes-in-python-e221c1b4d6b0) has code that creates that format from the table I asked for, so I went in that direction. I would love help with getting to the source-target-count format, if you know how to do that. – NLR Feb 19 '21 at 23:08

1 Answers1

1

The number of steps could be a little less, but I created the following flow.

  1. Delete the Na value and add the course name column.
  2. Convert to horizontal format by course name
  3. Combine all course names in a string
  4. Aggregate by all course strings
  5. Combine the original data frame and the aggregated data frame
  6. Delete duplicate rows and rename columns
df.dropna(axis=0, how='any', inplace=True)
df['course_gp'] = df['order'].apply(lambda x: 'course' + str(int(x)))
df = df.pivot(index='student', columns='course_gp', values='course')
df.fillna('', inplace=True)
df['course_all'] = df['course1'] + df['course2'] + df['course3'] + df['course4']
dfc = df.groupby('course_all').count()
df = df.merge(dfc[['course1']], left_on='course_all', right_on='course_all', how='inner' )
df.drop_duplicates(keep='first', inplace=True)
df.rename({'course1_y':'count','course1_x':'course1'}, axis=1, inplace=True)
course1 course2 course3 course4 course_all count
0 A B AB 2
2 C A B CAB 1
3 A C C B ACCB 1
r-beginners
  • 31,170
  • 3
  • 14
  • 32
  • The pivot sets the index to email and returns this error: ValueError: Index contains duplicate entries, cannot reshape – NLR Feb 23 '21 at 20:11
  • I think the pivot will error out if there are duplicate indexes, so you will need to code in the data you actually want to deal with. Please add some sample data of the actual data structure. – r-beginners Feb 24 '21 at 00:27