Reformatting dataset for Sankey in Pandas

Question

I have my data in melted Pandas dataframe (code for data below):

student	course	order
Jerry	A	1
Jerry	B	2
Jerry	C	NaN
Jessy	C	1
Jessy	A	2
Jessy	B	3
Raphael	A	1
Raphael	C	2
Raphael	C	3
Raphael	B	4
Sally	A	1
Sally	B	2
Sally	C	NaN

A Sankey requires a format like this:

course1	course2	course3	course4	count
A	B			2
A	C	C	B	1
C	A	B		1

I can't wrap my head around how to create columns for each level of order and to populate that with the values of course while also creating the count column that counts the number of students with that same sequence.

If I try df.groupby('order')['course'].count() then it returns the groups as rows, not columns like I need.

order
1.0    2682
2.0     578
3.0     197
4.0      89
5.0      27
6.0       8
7.0       1
Name: course, dtype: int64

It also doesn't create the sets of sequences that will need to populate the final table.

Can someone please help me reformat my long table into one with all of the counts of the sequences of the courses?

Any help is much appreciated.

Toy data:

student = ['Jerry','Jerry','Jerry','Jessy','Jessy','Jessy','Raphael','Raphael','Raphael','Raphael','Sally','Sally','Sally']
course = ['A','B','C','C','A','B','A','C','C','B','A','B','C']
order = [1,2,np.NaN,1,2,3,1,2,3,4,1,2,np.NaN]
df = pd.DataFrame({'student':student, 'course':course,'order':order})

To make sure that I'm tracking, I think of a sankey as source, target, value sets, similar to the plotly setup https://plotly.com/python/sankey-diagram/ . So do you want the population total in each course over 4 periods of time? — Docuemada, Feb 19 '21 at 22:46
@Docuemada: that's a good point. You are exactly right: getting the data into source-target-count is what plotly needs. The Medium article (https://medium.com/kenlok/how-to-create-sankey-diagrams-from-dataframes-in-python-e221c1b4d6b0) has code that creates that format from the table I asked for, so I went in that direction. I would love help with getting to the source-target-count format, if you know how to do that. — NLR, Feb 19 '21 at 23:08

score 1 · Answer 1 · answered Feb 20 '21 at 04:33

The number of steps could be a little less, but I created the following flow.

Delete the Na value and add the course name column.
Convert to horizontal format by course name
Combine all course names in a string
Aggregate by all course strings
Combine the original data frame and the aggregated data frame
Delete duplicate rows and rename columns

df.dropna(axis=0, how='any', inplace=True)
df['course_gp'] = df['order'].apply(lambda x: 'course' + str(int(x)))
df = df.pivot(index='student', columns='course_gp', values='course')
df.fillna('', inplace=True)
df['course_all'] = df['course1'] + df['course2'] + df['course3'] + df['course4']
dfc = df.groupby('course_all').count()
df = df.merge(dfc[['course1']], left_on='course_all', right_on='course_all', how='inner' )
df.drop_duplicates(keep='first', inplace=True)
df.rename({'course1_y':'count','course1_x':'course1'}, axis=1, inplace=True)

	course1	course2	course3	course4	course_all	count
0	A	B			AB	2
2	C	A	B		CAB	1
3	A	C	C	B	ACCB	1

The pivot sets the index to email and returns this error: ValueError: Index contains duplicate entries, cannot reshape — NLR, Feb 23 '21 at 20:11
I think the pivot will error out if there are duplicate indexes, so you will need to code in the data you actually want to deal with. Please add some sample data of the actual data structure. — r-beginners, Feb 24 '21 at 00:27

Reformatting dataset for Sankey in Pandas

1 Answers1

Linked