I have my data in melted Pandas dataframe (code for data below):
student | course | order |
---|---|---|
Jerry | A | 1 |
Jerry | B | 2 |
Jerry | C | NaN |
Jessy | C | 1 |
Jessy | A | 2 |
Jessy | B | 3 |
Raphael | A | 1 |
Raphael | C | 2 |
Raphael | C | 3 |
Raphael | B | 4 |
Sally | A | 1 |
Sally | B | 2 |
Sally | C | NaN |
A Sankey requires a format like this:
course1 | course2 | course3 | course4 | count |
---|---|---|---|---|
A | B | 2 | ||
A | C | C | B | 1 |
C | A | B | 1 |
I can't wrap my head around how to create columns for each level of order
and to populate that with the values of course
while also creating the count
column that counts the number of students with that same sequence.
If I try df.groupby('order')['course'].count()
then it returns the groups as rows, not columns like I need.
order
1.0 2682
2.0 578
3.0 197
4.0 89
5.0 27
6.0 8
7.0 1
Name: course, dtype: int64
It also doesn't create the sets of sequences that will need to populate the final table.
Can someone please help me reformat my long table into one with all of the counts of the sequences of the courses?
Any help is much appreciated.
Toy data:
student = ['Jerry','Jerry','Jerry','Jessy','Jessy','Jessy','Raphael','Raphael','Raphael','Raphael','Sally','Sally','Sally']
course = ['A','B','C','C','A','B','A','C','C','B','A','B','C']
order = [1,2,np.NaN,1,2,3,1,2,3,4,1,2,np.NaN]
df = pd.DataFrame({'student':student, 'course':course,'order':order})