I have data on the sequence of courses taken by students and I would like to represent the flows between classes using a Sankey diagram. My data is in a Pandas dataframe in long format, where each step that a student took has its own row and the order of those steps is given by the order column:
student | course | order |
---|---|---|
Jerry | A | 1 |
Jerry | B | 2 |
Jerry | C | NaN |
Jessy | C | 1 |
Jessy | A | 2 |
Jessy | B | 3 |
Raphael | A | 1 |
Raphael | C | 2 |
Raphael | C | 3 |
Raphael | B | 4 |
Sally | A | 1 |
Sally | B | 2 |
Sally | C | NaN |
I pivoted this table to aggregate it into sequences with the count of each sequence:
course1 | course2 | course3 | course4 | count |
---|---|---|---|---|
A | B | End | End | 2 |
A | C | C | B | 1 |
C | A | B | End | 1 |
Note that I want to retain the End value, but if this causes problems, I am happy to abandon it and just have people stop at their last step.
Building a Sankey diagram in Plotly requires the links to be given as source, target, and value lists. Here is the example from the Plotly docs:
source = [0, 1, 0, 2, 3, 3]
target = [2, 3, 3, 4, 4, 5]
value = [8, 4, 2, 8, 4, 2]
I need to get my data into the format above, but for my entire dataframe.
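To make the format concrete: as far as I understand it, each entry in source and target is an integer index into a list of node labels, and value is the size of the flow between those two nodes. The snippet below is only a minimal sketch that decodes the lists back into readable links; the labels list is an assumption added for illustration.

```python
# Node labels, assumed here for illustration; the integers in
# source/target are positions in this list.
labels = ["A1", "A2", "B1", "B2", "C1", "C2"]

source = [0, 1, 0, 2, 3, 3]
target = [2, 3, 3, 4, 4, 5]
value = [8, 4, 2, 8, 4, 2]

# Decode every link into (source label, target label, flow size).
links = [(labels[s], labels[t], v) for s, t, v in zip(source, target, value)]

for src, dst, v in links:
    print(f"{src} -> {dst}: {v}")
```

So the i-th link runs from labels[source[i]] to labels[target[i]] and carries value[i] units.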
If I were dealing with a small dataset, like the toy one above, I could create this manually. However, my dataframe has thousands of rows and I have no idea how to do this in Pandas. It looks like some sort of window calculation over each student's rows.
I am also kinda confused about how to create the integer coding scheme for the nodes, because my failed attempts produce a viz that doesn't show a sequence of four courses.
Any help is much appreciated.
Toy data:
import numpy as np
import pandas as pd

student = ['Jerry','Jerry','Jerry','Jessy','Jessy','Jessy','Raphael','Raphael','Raphael','Raphael','Sally','Sally','Sally']
course = ['A','B','C','C','A','B','A','C','C','B','A','B','C']
# np.nan rather than np.NaN: the NaN alias was removed in NumPy 2.0
order = [1,2,np.nan,1,2,3,1,2,3,4,1,2,np.nan]
df = pd.DataFrame({'student':student, 'course':course,'order':order})
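One possible way to get from the long format to source/target/value is sketched below. It is not a definitive solution and rests on a few assumptions: rows with a missing order are dropped, every student's last course flows into a single shared End node, and a node is identified by course plus step number so that C at step 2 and C at step 3 stay separate. The names steps, node, next_node, and links are made up for this sketch; the toy data is rebuilt so the block runs on its own.

```python
import numpy as np
import pandas as pd

student = ['Jerry', 'Jerry', 'Jerry', 'Jessy', 'Jessy', 'Jessy', 'Raphael',
           'Raphael', 'Raphael', 'Raphael', 'Sally', 'Sally', 'Sally']
course = ['A', 'B', 'C', 'C', 'A', 'B', 'A', 'C', 'C', 'B', 'A', 'B', 'C']
order = [1, 2, np.nan, 1, 2, 3, 1, 2, 3, 4, 1, 2, np.nan]
df = pd.DataFrame({'student': student, 'course': course, 'order': order})

# Assumption: rows without an order are not real steps, so drop them.
steps = df.dropna(subset=['order']).sort_values(['student', 'order']).copy()

# Number the steps within each student (1, 2, 3, ...).
steps['step'] = steps.groupby('student').cumcount() + 1

# A node is "course at step n", so the same course at different
# positions in the sequence becomes a different node.
steps['node'] = steps['course'] + ' (step ' + steps['step'].astype(str) + ')'

# The "window" part: look one row ahead within each student.
steps['next_node'] = steps.groupby('student')['node'].shift(-1)

# Assumption: a student's final course flows into a shared "End" node.
steps['next_node'] = steps['next_node'].fillna('End')

# Count how many students follow each node -> next_node transition.
links = (steps.groupby(['node', 'next_node'])
              .size()
              .reset_index(name='value'))

# Integer coding scheme: position of each label in a sorted label list.
labels = sorted(set(links['node']) | set(links['next_node']))
index = {label: i for i, label in enumerate(labels)}

source = links['node'].map(index).tolist()
target = links['next_node'].map(index).tolist()
value = links['value'].tolist()
```

The resulting lists should then be usable with plotly.graph_objects.Sankey, passing node=dict(label=labels) and link=dict(source=source, target=target, value=value). If the End node is not wanted, dropping the rows where next_node is missing instead of filling them would make each student simply stop at their last step.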