Format data in Pandas for multi-level Sankey in Plotly: source and target columns

Question

I have data on the sequence of courses taken by students, and I would like to represent the flows between classes using a Sankey diagram. My data is in a Pandas dataframe in long format, with one row per step a student took; the order of those steps is given by the `order` column:

student	course	order
Jerry	A	1
Jerry	B	2
Jerry	C	NaN
Jessy	C	1
Jessy	A	2
Jessy	B	3
Raphael	A	1
Raphael	C	2
Raphael	C	3
Raphael	B	4
Sally	A	1
Sally	B	2
Sally	C	NaN

I pivoted this table to aggregate it into sequences with the count of each sequence:

course1	course2	course3	course4	count
A	B	End	End	2
A	C	C	B	1
C	A	B	End	1

Note that I want to retain the End value, but if this causes problems, I am happy to abandon this and just have people stop at a step.
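For reference, here is a minimal sketch of one way the pivot above could be done. It assumes that rows with a missing order are not part of the sequence and are dropped, which is what the aggregated table suggests; the names `steps`, `wide`, and `seq_counts` are made up for illustration.

```python
import numpy as np
import pandas as pd

student = ['Jerry', 'Jerry', 'Jerry', 'Jessy', 'Jessy', 'Jessy',
           'Raphael', 'Raphael', 'Raphael', 'Raphael',
           'Sally', 'Sally', 'Sally']
course = ['A', 'B', 'C', 'C', 'A', 'B', 'A', 'C', 'C', 'B', 'A', 'B', 'C']
order = [1, 2, np.nan, 1, 2, 3, 1, 2, 3, 4, 1, 2, np.nan]
df = pd.DataFrame({'student': student, 'course': course, 'order': order})

# Assumption: a row with a missing order is not a step in the sequence.
steps = df.dropna(subset=['order']).copy()
steps['order'] = steps['order'].astype(int)

# One row per student, one column per step; steps a student never
# reached are filled with 'End'.
wide = (steps.pivot(index='student', columns='order', values='course')
             .fillna('End'))
wide.columns = [f'course{i}' for i in wide.columns]

# Count how many students share each complete sequence.
seq_counts = (wide.groupby(list(wide.columns))
                  .size()
                  .reset_index(name='count'))
print(seq_counts)
```

On the toy data this should give the three sequences listed above, with counts 2, 1, and 1.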

Building a Sankey in Plotly requires a data format with the source and target. Here is the example on the Plotly docs.

source = [0, 1, 0, 2, 3, 3]
target = [2, 3, 3, 4, 4, 5]
value = [8, 4, 2, 8, 4, 2]

I need to get my data into the format above, but for my entire dataframe.

If I were dealing with a small dataset, like the toy one above, I could create this manually. However, my real dataframe has thousands of rows, and I don't know how to do this in Pandas. It looks like some sort of window calculation over adjacent steps, but I can't work out how to write it.

I am also kinda confused about how to create the coding scheme because my failed attempts create a viz that doesn't have a sequence of four courses.
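To make the coding scheme concrete, here is a rough sketch of the kind of transformation I have in mind, starting from the aggregated table above. It rests on two assumptions that may not be the right design: each node is a (step, course) pair, so that A at step 1 and A at step 2 are separate nodes, and End-to-End flows are dropped. The label format and the names `seq`, `links`, `labels`, and `code` are made up for illustration.

```python
import pandas as pd

# The aggregated sequence table from above.
seq = pd.DataFrame({
    'course1': ['A', 'A', 'C'],
    'course2': ['B', 'C', 'A'],
    'course3': ['End', 'C', 'B'],
    'course4': ['End', 'B', 'End'],
    'count':   [2, 1, 1],
})
step_cols = ['course1', 'course2', 'course3', 'course4']

# Build one link table per pair of adjacent steps. Prefixing each course
# with its step number keeps the same course at different steps apart.
pairs = []
for i in range(len(step_cols) - 1):
    left, right = step_cols[i], step_cols[i + 1]
    pair = pd.DataFrame({
        'source_label': f'{i + 1}: ' + seq[left],
        'target_label': f'{i + 2}: ' + seq[right],
        'value': seq['count'],
    })
    # Assumption: a flow that starts at End carries no information.
    pairs.append(pair[seq[left] != 'End'])

# Sum the counts of sequences that share the same step-to-step link.
links = (pd.concat(pairs)
           .groupby(['source_label', 'target_label'], as_index=False)['value']
           .sum())

# Integer coding: each distinct label is coded by its position in `labels`.
labels = sorted(set(links['source_label']) | set(links['target_label']))
code = {label: i for i, label in enumerate(labels)}
source = links['source_label'].map(code).tolist()
target = links['target_label'].map(code).tolist()
value = links['value'].tolist()
```

If this is on the right track, the lists would presumably go into `plotly.graph_objects.Sankey(node=dict(label=labels), link=dict(source=source, target=target, value=value))`.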

Any help is much appreciated.

Toy data:

import numpy as np
import pandas as pd

# Long format: one row per student per step; NaN marks a row with no recorded order
student = ['Jerry','Jerry','Jerry','Jessy','Jessy','Jessy','Raphael','Raphael','Raphael','Raphael','Sally','Sally','Sally']
course = ['A','B','C','C','A','B','A','C','C','B','A','B','C']
order = [1,2,np.nan,1,2,3,1,2,3,4,1,2,np.nan]
df = pd.DataFrame({'student':student, 'course':course,'order':order})

I thought it was a new question, but what is different about [this question](https://stackoverflow.com/questions/66285745/reformatting-dataset-for-sankey-in-pandas)? — r-beginners, Feb 25 '21 at 02:35
@r-beginners: they are different questions that ask for different solutions. — NLR, Feb 25 '21 at 17:55
I suggest that you actually read the questions. Once you do, you'll see that they ask for different output tables. — NLR, Feb 26 '21 at 20:05
Is it that 'End' has been added? Then just change it to `df.fillna('End',inplace=True)`. — r-beginners, Feb 27 '21 at 02:10
No. If you read the question, you will see bold letters that specify the desired end result. "Building a Sankey in Plotly requires a data format with the source and target. Here is the example on the Plotly docs. source = [0, 1, 0, 2, 3, 3], target = [2, 3, 3, 4, 4, 5], value = [8, 4, 2, 8, 4, 2] I need to get my data into the format above, but for my entire dataframe." — NLR, Feb 27 '21 at 17:20
Was it for Sankey diagrams? By the way, the previous question was about data tableting, is my answer acceptable? — r-beginners, Feb 28 '21 at 02:17
