
I have data on the sequence of courses taken by students, and I would like to represent the flows between classes using a Sankey diagram. My data is in a pandas DataFrame in long format: each step that a student took has its own row, and the position of that step in the sequence is given by the `order` column:

student course order
Jerry A 1
Jerry B 2
Jerry C NaN
Jessy C 1
Jessy A 2
Jessy B 3
Raphael A 1
Raphael C 2
Raphael C 3
Raphael B 4
Sally A 1
Sally B 2
Sally C NaN

I pivoted this table to aggregate it into sequences with the count of each sequence:

course1 course2 course3 course4 count
A B End End 2
A C C B 1
C A B End 1
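
For reference, a minimal sketch of how this aggregation could be reproduced from the long table. It assumes that rows with a missing `order` are not part of a student's sequence and are dropped, and that missing later steps are filled with `End`; the names `steps`, `wide`, and `seq_counts` are only illustrative.

```python
import numpy as np
import pandas as pd

student = ['Jerry', 'Jerry', 'Jerry', 'Jessy', 'Jessy', 'Jessy',
           'Raphael', 'Raphael', 'Raphael', 'Raphael', 'Sally', 'Sally', 'Sally']
course = ['A', 'B', 'C', 'C', 'A', 'B', 'A', 'C', 'C', 'B', 'A', 'B', 'C']
order = [1, 2, np.nan, 1, 2, 3, 1, 2, 3, 4, 1, 2, np.nan]
df = pd.DataFrame({'student': student, 'course': course, 'order': order})

# Assumption: a row without an order is not a step in the sequence.
steps = df.dropna(subset=['order']).astype({'order': int})

# One row per student, one column per step; shorter sequences are padded with 'End'.
wide = (
    steps.pivot(index='student', columns='order', values='course')
    .fillna('End')
    .add_prefix('course')
)

# Count how many students share each full sequence.
seq_counts = (
    wide.groupby(list(wide.columns))
    .size()
    .reset_index(name='count')
)
print(seq_counts)
```
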

Note that I want to retain the End value, but if this causes problems, I am happy to abandon this and just have people stop at a step.

Building a Sankey in Plotly requires the data as parallel lists of source nodes, target nodes, and link values. Here is the example from the Plotly docs:

source = [0, 1, 0, 2, 3, 3]
target = [2, 3, 3, 4, 4, 5]
value = [8, 4, 2, 8, 4, 2]

I need to get my data into the format above, but for my entire dataframe.

If I were dealing with a small dataset like the toy one above, I could build these lists by hand. However, my dataframe has thousands of rows, and I don't know how to do this in pandas. It looks like some sort of window calculation over each student's steps.

I am also kinda confused about how to create the coding scheme for the nodes, because my failed attempts produce a viz that doesn't show a sequence of four courses.

Any help is much appreciated.

Toy data:

import numpy as np
import pandas as pd

student = ['Jerry','Jerry','Jerry','Jessy','Jessy','Jessy','Raphael','Raphael','Raphael','Raphael','Sally','Sally','Sally']
course = ['A','B','C','C','A','B','A','C','C','B','A','B','C']
order = [1,2,np.nan,1,2,3,1,2,3,4,1,2,np.nan]  # np.nan = missing order (np.NaN was removed in NumPy 2.0)
df = pd.DataFrame({'student':student, 'course':course,'order':order})
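
One possible way to get from this long table to the `source` / `target` / `value` lists is sketched below. It is not a definitive solution and rests on a few assumptions: rows with a missing `order` are dropped, students simply stop at their last step (no `End` node), and each node is labelled by course and step (for example `A (step 1)`) so that the same course at different positions stays a separate node. The column names `node` and `next_node` are made up for illustration.

```python
import numpy as np
import pandas as pd

student = ['Jerry', 'Jerry', 'Jerry', 'Jessy', 'Jessy', 'Jessy',
           'Raphael', 'Raphael', 'Raphael', 'Raphael', 'Sally', 'Sally', 'Sally']
course = ['A', 'B', 'C', 'C', 'A', 'B', 'A', 'C', 'C', 'B', 'A', 'B', 'C']
order = [1, 2, np.nan, 1, 2, 3, 1, 2, 3, 4, 1, 2, np.nan]
df = pd.DataFrame({'student': student, 'course': course, 'order': order})

# Assumption: rows without an order are not part of the sequence.
steps = (
    df.dropna(subset=['order'])
    .astype({'order': int})
    .sort_values(['student', 'order'])
)

# Label each node by course AND step so the diagram keeps one column per step.
steps = steps.assign(
    node=steps['course'] + ' (step ' + steps['order'].astype(str) + ')'
)

# The "window" part: pair every step with the same student's next step.
steps['next_node'] = steps.groupby('student')['node'].shift(-1)

# A student's last step has no next step; those rows produce no link.
links = (
    steps.dropna(subset=['next_node'])
    .groupby(['node', 'next_node'])
    .size()
    .reset_index(name='value')
)

# Integer coding scheme: every distinct label gets an index.
labels = sorted(set(links['node']) | set(links['next_node']))
code = {label: i for i, label in enumerate(labels)}

source = links['node'].map(code).tolist()
target = links['next_node'].map(code).tolist()
value = links['value'].tolist()

# These lists could then be passed to Plotly, e.g.
# go.Sankey(node=dict(label=labels),
#           link=dict(source=source, target=target, value=value))
```

Because the step number is part of the node label, a repeated course such as Raphael's `C` at steps 2 and 3 becomes a link between two distinct nodes rather than a self-loop.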
NLR
  • I thought it was a new question, but what is different about [this question](https://stackoverflow.com/questions/66285745/reformatting-dataset-for-sankey-in-pandas)? – r-beginners Feb 25 '21 at 02:35
  • @r-beginners: they are different questions that ask for different solutions. – NLR Feb 25 '21 at 17:55
  • Can you please explain what is different? – r-beginners Feb 26 '21 at 01:43
  • I suggest that you actually read the questions. Once you do, you'll see that they ask for different output tables. – NLR Feb 26 '21 at 20:05
  • Is it that 'End' has been added? Then just change it to `df.fillna('End',inplace=True)`. – r-beginners Feb 27 '21 at 02:10
  • No. If you read the question, you will see bold letters that specify the desired end result. "Building a Sankey in Plotly requires a data format with the source and target. Here is the example on the Plotly docs. source = [0, 1, 0, 2, 3, 3], target = [2, 3, 3, 4, 4, 5], value = [8, 4, 2, 8, 4, 2] I need to get my data into the format above, but for my entire dataframe." – NLR Feb 27 '21 at 17:20
  • Was it for Sankey diagrams? By the way, the previous question was about data tableting, is my answer acceptable? – r-beginners Feb 28 '21 at 02:17
