Creating edges and nodes from dataframe for sankey diagram

Question

I've been using Python for a short while to create simple graphs which work fine and I am reasonably confident in that area. But now I have moved on to Sankey diagrams and here I am on deep waters with regards to creating nodes and edges.

I have a list of cases with the following associated values:

a case ID key
the decision
if the decision was a rejection an indication if it was in the bottom, middle or top third
a binary indication if it is a resubmission or not, a foreign key to the associated case ID and a year of the previous case

CaseID	Year	Decision	Reject grade	Resubmission	PreviousApplicationId	Year of prev case
1	2019	Rejection	Top	Yes	434	2018
2	2019	Accepted		No
3	2019	Rejection	Bottom	No
4	2019	Accepted		Yes	632	2018
5	2019	Rejection	Middle	Yes	485	2018
6	2020	Accepted		No
7	2020	Rejection	Top	Yes	5	2019
8	2020	Accepted		No
9	2020	Rejection	Middle	Yes	3	2019
10	2020	Accepted		No

(The above table can be generated thus:

df = pd.DataFrame( {'CaseID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10}, 
'Year': {0: 2019, 1: 2019, 2: 2019, 3: 2019, 4: 2019, 5: 2020, 6: 2020, 7: 2020, 8: 2020, 9: 2020}, 
'Decision': {0: 'Rejection', 1: 'Accepted', 2: 'Rejection', 3: 'Accepted', 4: 'Rejection', 5: 'Accepted', 6: 'Rejection', 7: 'Accepted', 8: 'Rejection', 9: 'Accepted'}, 
'Reject grade': {0: 'Top', 1: float('nan'), 2: 'Bottom', 3: float('nan'), 4: 'Middle', 5: float('nan'), 6: 'Top', 7: float('nan'), 8: 'Middle', 9: float('nan')}, 
'Resubmission': {0: 'Yes', 1: 'No', 2: 'No', 3: 'Yes', 4: 'Yes', 5: 'No', 6: 'Yes', 7: 'No', 8: 'Yes', 9: 'No'}, 
'PreviousApplicationId': {0: 434.0, 1: float('nan'), 2: float('nan'), 3: 632.0, 4: 485.0, 5: float('nan'), 6: 5.0, 7: float('nan'), 8: 3.0, 9: float('nan')}, 
'Year of prev case': {0: 2018.0, 1: float('nan'), 2: float('nan'), 3: 2018.0, 4: 2018.0, 5: float('nan'), 6: 2019.0, 7: float('nan'), 8: 2019.0, 9: float('nan')}} )

What I am trying to achieve is to see the flow from a case is submitted: Does it get rejected or accepted. If it is rejected, does it end in the top/middle/bottom pile. Does this result in a later resubmission? Overlaying all this is a timeflow, so it is possible to see how this flows over time. Some cases, if they get rejected, are not resubmitted. Others will be resubmitted, not in the following year, but maybe three years later.

I realise I have grasped over a problem that exceeds my current level of Python understanding haha

I used the wrapper function described at https://medium.com/kenlok/how-to-create-sankey-diagrams-from-dataframes-in-python-e221c1b4d6b0 which generated a sankey. So far so good. But unfortunately, the year column is a variable like so many others, so it ends up being shown on the far left.

I would like for time instead to be shown as an overlay on top, so that the flow also depicts the temporal element.

Image of the Sankey generated by the wrapper function mentioned

Code below:

import numpy as np
import pandas as pd
import plotly

df = pd.read_excel("data.xlsx")

def genSankey(df,cat_cols=[],value_cols='',title='Sankey Diagram'):
    # maximum of 6 value cols -> 6 colors
    colorPalette = ['#4B8BBE','#306998','#FFE873','#FFD43B','#646464']
    labelList = []
    colorNumList = []
    for catCol in cat_cols:
        labelListTemp =  list(set(df[catCol].values))
        colorNumList.append(len(labelListTemp))
        labelList = labelList + labelListTemp
        
    # remove duplicates from labelList
    labelList = list(dict.fromkeys(labelList))
    
    # define colors based on number of levels
    colorList = []
    for idx, colorNum in enumerate(colorNumList):
        colorList = colorList + [colorPalette[idx]]*colorNum
        
    # transform df into a source-target pair
    for i in range(len(cat_cols)-1):
        if i==0:
            sourceTargetDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
            sourceTargetDf.columns = ['source','target','count']
        else:
            tempDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
            tempDf.columns = ['source','target','count']
            sourceTargetDf = pd.concat([sourceTargetDf,tempDf])
        sourceTargetDf = sourceTargetDf.groupby(['source','target']).agg({'count':'sum'}).reset_index()
        
    # add index for source-target pair
    sourceTargetDf['sourceID'] = sourceTargetDf['source'].apply(lambda x: labelList.index(x))
    sourceTargetDf['targetID'] = sourceTargetDf['target'].apply(lambda x: labelList.index(x))
    
    # creating the sankey diagram
    data = dict(
        type='sankey',
        node = dict(
          pad = 15,
          thickness = 20,
          line = dict(
            color = "black",
            width = 0.5
          ),
          label = labelList,
          color = colorList
        ),
        link = dict(
          source = sourceTargetDf['sourceID'],
          target = sourceTargetDf['targetID'],
          value = sourceTargetDf['count']
        )
      )
    
    layout =  dict(
        title = title,
        font = dict(
          size = 10
        )
    )
       
    fig = dict(data=[data], layout=layout)
    return fig

# Finanslovsaar = Year, Genansøgning = Resubmission, Afgørelse = Decision, Niveuakode = Grade
fig = genSankey(df,cat_cols=['Finanslovsaar', 'Genansøgning', 'Afgørelse', 'Niveaukode'],value_cols='count',title='Sankey Diagram')

plotly.offline.plot(fig, validate=False)
´´´

Creating edges and nodes from dataframe for sankey diagram

0 Answers0