0

I've been using Python for a short while to create simple graphs which work fine and I am reasonably confident in that area. But now I have moved on to Sankey diagrams and here I am on deep waters with regards to creating nodes and edges.

I have a list of cases with the following associated values:

  • a case ID key
  • the decision
  • if the decision was a rejection an indication if it was in the bottom, middle or top third
  • a binary indication if it is a resubmission or not, a foreign key to the associated case ID and a year of the previous case
CaseID Year Decision Reject grade Resubmission PreviousApplicationId Year of prev case
1 2019 Rejection Top Yes 434 2018
2 2019 Accepted No
3 2019 Rejection Bottom No
4 2019 Accepted Yes 632 2018
5 2019 Rejection Middle Yes 485 2018
6 2020 Accepted No
7 2020 Rejection Top Yes 5 2019
8 2020 Accepted No
9 2020 Rejection Middle Yes 3 2019
10 2020 Accepted No

(The above table can be generated thus:

df = pd.DataFrame( {'CaseID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10}, 
'Year': {0: 2019, 1: 2019, 2: 2019, 3: 2019, 4: 2019, 5: 2020, 6: 2020, 7: 2020, 8: 2020, 9: 2020}, 
'Decision': {0: 'Rejection', 1: 'Accepted', 2: 'Rejection', 3: 'Accepted', 4: 'Rejection', 5: 'Accepted', 6: 'Rejection', 7: 'Accepted', 8: 'Rejection', 9: 'Accepted'}, 
'Reject grade': {0: 'Top', 1: float('nan'), 2: 'Bottom', 3: float('nan'), 4: 'Middle', 5: float('nan'), 6: 'Top', 7: float('nan'), 8: 'Middle', 9: float('nan')}, 
'Resubmission': {0: 'Yes', 1: 'No', 2: 'No', 3: 'Yes', 4: 'Yes', 5: 'No', 6: 'Yes', 7: 'No', 8: 'Yes', 9: 'No'}, 
'PreviousApplicationId': {0: 434.0, 1: float('nan'), 2: float('nan'), 3: 632.0, 4: 485.0, 5: float('nan'), 6: 5.0, 7: float('nan'), 8: 3.0, 9: float('nan')}, 
'Year of prev case': {0: 2018.0, 1: float('nan'), 2: float('nan'), 3: 2018.0, 4: 2018.0, 5: float('nan'), 6: 2019.0, 7: float('nan'), 8: 2019.0, 9: float('nan')}} )

What I am trying to achieve is to see the flow from a case is submitted: Does it get rejected or accepted. If it is rejected, does it end in the top/middle/bottom pile. Does this result in a later resubmission? Overlaying all this is a timeflow, so it is possible to see how this flows over time. Some cases, if they get rejected, are not resubmitted. Others will be resubmitted, not in the following year, but maybe three years later.

I realise I have grasped over a problem that exceeds my current level of Python understanding haha

I used the wrapper function described at https://medium.com/kenlok/how-to-create-sankey-diagrams-from-dataframes-in-python-e221c1b4d6b0 which generated a sankey. So far so good. But unfortunately, the year column is a variable like so many others, so it ends up being shown on the far left.

I would like for time instead to be shown as an overlay on top, so that the flow also depicts the temporal element.

Image of the Sankey generated by the wrapper function mentioned

Code below:

import numpy as np
import pandas as pd
import plotly

df = pd.read_excel("data.xlsx")

def genSankey(df,cat_cols=[],value_cols='',title='Sankey Diagram'):
    # maximum of 6 value cols -> 6 colors
    colorPalette = ['#4B8BBE','#306998','#FFE873','#FFD43B','#646464']
    labelList = []
    colorNumList = []
    for catCol in cat_cols:
        labelListTemp =  list(set(df[catCol].values))
        colorNumList.append(len(labelListTemp))
        labelList = labelList + labelListTemp
        
    # remove duplicates from labelList
    labelList = list(dict.fromkeys(labelList))
    
    # define colors based on number of levels
    colorList = []
    for idx, colorNum in enumerate(colorNumList):
        colorList = colorList + [colorPalette[idx]]*colorNum
        
    # transform df into a source-target pair
    for i in range(len(cat_cols)-1):
        if i==0:
            sourceTargetDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
            sourceTargetDf.columns = ['source','target','count']
        else:
            tempDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
            tempDf.columns = ['source','target','count']
            sourceTargetDf = pd.concat([sourceTargetDf,tempDf])
        sourceTargetDf = sourceTargetDf.groupby(['source','target']).agg({'count':'sum'}).reset_index()
        
    # add index for source-target pair
    sourceTargetDf['sourceID'] = sourceTargetDf['source'].apply(lambda x: labelList.index(x))
    sourceTargetDf['targetID'] = sourceTargetDf['target'].apply(lambda x: labelList.index(x))
    
    # creating the sankey diagram
    data = dict(
        type='sankey',
        node = dict(
          pad = 15,
          thickness = 20,
          line = dict(
            color = "black",
            width = 0.5
          ),
          label = labelList,
          color = colorList
        ),
        link = dict(
          source = sourceTargetDf['sourceID'],
          target = sourceTargetDf['targetID'],
          value = sourceTargetDf['count']
        )
      )
    
    layout =  dict(
        title = title,
        font = dict(
          size = 10
        )
    )
       
    fig = dict(data=[data], layout=layout)
    return fig

# Finanslovsaar = Year, Genansøgning = Resubmission, Afgørelse = Decision, Niveuakode = Grade
fig = genSankey(df,cat_cols=['Finanslovsaar', 'Genansøgning', 'Afgørelse', 'Niveaukode'],value_cols='count',title='Sankey Diagram')

plotly.offline.plot(fig, validate=False)
´´´
Kasper
  • 3
  • 2

0 Answers0