I've been using Python for a short while to create simple graphs which work fine and I am reasonably confident in that area. But now I have moved on to Sankey diagrams and here I am on deep waters with regards to creating nodes and edges.
I have a list of cases with the following associated values:
- a case ID key
- the decision
- if the decision was a rejection an indication if it was in the bottom, middle or top third
- a binary indication if it is a resubmission or not, a foreign key to the associated case ID and a year of the previous case
CaseID | Year | Decision | Reject grade | Resubmission | PreviousApplicationId | Year of prev case |
---|---|---|---|---|---|---|
1 | 2019 | Rejection | Top | Yes | 434 | 2018 |
2 | 2019 | Accepted | No | |||
3 | 2019 | Rejection | Bottom | No | ||
4 | 2019 | Accepted | Yes | 632 | 2018 | |
5 | 2019 | Rejection | Middle | Yes | 485 | 2018 |
6 | 2020 | Accepted | No | |||
7 | 2020 | Rejection | Top | Yes | 5 | 2019 |
8 | 2020 | Accepted | No | |||
9 | 2020 | Rejection | Middle | Yes | 3 | 2019 |
10 | 2020 | Accepted | No |
(The above table can be generated thus:
df = pd.DataFrame( {'CaseID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'Year': {0: 2019, 1: 2019, 2: 2019, 3: 2019, 4: 2019, 5: 2020, 6: 2020, 7: 2020, 8: 2020, 9: 2020},
'Decision': {0: 'Rejection', 1: 'Accepted', 2: 'Rejection', 3: 'Accepted', 4: 'Rejection', 5: 'Accepted', 6: 'Rejection', 7: 'Accepted', 8: 'Rejection', 9: 'Accepted'},
'Reject grade': {0: 'Top', 1: float('nan'), 2: 'Bottom', 3: float('nan'), 4: 'Middle', 5: float('nan'), 6: 'Top', 7: float('nan'), 8: 'Middle', 9: float('nan')},
'Resubmission': {0: 'Yes', 1: 'No', 2: 'No', 3: 'Yes', 4: 'Yes', 5: 'No', 6: 'Yes', 7: 'No', 8: 'Yes', 9: 'No'},
'PreviousApplicationId': {0: 434.0, 1: float('nan'), 2: float('nan'), 3: 632.0, 4: 485.0, 5: float('nan'), 6: 5.0, 7: float('nan'), 8: 3.0, 9: float('nan')},
'Year of prev case': {0: 2018.0, 1: float('nan'), 2: float('nan'), 3: 2018.0, 4: 2018.0, 5: float('nan'), 6: 2019.0, 7: float('nan'), 8: 2019.0, 9: float('nan')}} )
What I am trying to achieve is to see the flow from a case is submitted: Does it get rejected or accepted. If it is rejected, does it end in the top/middle/bottom pile. Does this result in a later resubmission? Overlaying all this is a timeflow, so it is possible to see how this flows over time. Some cases, if they get rejected, are not resubmitted. Others will be resubmitted, not in the following year, but maybe three years later.
I realise I have grasped over a problem that exceeds my current level of Python understanding haha
I used the wrapper function described at https://medium.com/kenlok/how-to-create-sankey-diagrams-from-dataframes-in-python-e221c1b4d6b0 which generated a sankey. So far so good. But unfortunately, the year column is a variable like so many others, so it ends up being shown on the far left.
I would like for time instead to be shown as an overlay on top, so that the flow also depicts the temporal element.
Image of the Sankey generated by the wrapper function mentioned
Code below:
import numpy as np
import pandas as pd
import plotly
df = pd.read_excel("data.xlsx")
def genSankey(df,cat_cols=[],value_cols='',title='Sankey Diagram'):
# maximum of 6 value cols -> 6 colors
colorPalette = ['#4B8BBE','#306998','#FFE873','#FFD43B','#646464']
labelList = []
colorNumList = []
for catCol in cat_cols:
labelListTemp = list(set(df[catCol].values))
colorNumList.append(len(labelListTemp))
labelList = labelList + labelListTemp
# remove duplicates from labelList
labelList = list(dict.fromkeys(labelList))
# define colors based on number of levels
colorList = []
for idx, colorNum in enumerate(colorNumList):
colorList = colorList + [colorPalette[idx]]*colorNum
# transform df into a source-target pair
for i in range(len(cat_cols)-1):
if i==0:
sourceTargetDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
sourceTargetDf.columns = ['source','target','count']
else:
tempDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
tempDf.columns = ['source','target','count']
sourceTargetDf = pd.concat([sourceTargetDf,tempDf])
sourceTargetDf = sourceTargetDf.groupby(['source','target']).agg({'count':'sum'}).reset_index()
# add index for source-target pair
sourceTargetDf['sourceID'] = sourceTargetDf['source'].apply(lambda x: labelList.index(x))
sourceTargetDf['targetID'] = sourceTargetDf['target'].apply(lambda x: labelList.index(x))
# creating the sankey diagram
data = dict(
type='sankey',
node = dict(
pad = 15,
thickness = 20,
line = dict(
color = "black",
width = 0.5
),
label = labelList,
color = colorList
),
link = dict(
source = sourceTargetDf['sourceID'],
target = sourceTargetDf['targetID'],
value = sourceTargetDf['count']
)
)
layout = dict(
title = title,
font = dict(
size = 10
)
)
fig = dict(data=[data], layout=layout)
return fig
# Finanslovsaar = Year, Genansøgning = Resubmission, Afgørelse = Decision, Niveuakode = Grade
fig = genSankey(df,cat_cols=['Finanslovsaar', 'Genansøgning', 'Afgørelse', 'Niveaukode'],value_cols='count',title='Sankey Diagram')
plotly.offline.plot(fig, validate=False)
´´´