Optimal way to take a network of nodes and interpolate missing values

Question

I have an example data frame below:

import numpy as np
import pandas as pd

data = [
    [1, 2, 100, 4342],
    [3, 4, 100, 999],
    [5, 6, 500, 4339],
    [4, 5, 300, 999],
    [12, 13, 100, 4390],
    [6, 7, 600, 4335],
    [2, 3, 200, 4341],
    [10,11, 100, 4400],
    [11,12, 200, 999],
    [7, 8, 200, 4332]
]
df = pd.DataFrame(data, columns = ['Node','Dwn_Node', 'Dwn_Length','Elevation'])
df = df.replace(999, np.nan)

Where the Node column describes the name of the current node and Dwn_Node describes the name of the node 'down stream'. Elevation describes the elevation of the current node and Dwn_Length describes the length to the 'down stream' node. I am really not sure of the best way to complete this, but the goal would be to interpolate the missing values using slope. I am thinking there might be a function or better capability in networkx but am very unfamiliar with that library.

The above data set is an example data set but is accurate in that the node order is out of place.

One way I thought of would be to separate the previous and subsequent nodes of the unknown nodes i.e.

data1 = [
    [12, 13, 100, 4390],
    [10,11, 100, 4400],
    [11,12, 200, 999]
]

Calculate the slope from data1 by dividing the difference in elevation between nodes 10 and 12 by the sum of the Dwn_Length values of nodes 10 and 11, then apply that slope over the Dwn_Length of node 10 to interpolate the elevation of node 11. This seems very tedious for a data set that has many runs of missing node values within a network, though.
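To make the hand calculation concrete, here is a minimal sketch for the 10 -> 11 -> 12 stretch only. It assumes the reading given above: the slope is taken between the two known nodes and applied over the Dwn_Length of node 10. The variable names are illustrative, not part of the data frame.

```python
# Known elevations at the two ends of the gap (from data1).
elev_10, elev_12 = 4400.0, 4390.0

# Dwn_Length of node 10 (to node 11) and of node 11 (to node 12).
len_10_11, len_11_12 = 100.0, 200.0

# Slope = elevation change / total length between the known nodes.
slope = (elev_12 - elev_10) / (len_10_11 + len_11_12)

# Node 11 sits len_10_11 downstream of node 10.
elev_11 = elev_10 + slope * len_10_11
print(round(elev_11, 3))
```

Under that reading, node 11 comes out at roughly 4396.667.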

Corralien · Answer 1 · 2023-06-01T21:43:14.310

You can probably improve the speed of this process, but it should work:

import networkx as nx
import numpy as np

# Create network from your dataframe
G = nx.from_pandas_edgelist(df, source='Node', target='Dwn_Node',
                            edge_attr='Dwn_Length', create_using=nx.DiGraph)
# nx.set_node_attributes(G, df.set_index('Node')[['Elevation']].to_dict())

# For each subgraph
for nbunch in nx.connected_components(G.to_undirected()):
    H = nx.subgraph(G, nbunch)
    roots = [n for n, d in H.in_degree if d == 0]
    leaves = [n for n, d in H.out_degree if d == 0]

    for root in roots:
        for leaf in leaves:
            for path in nx.all_simple_paths(H, root, leaf):
    
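                # Extract the rows on this path and put them in path order.
                # (np.searchsorted only works if the names in `path` are sorted,
                # so map each node to its position in the path instead.)
                order = {node: i for i, node in enumerate(path)}
                df1 = df[df['Node'].isin(path)].sort_values(
                    'Node', key=lambda s: s.map(order))
                df1['Distance'] = df1['Dwn_Length'].cumsum()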

                # Piecewise linear interpolation
                m = df1['Elevation'].isna()
                x = df1.loc[m, 'Distance']
                xp = df1.loc[~m, 'Distance']
                yp = df1.loc[~m, 'Elevation']
                if xp.empty:
                    continue  # no known elevation on this path
                y = np.interp(x, xp, yp)
                df1.loc[m, 'Elevation'] = y
    
                # Update missing values
                df['Elevation'] = df['Elevation'].fillna(df1['Elevation'])
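As a variation on the loop above, here is a hedged sketch that does not rely on node names being sortable and only fills gaps that lie between two known points. It rests on assumptions that are mine, not the original answer's: distance is measured from the root (the root sits at 0 and each edge adds its Dwn_Length), the network has no loops, and where paths rejoin the first path that reaches a node wins. The names `interpolate_along_paths`, `known`, and `filled` are hypothetical.

```python
import networkx as nx
import numpy as np
import pandas as pd


def interpolate_along_paths(df):
    """Fill Elevation gaps that lie between two known nodes on a path."""
    G = nx.from_pandas_edgelist(df, source="Node", target="Dwn_Node",
                                edge_attr="Dwn_Length", create_using=nx.DiGraph)
    # Known elevation per node name (NaN when missing).
    known = df.groupby("Node")["Elevation"].first().to_dict()
    filled = dict(known)

    roots = [n for n, d in G.in_degree if d == 0]
    leaves = [n for n, d in G.out_degree if d == 0]
    for root in roots:
        for leaf in leaves:
            for path in nx.all_simple_paths(G, root, leaf):
                # Assumption: distance is 0 at the root, then grows by the
                # length of each edge walked so far.
                steps = [G[u][v]["Dwn_Length"] for u, v in zip(path, path[1:])]
                dist = np.concatenate(([0.0], np.cumsum(steps, dtype=float)))
                elev = np.array([known.get(n, np.nan) for n in path], dtype=float)
                ok = ~np.isnan(elev)
                if ok.sum() < 2:
                    continue  # no pair of known points to interpolate between
                est = np.interp(dist, dist[ok], elev[ok])
                lo, hi = dist[ok].min(), dist[ok].max()
                for n, d, e in zip(path, dist, est):
                    # Skip nodes outside the known range (no extrapolation)
                    # and nodes already filled from an earlier path.
                    if lo < d < hi and n in filled and np.isnan(filled[n]):
                        filled[n] = float(e)

    out = df.copy()
    out["Elevation"] = out["Node"].map(filled)
    return out


# The three rows from data1 in the question, deliberately out of order.
demo = pd.DataFrame(
    [[12, 13, 100, 4390], [10, 11, 100, 4400], [11, 12, 200, np.nan]],
    columns=["Node", "Dwn_Node", "Dwn_Length", "Elevation"],
)
result = interpolate_along_paths(demo)
print(result)
```

With distance measured this way, node 11 comes out at about 4396.67, which follows the slope calculation described in the question. A distance built from a cumulative sum that includes each node's own Dwn_Length places the nodes differently and gives a different value, so it is worth checking which convention matches your data.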

IIUC, I'm not sure you need networkx for this task. You need interpolation to fill the missing values, but not a plain linear fill by row position, because the points are not evenly spaced. Instead, use piecewise linear interpolation against distance; numpy provides the interp function for this. You do have to prepare the input dataframe first: sort it by node and compute the cumulative sum of Dwn_Length. After that you have all the x (Distance) and y (Elevation) values needed to interpolate the missing elevations:

import numpy as np
import matplotlib.pyplot as plt

# Preparation
df1 = df.sort_values('Node').assign(Distance=lambda x: x['Dwn_Length'].cumsum())

# Piecewise linear interpolation
m = df1['Elevation'].isna()
x = df1.loc[m, 'Distance']
xp = df1.loc[~m, 'Distance']
yp = df1.loc[~m, 'Elevation']
y = np.interp(x, xp, yp)

# Visualization
df1.loc[m, 'Elevation'] = y
df1.plot(x='Distance', y='Elevation', ylabel='Elevation', marker='o', legend=False)
plt.show()

Output:

>>> df1
   Node  Dwn_Node  Dwn_Length    Elevation  Distance
0     1         2         100  4342.000000       100
6     2         3         200  4341.000000       300
1     3         4         100  4340.777778       400
3     4         5         300  4340.111111       700
2     5         6         500  4339.000000      1200
5     6         7         600  4335.000000      1800
9     7         8         200  4332.000000      2000
7    10        11         100  4400.000000      2100
8    11        12         200  4393.333333      2300
4    12        13         100  4390.000000      2400

Because the index has not changed from df to df1, you can fill the missing values in df directly from df1:

df['Elevation'] = df['Elevation'].fillna(df1['Elevation'])

Output:

>>> df
   Node  Dwn_Node  Dwn_Length    Elevation
0     1         2         100  4342.000000
1     3         4         100  4340.777778
2     5         6         500  4339.000000
3     4         5         300  4340.111111
4    12        13         100  4390.000000
5     6         7         600  4335.000000
6     2         3         200  4341.000000
7    10        11         100  4400.000000
8    11        12         200  4393.333333
9     7         8         200  4332.000000
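The same distance-weighted fill can probably be written with pandas alone, since `Series.interpolate(method="index")` interpolates against the numeric index rather than the row position. The sketch below uses four rows taken from the df1 output above (nodes 2 to 5) and assumes Distance has already been computed and is increasing; the names `s`, `by_distance`, and `by_position` are illustrative.

```python
import numpy as np
import pandas as pd

# Elevation indexed by Distance for nodes 2, 3, 4, 5 (values from df1).
s = pd.Series(
    [4341.0, np.nan, np.nan, 4339.0],
    index=pd.Index([300, 400, 700, 1200], name="Distance"),
    name="Elevation",
)

# Weighted by the Distance values in the index.
by_distance = s.interpolate(method="index")

# Default behaviour: treats the rows as equally spaced.
by_position = s.interpolate()

print(by_distance)
print(by_position)
```

Here `by_distance` reproduces the 4340.777778 and 4340.111111 shown for nodes 3 and 4, while `by_position` gives different values because it ignores the uneven spacing.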

It works because Node and Dwn_Node monotonically increase (1->2, 2->3, 3->4, etc). If your nodes have different names, you need to find a way to order them correctly. The rest of the method will still work. In this case, `networkx` can help you to find easily the ends of the directed graph. — Corralien, Jun 01 '23 at 06:31
In the case outside of the example dataset my ```Node``` and ```Dwn_Node``` do not monotonically increase ie (B7-D26 -> D09013, D09013 -> SD4902, SD4902 -> D028081) but ordering those is a question in and of itself. — rweber, Jun 01 '23 at 15:28
OK. I will update my answer later with `networkx` to find root and leaf. Do you have some case where you have A -> B, B -> C, B -> D? — Corralien, Jun 01 '23 at 18:06
Yes, there are cases where the network splits and or joins back but never loops. — rweber, Jun 01 '23 at 18:33
That worked well save for some extraneous values. I think I need to do some QA/QC on the invert data. The script did run into a ```ValueError: array of sample points is empty```. There are some instances where a path of the network has one or none known points which I think is what is causing this error. — rweber, Jun 06 '23 at 18:50
Yes, you're right. In this case, you just need to check if `xp` is empty to avoid this error. Do you have any other points that are not clear enough? — Corralien, Jun 06 '23 at 19:01
I integrated this using: ```if len(xp) == 0: continue ``` I need to look into what is causing the extraneous values. — rweber, Jun 06 '23 at 19:17
Or `if sum(m) == len(df1): continue` :) Obviously, the code above try to do an interpolation so this code doesn't work for extrapolation (when NaNs are on the edge) — Corralien, Jun 06 '23 at 19:22
It would be nice to share the base data in .csv or something. Ive got it plotted out in 3D and its really puzzling where these extraneous values are coming from. — rweber, Jun 06 '23 at 19:28
What's wrong with your plot? You can use wetransfer or dropbox or google drive or github to share your code and your data. I'll take a look tomorrow if I have time. — Corralien, Jun 06 '23 at 19:47
It looks as though the interpolation scheme fails if there is more than one point missing elevation values between two known values within a network path. — rweber, Jun 06 '23 at 20:04
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/253975/discussion-between-rweber-and-corralien). — rweber, Jun 06 '23 at 20:20
I have yet to find a solution I did post a link to the data in the chat though. — rweber, Jun 13 '23 at 16:14
