I have a graph with the main data points (blue line) and the maxima (green) and minima (red).

Note that the x-values of the minima and maxima values are not the same, nor are they guaranteed to have the same count of values.

Now my goal is to determine when the distance along the y-axis (integral? sorry, it's been a while since calculus in uni) between the maxima and minima lines drops below 10% (or any other arbitrary threshold) of the average distance along the y-axis.

Here is the code used to generate:

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import argrelextrema

# Finding the min and max
c_max_index = argrelextrema(df.flow.values, np.greater, order=3)
c_min_index = argrelextrema(df.flow.values, np.less, order=3)

df['min_extreme'] = df.flow[c_min_index[0]]
df['max_extreme'] = df.flow[c_max_index[0]]

# Plotting the values for the graph above
plt.plot(df.flow.values)
upper_bound = plt.plot(c_max_index[0], df.flow.values[c_max_index[0]], linewidth=0.8, c='g')
lower_bound = plt.plot(c_min_index[0], df.flow.values[c_min_index[0]], linewidth=0.8, c='r')

If it makes a difference, I'm using a Pandas Dataframe, scipy, and matplotlib.

Andrew Graham-Yooll
  • Could you please add some data? – mortysporty May 30 '19 at 15:19
  • Can you define "the average distance along the y-axis"? What about the leftmost (rightmost) part of min_extreme (max_extreme) which has no counterparty on the max_extreme (min_extreme)? Do you simply want to ignore? – Yi Bao May 30 '19 at 15:27
  • @YiBao Yes, those can be ignored, as the dataset is actually much larger and will be trimmed evenly. Average distance can be defined as: (the distance from the green line - red line) for every x-value on the blue line / total number of points on blue line – Andrew Graham-Yooll May 30 '19 at 15:35

3 Answers

If I understand your question correctly, you basically want to interpolate the lines defined by the extreme values. Borrowing from the answer to "Interpolate NaN values in a numpy array", you can do this:

# Finding the min and max
c_max_index = argrelextrema(df.flow.values, np.greater, order=3)
c_min_index = argrelextrema(df.flow.values, np.less, order=3)

df['min_extreme'] = df.flow[c_min_index[0]]
df['max_extreme'] = df.flow[c_max_index[0]]

# Interpolate so you get no 'nan' values
df['min_extreme'] = df['min_extreme'].interpolate()
df['max_extreme'] = df['max_extreme'].interpolate() 

From here it should be easy to do all kinds of things with the distances between the two lines. For instance:

# Get the average distance between the upper and lower extrema-lines
df['distance'] = df['max_extreme'] - df['min_extreme']
avg_dist = np.mean(df['distance'])

# Find indexes where the distance is below a chosen fraction of the average
# (0.95 here; use 0.10 for the 10% threshold from the question)
df.index[df['distance']< avg_dist * .95]
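
To see the whole approach run end to end, here is a minimal sketch on synthetic data. The decaying sine wave is only a stand-in for df.flow (an assumption, since no data was posted), and the names `fraction` and `first_below` are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

# Synthetic stand-in for df.flow (assumption): a decaying oscillation
t = np.arange(400)
df = pd.DataFrame({"flow": np.exp(-t / 80.0) * np.sin(t / 5.0)})

values = df["flow"].to_numpy()
max_idx = argrelextrema(values, np.greater, order=3)[0]
min_idx = argrelextrema(values, np.less, order=3)[0]

# Extrema at their own rows, NaN elsewhere, then linear interpolation
df["max_extreme"] = df["flow"].iloc[max_idx]
df["min_extreme"] = df["flow"].iloc[min_idx]
df["max_extreme"] = df["max_extreme"].interpolate()
df["min_extreme"] = df["min_extreme"].interpolate()

df["distance"] = df["max_extreme"] - df["min_extreme"]
avg_dist = df["distance"].mean()  # rows with NaN distance are skipped

fraction = 0.10  # the 10% threshold from the question
below = df["distance"] < fraction * avg_dist

# Index label of the first row below the threshold, or None if there is none
first_below = below.idxmax() if below.any() else None
print(first_below)
```

On a boolean Series, idxmax returns the label of the first True, which is why it is guarded by below.any().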
mortysporty

This is by no means a perfect solution. It aims to give you some ideas of how it can be done, since no sample data was provided.

The major problem you are trying to solve is dealing with two piecewise-linear lines whose pieces do not align. An obvious solution is to interpolate both lines onto the union of their x-values. Then calculating the distances is easier.

import numpy as np
import matplotlib.pyplot as plt

# Toy data
x1 = [0, 1, 2, 3, 4, 5, 6]
y1 = [9, 8, 9, 10, 7, 6, 9]
x2 = [0.5, 3, 5, 6, 9]
y2 = [0, 1, 3, 2, 1]

# Interpolation for both lines
points1 = list(zip(x1, y1))
y1_interp = np.interp(x2, x1, y1)
interp_points1 = list(zip(x2, y1_interp))
l1 = list(set(points1 + interp_points1))
all_points1 = sorted(l1, key = lambda x: x[0])

points2 = list(zip(x2, y2))
y2_interp = np.interp(x1, x2, y2)
interp_points2 = list(zip(x1, y2_interp))
l2 = list(set(points2 + interp_points2))
all_points2 = sorted(l2, key = lambda x: x[0])

assert(len(all_points1) == len(all_points2))

# Since I do not have data points on the blue line, 
# I will calculate the average distance based on x's of all interpolated points
sum_d = 0
for i in range(len(all_points1)):
    sum_d += all_points1[i][1] - all_points2[i][1]
avg_d = sum_d / len(all_points1)
threshold = 0.5
d_threshold = avg_d * threshold

for i in range(len(all_points1)):
    d = all_points1[i][1] - all_points2[i][1]
    if d / avg_d < threshold:
        print("Distance below threshold between", all_points1[i], "and", all_points2[i])

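Notice that np.interp does not really extrapolate: for x-values outside the known range it returns the y-value of the nearest endpoint. In the toy data this affects x=0 on the second line and x=9 on the first, and those points do enter the average above, so trim them if they should not count.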
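
A shorter variant of the same idea, as a rough sketch: build the common grid with np.union1d and keep only the x-range where both lines are actually defined, so that the endpoint values np.interp returns outside the data range never enter the average. The names `grid`, `lo` and `hi` are made up here, and the average is still an unweighted mean over the grid points, as in the loop above.

```python
import numpy as np

# Same toy data as above
x1 = np.array([0, 1, 2, 3, 4, 5, 6], dtype=float)
y1 = np.array([9, 8, 9, 10, 7, 6, 9], dtype=float)
x2 = np.array([0.5, 3, 5, 6, 9], dtype=float)
y2 = np.array([0, 1, 3, 2, 1], dtype=float)

# x=9 lies past the end of x1, so np.interp repeats the last y1 value
clamped = np.interp(9.0, x1, y1)

# Common grid: union of both x sets, restricted to the overlapping range
lo = max(x1[0], x2[0])
hi = min(x1[-1], x2[-1])
grid = np.union1d(x1, x2)
grid = grid[(grid >= lo) & (grid <= hi)]

upper = np.interp(grid, x1, y1)
lower = np.interp(grid, x2, y2)
dist = upper - lower
avg_d = dist.mean()

threshold = 0.5
below_x = grid[dist < threshold * avg_d]
print(below_x)
```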

Now there is a remaining question: if you need to know where the distance falls below the threshold between the interpolated points, and not only at them, you have to solve analytically for the crossing point within each line segment. Here is a sample:

for i in range(len(all_points1) - 1):
    (pre_x1, pre_y1) = all_points1[i]
    (post_x1, post_y1) = all_points1[i + 1]
    (pre_x2, pre_y2) = all_points2[i]
    (post_x2, post_y2) = all_points2[i + 1]
    # Skip the pieces that will never have qualified points
    if (pre_y1 - pre_y2) / avg_d >= threshold and (post_y1 - post_y2) / avg_d >= threshold:
        continue
    k1 = (post_y1 - pre_y1) / (post_x1 - pre_x1)
    b1 = (post_x1 * pre_y1 - pre_x1 * post_y1) / (post_x1 - pre_x1)
    k2 = (post_y2 - pre_y2) / (post_x2 - pre_x2)
    b2 = (post_x2 * pre_y2 - pre_x2 * post_y2) / (post_x2 - pre_x2)
    x_start = (d_threshold - b1 + b2) / (k1 - k2)
    print("The first point where the distance falls below threshold is at x=", x_start)
    break
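
The loop above divides by k1 - k2, which fails when the two segments are parallel, and it does not check that the solution lies inside the segment. A hedged alternative, assuming both segments share the same x endpoints (which holds after interpolating onto the union of x-values): the distance is then linear in x, so the crossing can be found from the two endpoint distances alone. The helper name `crossing_x` is made up for this sketch.

```python
def crossing_x(p1_a, p1_b, p2_a, p2_b, d_threshold):
    """Return the x inside the segment where line1 - line2 equals
    d_threshold, or None if there is no such x.

    p1_a, p1_b are the (x, y) endpoints of the upper segment and
    p2_a, p2_b those of the lower one; both must share the same x's.
    """
    (xa, y1a), (xb, y1b) = p1_a, p1_b
    (_, y2a), (_, y2b) = p2_a, p2_b
    da = y1a - y2a  # distance at the left end
    db = y1b - y2b  # distance at the right end
    if da == db:
        # Parallel segments: the distance is constant, no single crossing
        return None
    frac = (d_threshold - da) / (db - da)
    if not 0.0 <= frac <= 1.0:
        # The crossing would lie outside this segment
        return None
    return xa + frac * (xb - xa)


# Toy segment from x=4 to x=5: the distance goes from 5 down to 3
print(crossing_x((4, 7), (5, 6), (4, 2), (5, 3), 3.5))
```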
Yi Bao

Your problem is that min_extreme and max_extreme are not aligned/defined at every x-value. We can solve it by interpolating:

# this will interpolate values linearly, i.e data on the upper and lower lines
df = df.interpolate()

# vertical distance between upper and lower lines:
df['dist'] = df.max_extreme - df.min_extreme

# thresholding, thresh can be scalar or series
# thresh = 0.5 -- absolute value
# thresh = df.max_extreme / 2 -- relative to the current max_extreme

thresh = df.dist.quantile(0.5)  # median distance; rows above it exceed 50% of the distances

df['too_far'] = df.dist.gt(thresh)

# visualize:
tmp_df = df[df.too_far]

upper_bound = plt.plot(c_max_index[0], df.flow.values[c_max_index[0]], linewidth=0.8, c='g')
lower_bound = plt.plot(c_min_index[0], df.flow.values[c_min_index[0]], linewidth=0.8, c='r')

df.flow.plot()

plt.scatter(tmp_df.index, tmp_df.min_extreme, s=10)
plt.scatter(tmp_df.index, tmp_df.max_extreme, s=10)
plt.show()

Output:

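(plot not preserved: the flow series with the green and red extrema lines, plus scatter markers on both lines where too_far is True)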
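
One thing to watch with interpolate() at the edges, shown as a small sketch on a made-up Series: by default the NaNs after the last known value are filled with that last value, while the NaNs before the first known value stay NaN. Passing limit_area="inside" fills only between known values, which matches the comment on the question that the unmatched ends can be ignored.

```python
import numpy as np
import pandas as pd

# Made-up series: known values at positions 1 and 3 only
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan, np.nan])

default = s.interpolate()                    # trailing NaNs padded with 3.0
inside = s.interpolate(limit_area="inside")  # trailing NaNs left as NaN

print(default.tolist())
print(inside.tolist())
```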

Quang Hoang