I have spent some time answering How do I discretize a continuous function avoiding noise generation (see picture), and throughout, I felt like I was reinventing a bike.
Essentially, the problem is:
- You are given a curve function - for any
x
, you can obtainy
. - You want to approximate the curve using a piecewise-linear function with exactly
N
points, based on some error metric, e.g. distance to the curve, or minimize the absolute difference of the area under the curves (thanks to @QuangHoang for pointing out these are different).
Here's an example of a curve I approximated using 20 points:
Question: I've coded this up using repeated bisections. Is there a library I could have used? Is there a nice term of this problem type that I failed to google out? Does this generalize to a broader problem set?
Edit: upon request, here's how I've done it: Google Colab
Data:
import numpy as np
from scipy.signal import gaussian
N_MOCK = 2000
# A nice-ish mock distribution
xs = np.linspace(-10.0, 10.0, num=N_MOCK)
sigmoid = 1 / (1 + np.exp(-xs))
gauss = gaussian(N_MOCK, std=N_MOCK / 10)
ys = gauss - sigmoid + 1
xs += 10
xs /= 20
Plotting:
import matplotlib.pyplot as plt
def plot_graph(cont_time, cont_array, disc_time, disc_array, plot_name):
"""A simplified version of the provided plotting function"""
# Setting Axis properties and titles
fig, ax = plt.subplots(figsize=(20, 4))
ax.set_title(plot_name)
# Plotting stuff
ax.plot(cont_time, cont_array, label="Continuous", color='#0000ff')
ax.plot(disc_time, disc_array, label="Discrete", color='#00ff00')
fig.legend(loc="upper left", bbox_to_anchor=(0,1), bbox_transform=ax.transAxes)
Here's how I solved it, but I hope there's a more standard way:
import warnings
warnings.simplefilter('ignore', np.RankWarning)
def line_error(x0, y0, x1, y1, ideal_line, integral_points=100):
"""Assume a straight line between (x0,y0)->(x1,p1). Then sample the perfect line multiple times and compute the distance."""
straight_line = np.poly1d(np.polyfit([x0, x1], [y0, y1], 1))
xs = np.linspace(x0, x1, num=integral_points)
ys = straight_line(xs)
perfect_ys = ideal_line(xs)
err = np.abs(ys - perfect_ys).sum() / integral_points * (x1 - x0) # Remove (x1 - x0) to only look at avg errors
return err
def discretize_bisect(xs, ys, bin_count):
"""Returns xs and ys of discrete points"""
# For a large number of datapoints, without loss of generality you can treat xs and ys as bin edges
# If it gives bad results, you can edges in many ways, e.g. with np.polyline or np.histogram_bin_edges
ideal_line = np.poly1d(np.polyfit(xs, ys, 50))
new_xs = [xs[0], xs[-1]]
new_ys = [ys[0], ys[-1]]
while len(new_xs) < bin_count:
errors = []
for i in range(len(new_xs)-1):
err = line_error(new_xs[i], new_ys[i], new_xs[i+1], new_ys[i+1], ideal_line)
errors.append(err)
max_segment_id = np.argmax(errors)
new_x = (new_xs[max_segment_id] + new_xs[max_segment_id+1]) / 2
new_y = ideal_line(new_x)
new_xs.insert(max_segment_id+1, new_x)
new_ys.insert(max_segment_id+1, new_y)
return new_xs, new_ys
Run:
BIN_COUNT = 25
new_xs, new_ys = discretize_bisect(xs, ys, BIN_COUNT)
plot_graph(xs, ys, new_xs, new_ys, f"Discretized and Continuous comparison, N(cont) = {N_MOCK}, N(disc) = {BIN_COUNT}")
print("Bin count:", len(new_xs))
Note: while I prefer numpy
, the answer can be a library in any language, or the name of the mathematical term. Please do not write lots of code, as I have done that myself already :)