
I have spent some time answering *How do I discretize a continuous function avoiding noise generation (see picture)*, and throughout, I felt like I was reinventing the wheel.

Essentially, the problem is:

  • You are given a curve function: for any x, you can obtain y.
  • You want to approximate the curve by a piecewise-linear function with exactly N points, based on some error metric, e.g. the distance to the curve, or the absolute difference of the area under the curves (thanks to @QuangHoang for pointing out these are different; a toy illustration follows this list).
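
To see that these two metrics really differ, here's a toy example (mine, not part of the original question): over one full sine period, the chord through the endpoints has near-zero area error, because the two half-waves cancel, yet a large average distance error:

import numpy as np

# Standalone toy curve: y = sin(2*pi*t) on [0, 1]; the chord through its endpoints is y = 0
t = np.linspace(0.0, 1.0, num=1001)
dt = t[1] - t[0]
curve = np.sin(2 * np.pi * t)
chord = np.zeros_like(t)

mean_abs_dist = np.abs(chord - curve).mean()              # ~0.64: large distance error
area_diff = abs((chord * dt).sum() - (curve * dt).sum())  # ~0.0: the half-waves cancel

print(mean_abs_dist, area_diff)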

Here's an example of a curve I approximated using 20 points: [image: the continuous curve and its 20-point piecewise-linear approximation]

Question: I've coded this up using repeated bisections. Is there a library I could have used? Is there a nice term for this type of problem that I failed to google? Does this generalize to a broader problem set?


Edit: upon request, here's how I've done it: Google Colab

Data:

import numpy as np
from scipy.signal.windows import gaussian  # formerly scipy.signal.gaussian, moved to the windows namespace

N_MOCK = 2000

# A nice-ish mock distribution
xs = np.linspace(-10.0, 10.0, num=N_MOCK)
sigmoid = 1 / (1 + np.exp(-xs))
gauss = gaussian(N_MOCK, std=N_MOCK / 10)
ys = gauss - sigmoid + 1
xs += 10
xs /= 20

Plotting:

import matplotlib.pyplot as plt


def plot_graph(cont_time, cont_array, disc_time, disc_array, plot_name):
    """A simplified version of the provided plotting function"""
    
    # Setting Axis properties and titles
    fig, ax = plt.subplots(figsize=(20, 4))
    ax.set_title(plot_name)

    # Plotting stuff
    ax.plot(cont_time, cont_array, label="Continuous", color='#0000ff')
    ax.plot(disc_time, disc_array, label="Discrete",   color='#00ff00')

    fig.legend(loc="upper left", bbox_to_anchor=(0,1), bbox_transform=ax.transAxes)

Here's how I solved it, but I hope there's a more standard way:

import warnings
warnings.simplefilter('ignore', np.exceptions.RankWarning)  # np.RankWarning before NumPy 1.25


def line_error(x0, y0, x1, y1, ideal_line, integral_points=100):
    """Assume a straight line between (x0,y0) and (x1,y1). Sample the ideal curve at integral_points positions and accumulate the absolute distance."""
    straight_line = np.poly1d(np.polyfit([x0, x1], [y0, y1], 1))
    xs = np.linspace(x0, x1, num=integral_points)
    ys = straight_line(xs)

    perfect_ys = ideal_line(xs)
    
    err = np.abs(ys - perfect_ys).sum() / integral_points * (x1 - x0)  # Remove (x1 - x0) to only look at avg errors
    return err


def discretize_bisect(xs, ys, bin_count):
    """Returns xs and ys of discrete points"""
    # For a large number of datapoints you can, without loss of generality, treat xs and ys as bin edges.
    # If that gives bad results, edges can be computed in other ways, e.g. with np.polyfit or np.histogram_bin_edges.
    ideal_line = np.poly1d(np.polyfit(xs, ys, 50))  # high-degree polynomial as a smooth stand-in for the true curve
    
    new_xs = [xs[0], xs[-1]]
    new_ys = [ys[0], ys[-1]]
    
    while len(new_xs) < bin_count:
        
        errors = []
        for i in range(len(new_xs)-1):
            err = line_error(new_xs[i], new_ys[i], new_xs[i+1], new_ys[i+1], ideal_line)
            errors.append(err)

        max_segment_id = np.argmax(errors)
        new_x = (new_xs[max_segment_id] + new_xs[max_segment_id+1]) / 2
        new_y = ideal_line(new_x)
        new_xs.insert(max_segment_id+1, new_x)
        new_ys.insert(max_segment_id+1, new_y)

    return new_xs, new_ys

Run:

BIN_COUNT = 25

new_xs, new_ys = discretize_bisect(xs, ys, BIN_COUNT)

plot_graph(xs, ys, new_xs, new_ys, f"Discretized and Continuous comparison, N(cont) = {N_MOCK}, N(disc) = {BIN_COUNT}")
print("Bin count:", len(new_xs))

Note: while I prefer numpy, the answer can be a library in any language, or the name of the mathematical term. Please do not write lots of code, as I have done that myself already :)

Morton
  • can you provide a reproducible example? – mozway Dec 13 '21 at 16:43
  • I remember someone using a decision tree to approximate a curve; the interval ends up partitioned into leaves which are subintervals, and the value in each sub-interval is the average value over the interval. Upside: you can rely on a library like sklearn, no need to code it yourself (a minimal sketch appears after these comments). Downside: the result is a step-function (i.e., a non-continuous piecewise constant function), not a continuous piecewise affine function. – Stef Dec 13 '21 at 16:45
  • Unrelated: I have difficulty seeing that *minimizing distance to the curve* is the same as *minimizing the absolute difference of the area under the curve*. In minimizing the difference of the area under the curve, you may be looking for a [Riemann sum](https://en.wikipedia.org/wiki/Riemann_sum). – Quang Hoang Dec 13 '21 at 16:51
  • @QuangHoang, thanks, you're right that I'm not rigorous enough here. As far as I understand, I've used a form of Riemann sum to estimate the error under the curve, but I am more concerned with (hmm, how do I phrase it) selecting a specific number of points to approximate the curve, based on some error measure (e.g. Riemann sums). – Morton Dec 13 '21 at 16:57
  • Not just the number of points, but also how to distribute the points in the interval. Where the curve has a fast-changing slope, you might need lots of points; where the curve looks like a linear function, you need very few points. – Stef Dec 13 '21 at 17:27
  • The usual *approach* is some sort of spline, but that’s not a name for the problem itself. – Davis Herring Dec 13 '21 at 17:56
  • Thanks @Stef - both very good points. And yes, reducing the number of bins / points in linear sections of the curve was exactly what I was going for. – Morton Dec 13 '21 at 18:14
  • Related questions: [Continuous Piecewise-Linear Fit in Python](https://stackoverflow.com/questions/22028529/continuous-piecewise-linear-fit-in-python); and this one looks really cool: [Fit piecewise linear data](https://dsp.stackexchange.com/questions/1227/fit-piecewise-linear-data) – Stef Dec 13 '21 at 21:38
  • If the chosen points have to be on the curve, then the problem is called _interpolation_. However, your questions, "Is there a library I could have used? Is there a nice term of this problem type that I failed to google out? Does this generalize to a broader problem set?", are not really on-topic here. The first because software recommendations are off-topic; the others because they should be asked on [math.se] instead. – Peter O. Dec 14 '21 at 01:42
  • @PeterO. I've never seen the word "interpolation" used in that sense. I've always seen it used the other way around: if you already have the points, and want to fit a curve through these points, then that's interpolation. But here we already have the curve, and we want to find the points. – Stef Dec 14 '21 at 09:31
  • [Inverse transform sampling](https://en.wikipedia.org/wiki/Inverse_transform_sampling) or sampling from cumulative distribution function may be what you are seeking. [This](https://stackoverflow.com/questions/61248310/generate-a-mesh-with-unequal-steps-based-on-a-density-function-using-matlab/61254341#61254341) may be a related post. – rahnema1 Dec 15 '21 at 15:45
  • @rahnema1 A more detailed explanation of how to apply inverse transform sampling to the OP's problem might be in order. I don't think the connection is obvious? – Stef Dec 15 '21 at 15:47
  • @Stef OP wants to estimate a curve with a fixed number of points. Taking equally spaced points on the CDF and transforming them back gives a good spread of points (a rough sketch appears after these comments). The second link describes such a problem. – rahnema1 Dec 15 '21 at 15:52
  • Hi @Morton; you can take a look at the curve simplification algorithm of [Ramer–Douglas–Peucker](https://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93Peucker_algorithm) (a minimal sketch appears after these comments). – Ali Değer Ozbakir Feb 19 '22 at 19:00
  • Interesting. Thanks! Looking at the animation, it feels very similar to how I implemented it - it's a good source to check if I missed out on any optimisations. – Morton Feb 19 '22 at 19:04
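
For reference, here are minimal sketches of the approaches suggested in the comments above. They are illustrative only: each assumes the xs/ys arrays from the question, and every function name is mine.

The decision-tree idea (Stef's comment), assuming scikit-learn is installed: each leaf becomes a sub-interval whose prediction is the mean of ys over it, so the result is a step function with at most bin_count pieces:

from sklearn.tree import DecisionTreeRegressor

def discretize_tree(xs, ys, bin_count):
    # Each leaf is a sub-interval; its prediction is the mean y over that interval
    tree = DecisionTreeRegressor(max_leaf_nodes=bin_count)
    tree.fit(xs.reshape(-1, 1), ys)
    return tree.predict(xs.reshape(-1, 1))  # piecewise-constant approximation of ys

Inverse transform sampling (rahnema1's comment), here assuming a point density proportional to the local slope magnitude:

def sample_by_inverse_transform(xs, ys, n_points, eps=1e-9):
    # Density ~ |dy/dx|: spend more points where the curve changes fastest
    density = np.abs(np.gradient(ys, xs)) + eps
    cdf = np.cumsum(density)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])  # normalize to [0, 1]
    targets = np.linspace(0.0, 1.0, num=n_points)
    new_xs = np.interp(targets, cdf, xs)       # invert the empirical CDF
    return new_xs, np.interp(new_xs, xs, ys)

Ramer–Douglas–Peucker (Ali Değer Ozbakir's comment); note it takes a tolerance epsilon rather than a target point count, so hitting exactly N points requires e.g. a binary search over epsilon:

def rdp(points, epsilon):
    # Perpendicular distance of every point to the chord from the first to the last point
    start, end = points[0], points[-1]
    dx, dy = end - start
    norm = np.hypot(dx, dy) + 1e-12
    dists = np.abs(dx * (points[:, 1] - start[1]) - dy * (points[:, 0] - start[0])) / norm
    worst = np.argmax(dists)
    if dists[worst] > epsilon:  # split at the worst point and recurse on both halves
        left = rdp(points[:worst + 1], epsilon)
        right = rdp(points[worst:], epsilon)
        return np.vstack([left[:-1], right])
    return np.vstack([start, end])  # the whole chord is within tolerance

# Usage: keep = rdp(np.column_stack([xs, ys]), epsilon=0.01)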

1 Answer


Is there a nice term for this type of problem that I failed to google? Does this generalize to a broader problem set?

I know this problem from Expected Improvement (EI), an acquisition function used in Bayesian optimization (permalink on archive.org). Given an expensive black-box function whose global maximum you would like to find, this algorithm yields the next position at which to check for that maximum.

At first glance, this is different from your problem: you are looking for a way to approximate a curve with a small number of samples, while EI tells you where the function most likely attains its maximum. But both problems are equivalent insofar as you minimize an error function (which changes whenever you add another sample to your approximation) with the fewest possible points.
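
For concreteness, here is a minimal sketch of the standard closed-form EI acquisition value from Jones et al. (1998), cited below. It assumes a surrogate model has already produced a posterior mean mu and standard deviation sigma at each candidate point; the variable names are mine:

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    # Closed-form EI for maximization: (mu - f_best) * Phi(z) + sigma * phi(z)
    sigma = np.maximum(sigma, 1e-12)  # guard against zero predictive variance
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

# The next sample is taken where EI is largest:
# next_x = candidates[np.argmax(expected_improvement(mu, sigma, f_best))]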

I believe this is the original research paper.

Jones, D. R., Schonlau, M., & Welch, W. J. (1998). Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization, 13, 455–492. doi:10.1023/A:1008306431147

From section 1:

[...] the technique often requires the fewest function evaluations of all competing methods. This is possible because, with typical engineering functions, one can often interpolate and extrapolate quite accurately over large distances in the design space. Intuitively, the method is able to ‘see’ obvious trends or patterns in the data and ‘jump to conclusions’ instead of having to move step-by-step along some trajectory.

As to why it is efficient:

[...] the response surface approach provides a credible stopping rule based on the expected improvement from further searching. Such a stopping rule is possible because the statistical model provides confidence intervals on the function’s value at unsampled points – and the ‘reasonableness’ of these confidence intervals can be checked by model validation techniques.

sarema
  • On a surface level, this, along with the comments under the question, gives an answer I was looking for. However, I want to do this answer justice and verify it myself when I have the time. Realistically, in a few months' time. – Morton Feb 19 '22 at 19:06