I have a continuous input function which I would like to discretize into, let's say, 5-10 discrete bins between 0 and 1. Right now I am using np.digitize and rescaling the output bins to 0-1. The problem is that some datasets (blue line) yield results like this:

Ugly/noisy discretization where especially in the beginning the code creates unnecessary noise

I tried increasing the number of discretization bins, but the noise stayed the same and I just got more increments. Here is an example where the same code and settings worked on another dataset:

Same code but more desired results

This is the code I used, where NumOfDisc is the number of bins:

intervals = np.linspace(0,1,NumOfDisc)
discretized_Array = np.digitize(Continuous_Array, intervals)
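
A minimal sketch of the digitize-and-rescale step, using a made-up signal rather than the experimental data (`signal`, `num_bins` and the noise level are assumptions for illustration). It suggests one way the noise can arise: a nearly flat signal that sits on a bin edge keeps flipping between two neighbouring levels.

```python
import numpy as np

rng = np.random.default_rng(0)

num_bins = 5  # hypothetical; plays the role of NumOfDisc
edges = np.linspace(0, 1, num_bins + 1)  # 6 edges delimit 5 bins

# Made-up, nearly flat signal sitting right on the 0.4 edge, with small noise
signal = 0.4 + rng.normal(0, 0.01, size=200)

# np.digitize returns 1-based bin indices for values inside [0, 1)
idx = np.digitize(signal, edges)
# Clip so that out-of-range values (e.g. exactly 1.0) land in the first/last bin
idx = np.clip(idx, 1, num_bins)
# Rescale the indices 1..num_bins to the range 0..1
levels = (idx - 1) / (num_bins - 1)

jumps = int(np.count_nonzero(np.diff(levels)))
print("levels used:", np.unique(levels))
print("number of jumps:", jumps)
```

Although the input barely moves, the output alternates between two levels many times, which is the kind of noise visible at the start of the first plot.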

The red line in the graph is not important. The continuous blue line is the one I am trying to discretize, and the green line is the discretized result. The graphs are created with matplotlib.pyplot using the following code:

import logging

import matplotlib.pyplot as plt


def CheckPlots(discretized_Array, Continuous_Array, Temperature, time, PlotName):
    logging.info("Plotting...")

    # Setting axis properties and titles
    fig, ax = plt.subplots(1, 1)
    ax.set_title(PlotName)
    ax.set_ylabel('Temperature [°C]')
    ax.set_ylim(40, 110)
    ax.set_xlabel('Time [s]')
    # `b=` was renamed to `visible=` in Matplotlib 3.5; the positional form works in both
    ax.grid(True, which="both")
    ax2 = ax.twinx()
    ax2.set_ylabel('DC Power [%]')
    ax2.set_ylim(-1.5, 3.5)

    # Plotting stuff
    ax.plot(time, Temperature, label="Input Temperature", color='#c70e04')
    ax2.plot(time, Continuous_Array, label="Continuous Power", color='#040ec7')
    ax2.plot(time, discretized_Array, label="Discrete Power", color='#539600')

    fig.legend(loc="upper left", bbox_to_anchor=(0, 1), bbox_transform=ax.transAxes)

    logging.info("Done!")
    logging.info("---")
    return

Any ideas on what I could do to get sensible discretizations like in the second case?

2 Answers

If what I described in the comments is the problem, there are a few options to deal with this:

  1. Do nothing: Depending on the reason you're discretizing, you might want the discrete values to reflect the continuous values accurately
  2. Change the bins: you could shift the bins or change the number of bins, such that relatively 'flat' parts of the blue line stay within one bin, thus giving a flat green line in these parts as well, which would be visually more pleasing like in your second plot.
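
As a rough illustration of option 2, the sketch below counts how often the discretized signal changes level for two sets of bin edges. The signal, the noise level and the half-bin shift are assumptions chosen for the example; with real data the shift would have to be tuned to where the flat parts actually lie.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up signal: a flat plateau at 0.4 with small noise
signal = 0.4 + rng.normal(0, 0.01, size=200)


def count_jumps(values, edges):
    """Number of times the digitized signal switches to a different bin."""
    idx = np.digitize(values, edges)
    return int(np.count_nonzero(np.diff(idx)))


edges = np.linspace(0, 1, 6)       # 0.0, 0.2, 0.4, ..., 1.0
width = edges[1] - edges[0]
shifted = edges - width / 2        # -0.1, 0.1, 0.3, 0.5, ...

jumps_original = count_jumps(signal, edges)    # plateau straddles the 0.4 edge
jumps_shifted = count_jumps(signal, shifted)   # plateau sits mid-bin (0.3 to 0.5)

print("jumps with original edges:", jumps_original)
print("jumps with shifted edges:", jumps_shifted)
```

With the shifted edges the plateau stays inside a single bin, so the discretized line is flat there. Shifting only moves the problem if another flat part then ends up on an edge.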
Andre
  • 1. is not an option because the values need to be discretized. 2. I tried this one, but for some reason increasing the number of bins did not help... Right now I'm trying a new idea where I first hardcode the two constant lines at the beginning and the end, and then use the np.digitize function only on the remaining dynamic part in between the two constant values. – Desperate Python Beginner Dec 15 '21 at 06:57
  • Sorry, maybe I didn't explain the 1st option well, but what I meant was discretize the way you did, **then do nothing** else and accept that the method gives you a shaky green line. I didn't mean: "do not discretize it". – Andre Dec 15 '21 at 09:00
  • Also, I see that Morton's solution works very well, but it does not do the same as mapping the y-values of a continuous function into X bins. If Morton's solution is indeed what you wanted, great! If not I could update my answer to explain what I mean in more detail. Let me know! – Andre Dec 15 '21 at 09:23
  • Actually you are right. Morton's solution is great, and it's very elaborate and extensive, but it does not really map the continuous input into discrete bins. I spent some time thinking about how to further improve this, and I hardcoded the constant regions at the beginning as well as at the end, since these are always the same (by design of the experiment), and applied my discretization method. The result was a bit better but still not perfect. – Desperate Python Beginner Dec 17 '21 at 13:16
  • I then changed how the intervals are created in the code to `intervals = np.arange(min,max,0.05)`, where min and max are the lowest and highest values and 0.05 is the step size. – Desperate Python Beginner Dec 17 '21 at 13:16
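
A sketch of the step-based variant described in the last comment, assuming a made-up array `continuous` and the step size of 0.05 mentioned there. The names `lo`, `hi` and `quantized` are illustrative; `lo` and `hi` stand in for `min` and `max` so that the Python built-ins are not shadowed.

```python
import numpy as np

# Made-up continuous data between 0 and 1
continuous = np.array([0.02, 0.08, 0.31, 0.33, 0.58, 0.96])

step = 0.05
lo, hi = continuous.min(), continuous.max()

# Edges from the lowest value towards the highest value in steps of 0.05
intervals = np.arange(lo, hi, step)

# 1-based index of the interval each value falls into
idx = np.digitize(continuous, intervals)

# Map each index back to the lower edge of its interval
quantized = intervals[idx - 1]

print(idx)
print(quantized)
```

Because the edges now follow the data range instead of a fixed 0-1 grid, the number of levels depends on `hi - lo` and the step size rather than on a chosen bin count.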
The following solution gives the exact result you need.

Basically, the algorithm finds an ideal line (a polynomial fit of the data) and attempts to replicate it as well as it can with fewer data points. It starts with 2 points at the edges (a straight line), then adds one in the center, then checks which segment has the greatest error and adds a point in the center of that one, and so on, until it reaches the desired bin count. Simple :)

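import warnings

import numpy as np

# The high-degree polyfit below emits RankWarning; from NumPy 1.25 on it lives in np.exceptions
RankWarning = getattr(np, "RankWarning", None) or np.exceptions.RankWarning
warnings.simplefilter('ignore', RankWarning)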


def line_error(x0, y0, x1, y1, ideal_line, integral_points=100):
    """Assume a straight line between (x0,y0)->(x1,y1). Then sample the perfect line multiple times and compute the distance."""
    straight_line = np.poly1d(np.polyfit([x0, x1], [y0, y1], 1))
    xs = np.linspace(x0, x1, num=integral_points)
    ys = straight_line(xs)

    perfect_ys = ideal_line(xs)
    
    err = np.abs(ys - perfect_ys).sum() / integral_points * (x1 - x0)  # Remove (x1 - x0) to only look at avg errors
    return err


def discretize_bisect(xs, ys, bin_count):
    """Returns xs and ys of discrete points"""
    # For a large number of datapoints, without loss of generality you can treat xs and ys as bin edges
    # If it gives bad results, you can compute the edges in other ways, e.g. with np.histogram_bin_edges
    ideal_line = np.poly1d(np.polyfit(xs, ys, 50))
    
    new_xs = [xs[0], xs[-1]]
    new_ys = [ys[0], ys[-1]]
    
    while len(new_xs) < bin_count:
        
        errors = []
        for i in range(len(new_xs)-1):
            err = line_error(new_xs[i], new_ys[i], new_xs[i+1], new_ys[i+1], ideal_line)
            errors.append(err)

        max_segment_id = np.argmax(errors)
        new_x = (new_xs[max_segment_id] + new_xs[max_segment_id+1]) / 2
        new_y = ideal_line(new_x)
        new_xs.insert(max_segment_id+1, new_x)
        new_ys.insert(max_segment_id+1, new_y)

    return new_xs, new_ys


BIN_COUNT = 25

new_xs, new_ys = discretize_bisect(xs, ys, BIN_COUNT)

plot_graph(xs, ys, new_xs, new_ys, f"Discretized and Continuous comparison, N(cont) = {N_MOCK}, N(disc) = {BIN_COUNT}")
print("Bin count:", len(new_xs))
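
The call above expects `xs`, `ys` and `N_MOCK` to exist already. They are not shown here, so the following is only a hypothetical stand-in (a noisy sigmoid ramp invented for illustration, not the data from the question) that could be defined before the call:

```python
import numpy as np

rng = np.random.default_rng(42)

N_MOCK = 1000  # hypothetical number of continuous samples

# Normalized time axis from 0 to 1 (made up); a small range keeps the
# high-degree polynomial fit better conditioned than raw seconds would
xs = np.linspace(0, 1, N_MOCK)

# Smooth ramp from about 0 to about 1 with a little measurement noise
ys = 1 / (1 + np.exp(-(xs - 0.5) / 0.05)) + rng.normal(0, 0.01, size=N_MOCK)
```

Any other pair of equally long arrays should work the same way, as long as `xs` is sorted in increasing order.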

Moreover, here's my simplified plotting function I tested with.

def plot_graph(cont_time, cont_array, disc_time, disc_array, plot_name):
    """A simplified version of the provided plotting function"""
    
    # Setting Axis properties and titles
    fig, ax = plt.subplots(figsize=(20, 4))
    ax.set_title(plot_name)
    ax.set_xlabel('Time [s]')
    ax.set_ylabel('DC Power [%]')

    # Plotting stuff
    ax.plot(cont_time, cont_array, label="Continuous Power", color='#0000ff')
    ax.plot(disc_time, disc_array, label="Discrete Power",   color='#00ff00')

    fig.legend(loc="upper left", bbox_to_anchor=(0,1), bbox_transform=ax.transAxes)

Lastly, here's the Google Colab

Morton