
Let the values in the array A be sampled from a Gaussian distribution. I want to replace every value in A with one of n_R "representatives" in R so that the total quantization error is minimized.

Here is NumPy code that does linear quantization:

import numpy as np

n_A, n_R = 1_000_000, 256
mu, sig = 500, 250
A = np.random.normal(mu, sig, size = n_A)
lo, hi = np.min(A), np.max(A)
# n_R evenly spaced representatives over the observed range
R = np.linspace(lo, hi, n_R)
# Index of the nearest representative for every value in A
I = np.round((A - lo) * (n_R - 1) / (hi - lo)).astype(np.uint32)

L = np.mean(np.abs(A - R[I]))
print('Linear loss:', L)
-> Linear loss: 2.3303939600700603

While this works, the quantization error is large. Is there a smarter way to do it? I'm thinking that one could take advantage of A being normally distributed or perhaps use an iterative process that minimizes the "loss" function.

Update: While researching this question, I found a related question about "weighting" the quantization. Adapting that method sometimes gives better quantization results:

from scipy.stats import norm

dist = norm(loc = mu, scale = sig)
# Representatives at equally spaced quantiles of the fitted normal,
# restricted to the +/- 3 sigma range
bounds = dist.cdf([mu - 3*sig, mu + 3*sig])
pp = np.linspace(*bounds, n_R)
R = dist.ppf(pp)

# Find closest matches
lhits = np.clip(np.searchsorted(R, A, 'left'), 0, n_R - 1)
rhits = np.clip(np.searchsorted(R, A, 'right') - 1, 0, n_R - 1)

ldiff = R[lhits] - A  # distance to the representative at or above each value
rdiff = A - R[rhits]  # distance to the representative at or below each value
I = lhits
idx = np.where(rdiff < ldiff)[0]
I[idx] = rhits[idx]   # keep whichever neighbour is closer

L = np.mean(np.abs(A - R[I]))
print('Gaussian loss:', L)
-> Gaussian loss: 1.6521974945326285

K-means clustering might work better but seems too slow to be practical on large arrays.
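
For reference, one way to keep K-means tractable would be to fit it on a random sub-sample only; something like the sketch below, using scikit-learn's MiniBatchKMeans with an arbitrarily chosen sub-sample size of 10,000 (neither is part of my actual setup):

from sklearn.cluster import MiniBatchKMeans

# Cluster a small random sub-sample; 1D data must be reshaped to a column
A_samp = np.random.choice(A, size = 10_000).reshape(-1, 1)
km = MiniBatchKMeans(n_clusters = n_R).fit(A_samp)
R = np.sort(km.cluster_centers_.ravel())  # cluster centres as representatives
# R can then be scored with the same closest-match code as above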

Björn Lindqvist
  • Sounds vaguely like it should be possible to rephrase this as an integer linear programming (ILP) problem. – Homer512 Jul 30 '23 at 17:29
  • @Reinderien I updated my question. `R` is not guaranteed to be linear. Linear quantization is my baseline and I'm trying to figure out if there are better ways to quantize normally distributed data. – Björn Lindqvist Jul 30 '23 at 22:19
  • Are you trying to compute an optimal `R` array? If not, what are you trying to compute? – user2357112 Jul 30 '23 at 22:22
  • Exactly. I'm trying to find values for `R` that minimizes the "encoding error". – Björn Lindqvist Jul 30 '23 at 22:28

2 Answers


K-means

K-means clustering might work better but seems too slow to be practical on large arrays.

For the 1D clustering case, there are algorithms faster than K-means. See https://stats.stackexchange.com/questions/40454/determine-different-clusters-of-1d-data-from-database

I picked one of those algorithms, Jenks Natural Breaks, and ran it on a random sub-sample of your dataset:

import jenkspy

A_samp = np.random.choice(A, size=10000)
breaks = np.array(jenkspy.jenks_breaks(A_samp, n_classes=n_R))
R = (breaks[:-1] + breaks[1:]) / 2  # class midpoints as representatives

This is pretty fast, and gets a quantization loss for the full dataset of about 1.28.
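
For completeness, the loss on the full dataset can be scored with this R along these lines (a small sketch; bin edges halfway between adjacent representatives give nearest-representative assignment directly):

edges = (R[:-1] + R[1:]) / 2
I = np.digitize(A, edges)
print('Jenks loss:', np.mean(np.abs(A - R[I])))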

To visualize what each of these methods is doing, I plotted the CDF of the breaks that each of them comes up with against the index within R of the break.

[QQ plot: CDF of each method's breaks vs. index within R]

Gaussian is a straight line, by definition. This means that it has an equal number of breaks at every percentile of the distribution. The linear method spends very few of its breaks in the middle of the distribution, and uses most of them at the tails. Jenks finds a compromise between the two of them.

Automatically searching for lower loss

Looking at the chart above, I had an idea: all of these methods of choosing breaks produce sigmoid-shaped curves of various sorts when plotted in the quantile domain. (Gaussian sort of fits if you think of it as a really stretched-out sigmoid.)

I wrote a function which parameterized each of those curves using a single variable, `strength`, which controls how fast the sigmoid curves. Once I had that, I used `scipy.optimize.minimize` to automatically search for a curve which minimized the loss.

It turns out that if you let Scipy optimize this, it picks a curve strength really close to Jenks, and the curve it finds is slightly worse than the Jenks one, with a loss of about 1.33.

You can see the notebook with this failed approach here.
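
The notebook isn't reproduced here, but a minimal sketch of what such a one-parameter search could look like follows. The logistic parameterization in the quantile domain and the use of minimize_scalar (since there is only one free variable) are guesses at the general shape, not the exact code from the notebook:

from scipy.optimize import minimize_scalar

def sigmoid_quantile_breaks(strength, n_R):
    # Hypothetical family of break curves: each break's percentile follows a
    # logistic curve of its index, mapped to data space through the empirical
    # quantile function of A
    u = np.linspace(-0.5, 0.5, n_R)
    q = 1 / (1 + np.exp(-strength * u))
    q = (q - q[0]) / (q[-1] - q[0])  # stretch the percentiles to span 0..1
    return np.quantile(A, q)

def quantization_loss(strength):
    R = sigmoid_quantile_breaks(strength, n_R)
    edges = (R[:-1] + R[1:]) / 2     # nearest-representative bin edges
    return np.mean(np.abs(A - R[np.digitize(A, edges)]))

res = minimize_scalar(quantization_loss, bounds=(0.1, 50), method='bounded')
print(res.x, res.fun)                # best strength and the loss it achieves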

Quantizing with 2^16 floats

In the case where you need to create 2^16 different representatives, it's computationally infeasible to use Jenks. However, you can do something that's pretty close: Jenks with a small number of classes plus linear interpolation.

Here's the code for this:

import itertools


def pairwise(iterable):
    "s -> (s0, s1), (s1, s2), (s2, s3), ..."
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)


def linspace_jenks(A, n_R, jenks_classes, dist_lo, dist_hi):
    assert n_R % jenks_classes == 0, "jenks_classes must be a divisor of n_R"
    simplify_factor = n_R // jenks_classes
    assert jenks_classes ** 2 <= len(A), "Need more data to estimate"
    breaks = jenkspy.jenks_breaks(A, n_classes=jenks_classes)
    # Widen the lowest and highest breaks to the lowest/highest observed values (dist_lo, dist_hi)
    breaks[0] = dist_lo
    breaks[-1] = dist_hi
    linspace_classes = []
    for lo, hi in pairwise(breaks):
        linspace_classes.append(np.linspace(lo, hi, simplify_factor, endpoint=False))
    linspace_classes = np.hstack(linspace_classes)
    assert len(linspace_classes) == n_R
    return linspace_classes

Example call:

A_samp = np.random.choice(A, size = 2**16)
jenks_R = linspace_jenks(A_samp, n_R, 128, np.min(A), np.max(A))

How does the performance compare to the linear method? On my system, I get a loss of 0.009421 for linear with n_R=2^16. The following graph shows the losses that the linspace_jenks method gets for each value of jenks_classes.

[Graph: linspace_jenks loss for each value of jenks_classes]

With just 32 Jenks classes, and filling the rest in with linear interpolation, the loss goes down to 0.005031.
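
The sweep behind that graph can be reproduced along these lines (a sketch; the particular class counts are just an example, and n_R is set to 2^16 for the 16-bit scenario):

n_R = 2**16
A_samp = np.random.choice(A, size = 2**16)
for jenks_classes in (16, 32, 64, 128):
    R = linspace_jenks(A_samp, n_R, jenks_classes, np.min(A), np.max(A))
    edges = (R[:-1] + R[1:]) / 2
    loss = np.mean(np.abs(A - R[np.digitize(A, edges)]))
    print(jenks_classes, loss)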

Nick ODell
  • Super interesting! I have to investigate this more. One comment though. Since Jenks yields the class boundaries shouldn't you compute the centroids like this: `(breaks[:-1] + breaks[1:]) / 2`? – Björn Lindqvist Jul 31 '23 at 04:28
  • @BjörnLindqvist Hmm, you're right. Using centroids instead of breaks seems to reduce the loss by ~0.015. – Nick ODell Jul 31 '23 at 15:39
  • Unfortunately, the Jenks algorithm is only tractable when the number of classes is small. My use case is quantizing floats to 16 bits which requires 65536 classes and is too much for Jenks. Very interesting approach nonetheless. – Björn Lindqvist Aug 06 '23 at 14:20
  • @BjörnLindqvist See edit. Using Jenks with 32 classes is enough to capture the shape of the distribution. – Nick ODell Aug 06 '23 at 20:30

Partially for novelty and mostly for completeness, I demonstrate what @Homer512 correctly suggests is possible: a MILP implementation. I expect its accuracy to be excellent and its performance to be somewhere between "poor" and "gruesome".

I demonstrate with a very small problem size so that when you debug and view the constraint matrices they are legible, and so that my RAM doesn't explode.

import time

import numpy as np
import scipy.sparse as sp
from numpy.random import default_rng
from scipy.optimize import milp, Bounds, LinearConstraint

n_A, n_R = 10, 4
rand = default_rng(seed=0)
A = rand.normal(loc=100, scale=50, size=n_A)
lo, hi = A.min(), A.max()

A_order = A.argsort()  # for use if you want to restore the original order later
# A = A[A_order]       # if you want to modify the algorithm to assume ordered input

'''
decision variables:
    nA*nR binary assignments (sparse)
    nA discretized values (dense)
    nR discretization levels (dense) aka. R
    nA errors (dense)
'''
c = np.zeros(n_A*n_R + n_A + n_R + n_A)
c[-n_A:] = 1  # minimize errors

integrality = np.ones(n_A*n_R + n_A + n_R + n_A, dtype=np.uint8)
integrality[n_A*n_R:] = 0  # only assignments are integral

lb = np.full_like(c, -np.inf)
ub = np.full_like(c, +np.inf)
lb[:n_A*n_R] = 0  # assignments are binary
ub[:n_A*n_R] = 1

# Big-M magnitude based on range of input data, without assuming signs
M = 2*max(abs(lo), abs(hi))

# I- Each input value must be assigned exactly one discretized value (Kronecker)
exclusive_assignment = LinearConstraint(
    A=sp.hstack((
        sp.kron(sp.eye(n_A), np.ones(n_R)),
        sp.csc_array((n_A, n_A)),
        sp.csc_array((n_A, n_R)),
        sp.csc_array((n_A, n_A)),
    )),
    lb=np.ones(n_A), ub=np.ones(n_A),
)

# II- If assigned, discretized output must be <= level (Kronecker)
# discretized_output <= level + (1-assigned)*M
# assigned*M + discretized_output - level <= M
# III- If assigned, discretized output must be >= level
# discretized_output >= level - (1-assigned)*M
# assigned*-M + discretized_output - level >= -M
discrete_sparse_to_level = LinearConstraint(
    A=sp.bmat((
        (
            sp.eye(n_A*n_R) * M,
            sp.kron(sp.eye(n_A), np.ones((n_R, 1))),
            sp.csc_array(
                np.tile(-np.eye(n_R), (n_A, 1))
            ),
            sp.csc_array((n_A*n_R, n_A)),
        ),
        (
            sp.eye(n_A*n_R) * -M,
            sp.kron(sp.eye(n_A), np.ones((n_R, 1))),
            sp.csc_array(
                np.tile(-np.eye(n_R), (n_A, 1))
            ),
            sp.csc_array((n_A*n_R, n_A)),
        ),
    ), format='csc'),
    lb=np.concatenate((np.full(n_A*n_R, -np.inf), np.full(n_A*n_R, -M))),
    ub=np.concatenate((np.full(n_A*n_R, M),       np.full(n_A*n_R, np.inf))),
)

# IV- error >= output - input (Kronecker)
# A.assign - output + error >= 0
# V- error >= input - output
# -A.assign + output + error >= 0
abs_error = LinearConstraint(
    A=sp.bmat((
        (
            sp.kron(sp.diags(A), np.ones(n_R)),
            -sp.eye(n_A),
            sp.csc_array((n_A, n_R)),
            np.eye(n_A),
        ),
        (
            sp.kron(sp.diags(-A), np.ones(n_R)),
            sp.eye(n_A),
            sp.csc_array((n_A, n_R)),
            np.eye(n_A),
        ),
    ), format='csc'),
    lb=np.concatenate((np.zeros(n_A),        np.zeros(n_A))),
    ub=np.concatenate((np.full(n_A, np.inf), np.full(n_A, np.inf))),
)

level_order = LinearConstraint(
    A=sp.hstack((
        sp.csc_array((n_R-1, n_A*n_R)),
        sp.csc_array((n_R-1, n_A)),
        sp.eye(n_R-1, n_R, k=1) - sp.eye(n_R-1, n_R),
        sp.csc_array((n_R-1, n_A)),
    )),
    lb=np.zeros(n_R-1),
    ub=np.full(n_R-1, np.inf),
)

start = time.perf_counter()
res = milp(
    c=c, integrality=integrality, bounds=Bounds(lb=lb, ub=ub),
    constraints=(
        exclusive_assignment,
        abs_error,
        discrete_sparse_to_level,
        # level_order,
    ),
)
stop = time.perf_counter()
print(f'Completed in {stop - start:.4f} s')
assert res.success

assign, discretized, levels, errors = np.split(
    res.x, (n_A*n_R, n_A*n_R + n_A, -n_A)
)
assign = assign.reshape((n_A, n_R))
print(levels)
print(np.stack((A, discretized, errors), axis=1))
print(assign)
Reinderien