Taking equal number of elements from two arrays, such that the taken values have as few duplicates as possible

Question

Consider we have 2 arrays of size N, with their values in the range [0, N-1]. For example:

a = np.array([0, 1, 2, 0])
b = np.array([2, 0, 3, 3])

I need to produce a new array c which contains exactly N/2 elements from a and b respectively, i.e. the values must be taken evenly/equally from both parent arrays.

(For odd length, this would be (N-1)/2 and (N+1)/2. Can also ignore odd length case, not important).

Taking equal number of elements from two arrays is pretty trivial, but there is an additional constraint: c should have as many unique numbers as possible / as few duplicates as possible.

For example, a solution to a and b above is:

c = np.array([b[0], a[1], b[2], a[3]])
>>> c
array([2, 1, 3, 0])

Note that the position/order is preserved. Each element of a and b that we took to form c is in same position. If element i in c is from a, c[i] == a[i], same for b.

A straightforward solution for this is simply a sort of path traversal, easy enough to implement recursively:

def traverse(i, a, b, path, n_a, n_b, best, best_path):
    if n_a == 0 and n_b == 0:
        score = len(set(path))
        return (score, path.copy()) if score > best else (best, best_path)

    if n_a > 0:
        path.append(a[i])
        best, best_path = traverse(i + 1, a, b, path, n_a - 1, n_b, best, best_path)
        path.pop()
    
    if n_b > 0:
        path.append(b[i])
        best, best_path = traverse(i + 1, a, b, path, n_a, n_b - 1, best, best_path)
        path.pop()

    return best, best_path

Here n_a and n_b are how many values we will take from a and b respectively, it's 2 and 2 as we want to evenly take 4 items.

>>> score, best_path = traverse(0, a, b, [], 2, 2, 0, None)
>>> score, best_path
(4, [2, 1, 3, 0])

Is there a way to implement the above in a more vectorized/efficient manner, possibly through numpy?

To be clear: when you say "sample", it sounds like you mean sample randomly. If not, if you mean a deterministic sample, you need to explain more about how that works. — Reinderien, Dec 12 '22 at 13:39
You say that the arrays are each of size N containing [0, N-1]. So there will never be any uniques, right? The arrays cover the same values. — Reinderien, Dec 12 '22 at 14:22
"Taken evenly" requires more explanation. Do you need to preserve order? Do you need to attempt some uniform sample index spacing? — Reinderien, Dec 12 '22 at 14:32
@Reinderien 1) "Sample" seems incorrect, yes. To be very clear, I need to take equally from both. If `a` and `b` are of length 10, then I need *exactly* 5 from each to make `c`. If they are odd, e.g. 9, then I can take 4 and 5 (or vice versa). 2) The values are in range [0, N-1]. Each array can individually be fully unique ([0, 1, 2]), or even be full of duplicates, like ([0, 0, 0]). 3) Order needs to be preserved, like the example I gave. You can kinda think of it like I am interleaving `a` and `b`, to produce a maximally diverse `c`. — Mercury, Dec 12 '22 at 16:00

Reinderien · Answer 1 · 2022-12-12T23:53:45.897

1

Test this heavily.

import numpy as np
from numpy.random._generator import default_rng


rand = default_rng(seed=1)
n = 16
a = rand.integers(low=0, high=n, size=n)
b = rand.integers(low=0, high=n, size=n)
uniques = np.setxor1d(a, b)
print(a)
print(b)
print(uniques)


def limited_uniques(arr: np.ndarray) -> np.ndarray:
    choose = np.zeros(shape=n, dtype=bool)
    _, idx, _ = np.intersect1d(arr, uniques, return_indices=True)
    idx = idx[:n//2]
    choose[idx] = True
    n_missing = n//2 - len(idx)
    counts = choose.cumsum()
    diffs = np.arange(n) - counts
    at = np.searchsorted(diffs, n_missing)
    choose[:at] = True
    return arr[choose]


a_half = limited_uniques(a)
uniques = np.union1d(uniques, np.setdiff1d(a, a_half))

interleaved = np.empty_like(a)
interleaved[0::2] = a_half
interleaved[1::2] = limited_uniques(b)
print(interleaved)

[ 7  8 12 15  0  2 13 15  3  4 13  6  4 13  4  6]
[10  8  1  0 13 12 13  8 13  5  7 12  1  4  1  7]
[ 1  2  3  5  6 10 15]
[ 7 10  8  8 12  1 15  0  0 13  2 12  3  5  6  4]

edited Dec 12 '22 at 23:53

answered Dec 12 '22 at 23:15

Reinderien

11,755
5
49
77

This algorithm is not equivalent to the one of the OP and it does not find the optimal score. The optimal score is 13 (with `n_a, n_b = (8,8)`) while your solution gives 12. One better path is `[10,8,12,15,0,2,13,8,3,5,7,6,1,4,1,7]`. – Jérôme Richard Dec 12 '22 at 23:25
@JérômeRichard Try again; though I will say that the non-equivalence is deliberate. Among other things that should change, this does an actual index interleave. – Reinderien Dec 12 '22 at 23:55
It still do not work. There are cases where the algorithm gives wrong path and over-optimal values. For example, for `a = [ 6 9 7 12 5 9 12 14 6 0 11 8 13 7 5 0]` and `b = [ 7 10 12 13 3 9 12 4 5 13 9 8 10 8 15 12]`, the OP solution is `[ 6 9 7 12 3 9 12 4 5 13 11 8 10 8 15 0]` with a score of 13 while your solution give a score of 14 with a path `[ 6 7 9 10 7 12 ...]`. There is no 7 in the second item. I am not sure the approach can work. – Jérôme Richard Dec 13 '22 at 01:19
@Reinderien thanks, I will test this out later. Even if it isn't optimal, if the deviation from best possible solution is miniscule, I feel I could still get some use out of this as a heuristic/approximation when working on larger `n`. *Edit:* However it seems like there are positions at which `interleaved[i]` is not equal to `a[i]` nor `b[i]`, i.e. position isn't conserved? – Mercury Dec 14 '22 at 08:17
Please offer an example – Reinderien Dec 14 '22 at 12:32

score 1 · Accepted Answer · answered Dec 13 '22 at 03:02

The algorithm is slow mainly because it runs in an exponential time. There is no straightforward way to vectorize this algorithm using only Numpy because of the recursion. Even if it would be possible, the huge number of combinations would cause most Numpy implementations to be inefficient (due to large Numpy arrays to compute). Additionally, there is AFAIK no vectorized operation to count the number of unique values of many rows efficiently (the usual way is to use np.unique which is not efficient in this case and cannot be use without a loop). As a result, there is two possible strategy to speed this up:

trying to find an algorithm with a reasonable complexity (eg. <= O(n^4));
using compilation methods, micro-optimizations and tricks to write a faster brute-force implementation.

Since finding a correct sub-exponential algorithm turns out not to be easy, I choose the other approach (though the first approach is the best).

The idea is to:

remove the recursion by generating all possible solutions using a loop iterating on integer;
write a fast way to count unique items of an array;
use the Numba JIT compiler so to optimize the code that is only efficient once compiled.

Here is the final code:

import numpy as np
import numba as nb

# Naive way to count unique items.
# This is a slow fallback implementation.
@nb.njit
def naive_count_unique(arr):
    count = 0
    for i in range(len(arr)):
        val = arr[i]
        found = False
        for j in range(i):
            if arr[j] == val:
                found = True
                break
        if not found:
            count += 1
    return count

# Optimized way to count unique items on small arrays.
# Count items 2 by 2.
# Fast on small arrays.
@nb.njit
def optim_count_unique(arr):
    count = 0
    for i in range(0, len(arr), 2):
        if arr[i] == arr[i+1]:
            tmp = 1
            for j in range(i):
                if arr[j] == arr[i]: tmp = 0
            count += tmp
        else:
            val1, val2 = arr[i], arr[i+1]
            tmp1, tmp2 = 1, 1
            for j in range(i):
                val = arr[j]
                if val == val1: tmp1 = 0
                if val == val2: tmp2 = 0
            count += tmp1 + tmp2
    return count

@nb.njit
def count_unique(arr):
    if len(arr) % 2 == 0:
        return optim_count_unique(arr)
    else:
        # Odd case: not optimized yet
        return naive_count_unique(arr)

# Count the number of bits in a 32-bit integer
# See https://stackoverflow.com/questions/71097470/msb-lsb-popcount-in-numba
@nb.njit('int_(uint32)', inline='always')
def popcount(v):
    v = v - ((v >> 1) & 0x55555555)
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333)
    c = np.uint32((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24
    return c

# Count the number of bits in a 64-bit integer
@nb.njit(inline='always')
def bit_count(n):
    if n < (1 << 30):
        return popcount(np.uint32(n))
    else:
        return popcount(np.uint32(n)) + popcount(np.uint32(n >> 32))

# Mutate `out` so not to create an expensive new temporary array
@nb.njit
def int_to_path(n, out, a, b):
    for i in range(len(out)):
        out[i] = a[i] if ((n >> i) & 1) else b[i]

@nb.njit(['(int32[:], int32[:], int64, int64)', '(int64[:], int64[:], int64, int64)'])
def traverse_fast(a, b, n_a, n_b):
    # This assertion is needed because the paths are encoded using 64-bit.
    # This should not be a problem in practice since the number of solutions to
    # test would be impracticably huge to test using this algorithm anyway.
    assert n_a + n_b < 62

    max_iter = 1 << (n_a + n_b)
    path = np.empty(n_a + n_b, dtype=a.dtype)
    score, best_score, best_i = 0, 0, 0

    # Iterate over all cases (more than the set of possible solution)
    for i in range(max_iter):
        # Filter the possible solutions
        if bit_count(i) != n_b:
            continue

        # Analyse the score of the solution
        int_to_path(i, path, a, b)
        score = count_unique(path)

        # Store it if it better than the previous one
        if score > best_score:
            best_score = score
            best_i = i

    int_to_path(best_i, path, a, b)
    return best_score, path

This implementation is about 30 times faster on arrays of size 8 on my machine. On could use several cores to speed this up even further. However, I think it is better to focus on finding a sub-exponential implementation so to avoid wasting more computing resources. Note that the path is different from the initial function but the score is the same on random arrays. It can help others to test their implementation on larger arrays without waiting for a long time.

Thanks, this has given me some things to think about. (I am currently in a flight layover; I'll test out the code once I land!) — Mercury, Dec 14 '22 at 05:17
The jit-compiled solution works great, many small performance tricks there. Though yes, I should probably take the time to find a better algorithm under the hood, as you suggested. — Mercury, Dec 14 '22 at 09:39

Taking equal number of elements from two arrays, such that the taken values have as few duplicates as possible

2 Answers2