
Starting from a NumPy ndarray:

>>> arr
[
    [
        [10, 4, 5, 6, 7],
        [11, 1, 2, 3, 4],
        [11, 5, 6, 7, 8]
    ],
    [
        [12, 4, 5, 6, 7],
        [12, 1, 2, 3, 4],
        [12, 5, 6, 7, 8]
    ],
    [
        [15, 4, 5, 6, 7],
        [15, 1, 2, 3, 4],
        [15, 5, 6, 7, 8]
    ],
    [
        [13, 4, 5, 6, 7],
        [13, 1, 2, 3, 4],
        [14, 5, 6, 7, 8]
    ],
    [
        [10, 4, 5, 6, 7],
        [11, 1, 2, 3, 4],
        [12, 5, 6, 7, 8]
    ]
]

I would like to keep only the sub-arrays (each a sequence of 3 rows) whose rows all share a single unique value at position 0, so as to obtain the following:

>>> new_arr
[
    [
        [12, 4, 5, 6, 7],
        [12, 1, 2, 3, 4],
        [12, 5, 6, 7, 8]
    ],
    [
        [15, 4, 5, 6, 7],
        [15, 1, 2, 3, 4],
        [15, 5, 6, 7, 8]
    ]
]

From the initial array, arr[0], arr[3] and arr[4] were discarded because each had more than one unique value at position 0 (respectively, [10, 11], [13, 14] and [10, 11, 12]).

I tried fiddling with numpy.unique(), but could only get the global unique values at position 0 across all sub-arrays, which is not what's needed here.
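For instance (a minimal sketch of what I tried), this collapses everything into one global set of 0-th values, losing the per-sub-array grouping:

```python
import numpy as np

arr = np.array([
    [[10, 4, 5, 6, 7], [11, 1, 2, 3, 4], [11, 5, 6, 7, 8]],
    [[12, 4, 5, 6, 7], [12, 1, 2, 3, 4], [12, 5, 6, 7, 8]],
    [[15, 4, 5, 6, 7], [15, 1, 2, 3, 4], [15, 5, 6, 7, 8]],
    [[13, 4, 5, 6, 7], [13, 1, 2, 3, 4], [14, 5, 6, 7, 8]],
    [[10, 4, 5, 6, 7], [11, 1, 2, 3, 4], [12, 5, 6, 7, 8]],
])

# All 0-th values of every row, flattened into one sorted set of uniques:
print(np.unique(arr[:, :, 0]))  # [10 11 12 13 14 15]
```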

-- EDIT

The following seems to get me closer to the solution:

>>> np.unique(arr[0, :, 0])
array([10, 11])

But I'm not sure how to get one level higher than this and apply that condition to each sub-array of arr without using a Python loop.

Jivan

4 Answers


I got this to work without any transposing.

arr = np.array(arr)
arr[np.all(arr[:, :, 0] == arr[:, :1, 0], axis=1)]
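To see why the broadcast comparison works, here is a sketch using the question's data (`firsts` and `anchor` are just illustrative names). Slicing with `:1` instead of indexing with `0` keeps a length-1 middle axis, so the anchor column broadcasts against all rows:

```python
import numpy as np

arr = np.array([
    [[10, 4, 5, 6, 7], [11, 1, 2, 3, 4], [11, 5, 6, 7, 8]],
    [[12, 4, 5, 6, 7], [12, 1, 2, 3, 4], [12, 5, 6, 7, 8]],
    [[15, 4, 5, 6, 7], [15, 1, 2, 3, 4], [15, 5, 6, 7, 8]],
    [[13, 4, 5, 6, 7], [13, 1, 2, 3, 4], [14, 5, 6, 7, 8]],
    [[10, 4, 5, 6, 7], [11, 1, 2, 3, 4], [12, 5, 6, 7, 8]],
])

firsts = arr[:, :, 0]   # shape (5, 3): the 0-th value of every row
anchor = arr[:, :1, 0]  # shape (5, 1): ':1' keeps the axis so it broadcasts
mask = np.all(firsts == anchor, axis=1)
print(mask)             # [False  True  True False False]
print(arr[mask].shape)  # (2, 3, 5): only the two all-equal sub-arrays remain
```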
rchome

I was interested to see how these methods compared, so I benchmarked the answers here using a large dataset of shape (4,000,000, 4, 4).

results

--------------------------------------------------------------------------------------- benchmark: 4 tests ---------------------------------------------------------------------------------------
Name (time in ms)            Min                   Max                  Mean             StdDev                Median                IQR            Outliers     OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_np_arr_T           128.3483 (1.0)        130.5462 (1.0)        129.0869 (1.0)       0.9536 (1.01)       128.5447 (1.0)       1.5660 (1.83)          2;0  7.7467 (1.0)           8           1
test_np_arr             128.5017 (1.00)       131.2399 (1.01)       129.2841 (1.00)      0.9414 (1.0)        128.9724 (1.00)      0.8553 (1.0)           1;1  7.7349 (1.00)          7           1
test_pure_py_set      2,840.2911 (22.13)    2,849.0413 (21.82)    2,844.4716 (22.04)     3.8494 (4.09)     2,846.1608 (22.14)     6.4168 (7.50)          3;0  0.3516 (0.05)          5           1
test_pure_py          3,688.4772 (28.74)    3,750.0933 (28.73)    3,717.3411 (28.80)    24.7294 (26.27)    3,707.3502 (28.84)    37.1902 (43.48)         2;0  0.2690 (0.03)          5           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

These benchmarks use pytest-benchmark, so I'd make a venv for running this:

python3 -m venv venv
. ./venv/bin/activate
pip install numpy pytest pytest-benchmark

Run the test:

pytest test_runs.py

test_runs.py

import numpy as np

# No guarantee this will produce sub-arrays with shared first index
ARR = np.random.randint(low=0, high=10, size=(4_000_000, 4, 4)).tolist()
# ARR = [
#     [[10, 4, 5, 6, 7], [11, 1, 2, 3, 4], [11, 5, 6, 7, 8]],
#     [[12, 4, 5, 6, 7], [12, 1, 2, 3, 4], [12, 5, 6, 7, 8]],
#     [[15, 4, 5, 6, 7], [15, 1, 2, 3, 4], [15, 5, 6, 7, 8]],
#     [[13, 4, 5, 6, 7], [13, 1, 2, 3, 4], [14, 5, 6, 7, 8]],
#     [[10, 4, 5, 6, 7], [11, 1, 2, 3, 4], [12, 5, 6, 7, 8]],
# ]

def pure_py(arr):
    new_array = []
    for i, v in enumerate(arr):
        first_elems = [x[0] for x in v]
        if all(elem == first_elems[0] for elem in first_elems):
            new_array.append(arr[i])
    return new_array

def pure_py_set(arr):
    new_array = []
    for sub_arr in arr:
        if len(set(x[0] for x in sub_arr)) == 1:
            new_array.append(sub_arr)
    return new_array

def np_arr(arr):
    return arr[np.all(arr[:, :, 0] == arr[:, :1, 0], axis=1)]

def np_arr_T(arr):
    return arr[(arr[:, :, 0].T == arr[:, 0, 0]).T.all(axis=1)]

def np_not_arr(arr):
    arr = np.array(arr)
    return arr[np.all(arr[:, :, 0] == arr[:, :1, 0], axis=1)]

RES = np_not_arr(ARR).tolist()

def test_pure_py(benchmark):
    res = benchmark(pure_py, ARR)
    assert res == RES

def test_pure_py_set(benchmark):
    res = benchmark(pure_py_set, ARR)
    assert res == RES

def test_np_arr(benchmark):
    ARR_ = np.array(ARR)
    res = benchmark(np_arr, ARR_)
    assert res.tolist() == RES

def test_np_arr_T(benchmark):
    ARR_ = np.array(ARR)
    res = benchmark(np_arr_T, ARR_)
    assert res.tolist() == RES
Alex
  • You might also be interested in how to speed it up even more. Python looping is usually quite slow; `numpy` speeds things up considerably. However, it can't reach the power of `numba`, `cython`, or `numexpr`, which optimise things as much as possible. – mathfux Nov 28 '21 at 21:02
  • Nice comparison. Did you also check the numpy array initialization time? – Michael Stachura Nov 29 '21 at 10:27
  • There is the `np_not_arr` function, which converts a list to a `np.array` before filtering. You can add another test function that benchmarks creation plus filtering. When I included this, it was always slower than the pure Python approach; also, iterating through an `ndarray` is _very_ slow. – Alex Nov 29 '21 at 11:28

Inspired by an attempt to reply in the form of an edit to the question (which I rejected, as it should have been an answer), here is something that worked:

>>> arr[(arr[:,:,0].T == arr[:,0,0]).T.all(axis=1)]
[
    [
        [12, 4, 5, 6, 7],
        [12, 1, 2, 3, 4],
        [12, 5, 6, 7, 8]
    ],
    [
        [15, 4, 5, 6, 7],
        [15, 1, 2, 3, 4],
        [15, 5, 6, 7, 8]
    ]
]

The trick was to transpose so that each comparison broadcasts correctly:

# all 0-th positions of each subarray
arr[:,:,0].T

# the 0-th position of the first row of each sub-array
arr[:,0,0]

# whether each 0-th position equals the first one
(arr[:,:,0].T == arr[:,0,0]).T

# keep only the sub-arrays where the above is true for all rows
(arr[:,:,0].T == arr[:,0,0]).T.all(axis=1)

# lastly, apply this indexing to the initial array
arr[(arr[:,:,0].T == arr[:,0,0]).T.all(axis=1)]
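The shapes make the broadcasting explicit; a quick sketch with the question's (5, 3, 5) array:

```python
import numpy as np

arr = np.array([
    [[10, 4, 5, 6, 7], [11, 1, 2, 3, 4], [11, 5, 6, 7, 8]],
    [[12, 4, 5, 6, 7], [12, 1, 2, 3, 4], [12, 5, 6, 7, 8]],
    [[15, 4, 5, 6, 7], [15, 1, 2, 3, 4], [15, 5, 6, 7, 8]],
    [[13, 4, 5, 6, 7], [13, 1, 2, 3, 4], [14, 5, 6, 7, 8]],
    [[10, 4, 5, 6, 7], [11, 1, 2, 3, 4], [12, 5, 6, 7, 8]],
])

print(arr[:, :, 0].T.shape)  # (3, 5): after the transpose, each column is one sub-array
print(arr[:, 0, 0].shape)    # (5,): broadcasts across the 3 rows of each column
mask = (arr[:, :, 0].T == arr[:, 0, 0]).T.all(axis=1)
print(mask)                  # [False  True  True False False]
```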
Jivan
  • Hi @Jivan, I proposed [that edit](https://stackoverflow.com/review/suggested-edits/30454029). I was suspended for a week, but I surprised myself by finding an answer; since I was suspended, I couldn't submit it as one. So I edited it in, hoping you'd reject it but also notice and use it :) –  Dec 07 '21 at 02:20
  • You had the right intuition then :) – Jivan Dec 07 '21 at 09:25

OK, I've compared two solutions for this problem: with numpy (script by @rchome) and without it, in pure Python.

new_array = []
for i, v in enumerate(arr):
    first_elems = [x[0] for x in v]
    if all(elem == first_elems[0] for elem in first_elems):
        new_array.append(arr[i])

This code's execution time: ±0:00:00.000015.

arr = np.array(arr)
new_array = arr[np.all(arr[:, :, 0] == arr[:, :1, 0], axis=1)]

This code's execution time: ±0:00:00.000060.

So with numpy it took about 4 times longer. But we must remember that this array is extremely small; maybe with bigger arrays numpy would work faster :)

-- EDIT -- I've enlarged the array about 10 times; here are my results:

  • python: 0:00:00.000205
  • numpy: 0:00:00.002710

So maybe using numpy for this task is redundant.
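Part of the gap may be that the list-to-array conversion is being timed along with the filtering. A `timeit` sketch that separates the two (timings will vary by machine; `base`, `np_filter`, etc. are illustrative names):

```python
import timeit
import numpy as np

base = [
    [[10, 4, 5, 6, 7], [11, 1, 2, 3, 4], [11, 5, 6, 7, 8]],  # mixed 0-th values
    [[12, 4, 5, 6, 7], [12, 1, 2, 3, 4], [12, 5, 6, 7, 8]],  # all 0-th values equal
]
arr_list = base * 10_000      # 20,000 sub-arrays
arr_np = np.array(arr_list)   # converted once, up front

def pure_py():
    return [v for v in arr_list if len({x[0] for x in v}) == 1]

def np_filter():
    # filtering only; conversion already paid for
    return arr_np[np.all(arr_np[:, :, 0] == arr_np[:, :1, 0], axis=1)]

def np_convert_and_filter():
    a = np.array(arr_list)    # conversion cost included in the timing
    return a[np.all(a[:, :, 0] == a[:, :1, 0], axis=1)]

for fn in (pure_py, np_filter, np_convert_and_filter):
    print(fn.__name__, timeit.timeit(fn, number=10))
```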

Michael Stachura
  • I think you're timing how long it takes to make a numpy array from a list. In my comparison, the numpy solution is only slower when there are fewer than 10 sub-arrays. You could also make your solution faster by checking for uniqueness in a single line: `if len(set(x[0] for x in v)) == 1: new_array.append(v)` – Alex Nov 28 '21 at 18:59
  • How about if the initial array has four million elements and each sub-array has 16 elements? :) – Jivan Nov 28 '21 at 19:11
  • @Jivan as `(m, n, k)`? I'll post my test code – Alex Nov 28 '21 at 19:14