Let's say I have the following 2D array:

import numpy as np

np.random.seed(123)
a = np.random.randint(1, 6, size=(5, 3))

which produces:

In [371]: a
Out[371]:
array([[3, 5, 3],
       [2, 4, 3],
       [4, 2, 2],
       [1, 2, 2],
       [1, 1, 2]])

Is there a more efficient (NumPy, Pandas, etc.) way to calculate the frequency of all pairs of numbers than the following solution?

from collections import Counter
from itertools import combinations

def pair_freq(a, sort=False, sort_axis=-1):
    a = np.asarray(a)
    if sort:
        a = np.sort(a, axis=sort_axis)
    res = Counter()
    for row in a:
        res.update(combinations(row, 2))
    return res

res = pair_freq(a)

to produce something like this:

In [38]: res
Out[38]:
Counter({(3, 5): 1,
         (3, 3): 1,
         (5, 3): 1,
         (2, 4): 1,
         (2, 3): 1,
         (4, 3): 1,
         (4, 2): 2,
         (2, 2): 2,
         (1, 2): 4,
         (1, 1): 1})

or:

In [39]: res.most_common()
Out[39]:
[((1, 2), 4),
 ((4, 2), 2),
 ((2, 2), 2),
 ((3, 5), 1),
 ((3, 3), 1),
 ((5, 3), 1),
 ((2, 4), 1),
 ((2, 3), 1),
 ((4, 3), 1),
 ((1, 1), 1)]

PS: the resulting dataset might look different - for example like a multi-index Pandas DataFrame or something else.

I was trying to increase the dimensionality of the a array and to use np.isin() together with a list of all pair combinations, but I still couldn't get rid of a loop.
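For reference, here is a minimal loop-free sketch of the encoding idea (the helper name pair_freq_unique and the use of np.triu_indices are my own illustration, not code from the question): each (left, right) pair is packed into a single integer so np.unique can count all pairs at once.

```python
import numpy as np

def pair_freq_unique(a):
    # Hypothetical loop-free variant: gather all column pairs via fancy
    # indexing, encode each value pair as one integer, count with np.unique.
    a = np.asarray(a)
    i, j = np.triu_indices(a.shape[1], k=1)   # column pairs (i < j), same
                                              # order as combinations(row, 2)
    left, right = a[:, i].ravel(), a[:, j].ravel()
    M = a.max() + 1                           # base for the integer encoding
    codes, counts = np.unique(M * left + right, return_counts=True)
    return {(c // M, c % M): k for c, k in zip(codes, counts)}

np.random.seed(123)
a = np.random.randint(1, 6, size=(5, 3))
print(pair_freq_unique(a))
```

On the sample array this reproduces the same counts as the Counter-based pair_freq, e.g. (1, 2) appears 4 times.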

UPDATE:

(a) Are you interested only in the frequency of combinations of 2 numbers (and not interested in frequency of combinations of 3 numbers)?

Yes, I'm interested in combinations of pairs (2 numbers) only.

(b) Do you want to consider (3,5) as distinct from (5,3) or do you want to consider them as two occurrences of the same thing?

Actually both approaches are fine - I can always sort my array beforehand if I need to:

a = np.sort(a, axis=1)

UPDATE2:

Do you want the distinction between (a,b) and (b,a) to happen only due to the source column of a and b, or even otherwise? To understand this question, please consider three rows [[1,2,1], [3,1,2], [1,2,5]]. What do you think should be the output here? What should be the distinct 2-tuples and what should be their frequencies?

In [40]: a = np.array([[1,2,1],[3,1,2],[1,2,5]])

In [41]: a
Out[41]:
array([[1, 2, 1],
       [3, 1, 2],
       [1, 2, 5]])

I would expect the following result:

In [42]: pair_freq(a).most_common()
Out[42]:
[((1, 2), 3),
 ((1, 1), 1),
 ((2, 1), 1),
 ((3, 1), 1),
 ((3, 2), 1),
 ((1, 5), 1),
 ((2, 5), 1)]

because it's more flexible. If I want to count (a, b) and (b, a) as the same pair of elements, I can do this:

In [43]: pair_freq(a, sort=True).most_common()
Out[43]: [((1, 2), 4), ((1, 1), 1), ((1, 3), 1), ((2, 3), 1), ((1, 5), 1), ((2, 5), 1)]
MaxU - stand with Ukraine
    **(a)** Are you interested only in the frequency of combinations of 2 numbers (and not interested in frequency of combinations of 3 numbers)? **(b)** Do you want to consider `(3,5)` as distinct from `(5,3)`, or do you want to consider them as two occurrences of the same thing? – fountainhead Mar 10 '19 at 14:37
  • @fountainhead, thank you for your comment! I've extended my question with additional information... – MaxU - stand with Ukraine Mar 10 '19 at 14:42
  • In your sample data, the first row has `(3,5,3)`, which generates the 2-tuples `(3,5)` and `(5,3)`. Your current solution treats these two tuples as distinct, presumably because the `3` in `(3,5)` comes from the first column, and the `3` in `(5,3)` comes from the third column. Do you want to stick with that distinction? Just double-checking despite your edit. – fountainhead Mar 10 '19 at 14:58
  • Related question. The row `[1,2,2]` generates the 2-tuple `(1,2)` twice (since `2` is present in two different columns), and your current solution treats these two 2-tuples as distinct. Is that distinction important? – fountainhead Mar 10 '19 at 15:13
  • @fountainhead, yes, I think counting `(a, b)` and `(b, a)` as distinct pairs would give us more flexibility – MaxU - stand with Ukraine Mar 10 '19 at 15:14
  • @fountainhead. `The row [1,2,2] generates the 2-tuple (1,2) twice (since 2 is present in two different columns), and your current solution treats these two 2-tuples as distinct. Is that distinction important?` - yes, I'd like to count `(1,2)` twice in this case – MaxU - stand with Ukraine Mar 10 '19 at 15:17
  • Do you want the distinction between `(a,b)` and `(b,a)` to happen only due to the source column of `a` and `b`, or even otherwise? To understand this question, please consider three rows `[1,2,1]`, `[3,1,2]`, and `[1,2,5]`. What do you think should be the output here? What should be the distinct 2-tuples and what should be their frequencies? – fountainhead Mar 10 '19 at 15:19
  • I think my last question is tantamount to asking whether you want permutations instead of combinations. So I think you may please ignore that question. – fountainhead Mar 10 '19 at 15:27
  • @fountainhead, I tried to extend my question so that it clarifies your questions in the comments... – MaxU - stand with Ukraine Mar 10 '19 at 15:38

2 Answers

If your elements are not too large nonnegative integers, bincount is fast:

from collections import Counter
from itertools import combinations
import numpy as np

def pairs(a):
    M = a.max() + 1
    a = a.T
    return sum(np.bincount((M * a[j] + a[j+1:]).ravel(), None, M*M)
               for j in range(len(a) - 1)).reshape(M, M)

def pairs_F_3(a):
    M = a.max() + 1
    return (np.bincount(a[1:].ravel() + M*a[:2].ravel(), None, M*M) +
            np.bincount(a[2].ravel() + M*a[0].ravel(), None, M*M))

def pairs_F(a):
    M = a.max() + 1
    a = np.ascontiguousarray(a.T) # contiguous columns (rows after .T)
                                  # typically appear to perform better
                                  # (thanks @ning chen)
    return sum(np.bincount((M * a[j] + a[j+1:]).ravel(), None, M*M)
               for j in range(len(a) - 1)).reshape(M, M)

def pairs_dict(a):
    p = pairs_F(a)
    # p is a 2D table with the frequency of (y, x) at position y, x
    y, x = np.where(p)
    c = p[y, x]
    return {(yi, xi): ci for yi, xi, ci in zip(y, x, c)}

def pair_freq(a, sort=False, sort_axis=-1):
    a = np.asarray(a)
    if sort:
        a = np.sort(a, axis=sort_axis)
    res = Counter()
    for row in a:
        res.update(combinations(row, 2))
    return res


from timeit import timeit
A = [np.random.randint(0, 1000, (1000, 120)),
     np.random.randint(0, 100, (100000, 12))]
for a in A:
    print('shape:', a.shape, 'range:', a.max() + 1)
    res2 = pairs_dict(a)
    res = pair_freq(a)
    print(f'results equal: {res==res2}')
    print('bincount', timeit(lambda:pairs(a), number=10)*100, 'ms')
    print('bc(F)   ', timeit(lambda:pairs_F(a), number=10)*100, 'ms')
    print('bc->dict', timeit(lambda:pairs_dict(a), number=10)*100, 'ms')
    print('Counter ', timeit(lambda:pair_freq(a), number=4)*250,'ms')

Sample run:

shape: (1000, 120) range: 1000
results equal: True
bincount 461.14772390574217 ms
bc(F)    435.3669326752424 ms
bc->dict 932.1215840056539 ms
Counter  3473.3258984051645 ms
shape: (100000, 12) range: 100
results equal: True
bincount 89.80463854968548 ms
bc(F)    43.449611216783524 ms
bc->dict 46.470773220062256 ms
Counter  1987.6734036952257 ms
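Since the question mentions that a multi-index Pandas result would also be acceptable, the 2D table returned by pairs/pairs_F converts naturally. This helper (pairs_to_series is a name I made up; it is a sketch, not part of the timed code above):

```python
import numpy as np
import pandas as pd

def pairs_to_series(p):
    # p is the (M, M) table from pairs()/pairs_F(): p[y, x] holds the
    # frequency of the pair (y, x)
    y, x = np.nonzero(p)                      # positions with nonzero counts
    idx = pd.MultiIndex.from_arrays([y, x], names=['first', 'second'])
    return pd.Series(p[y, x], index=idx, name='freq')

# tiny hand-built table for illustration
p = np.zeros((6, 6), dtype=int)
p[1, 2] = 4
p[4, 2] = 2
p[2, 2] = 2
print(pairs_to_series(p).sort_values(ascending=False))
```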
Paul Panzer

I have an idea; the code is as follows. The biggest disadvantage of my code is that it runs very slowly as the number of columns increases, and it is slower than the code from @Paul Panzer. I apologize to Paul Panzer.

And if you want it to be quicker, just skip the _bit_to_items / _dot_to_items conversion functions, because (1, 1) is equal to 1*2**20 + 1.

import numpy as np
from random import choice
from scipy.sparse import csr_matrix, csc_matrix
from scipy import sparse as sp


c_10 = np.array([[choice(range(1, 10)) for _ in range(3)] for _ in range(1000)])
c_1000 = np.array([[choice(range(1, 1000)) for _ in range(3)] for _ in range(1000)])

def _bit_to_items(num):
    return (num >> 20, num & ((1 << 20) - 1))  # mask the low 20 bits


def unique_bit_shit(c):
    cc = c << 20 # suppose that: 2**20 > max(c)

    dialog_mtx_1 = np.array([[1, 0, 0],
                             [1, 0, 0],
                             [0, 1, 0]])

    dialog_mtx_2 = np.array([[0, 1, 0],
                             [0, 0, 1],
                             [0, 0, 1]])

    dialog_mtx_1 = dialog_mtx_1.T
    dialog_mtx_2 = dialog_mtx_2.T

    pairs = cc.dot(dialog_mtx_1) + c.dot(dialog_mtx_2)
    pairs_num, count = np.unique(pairs, return_counts=True)
    return [(_bit_to_items(num), v) for num, v in zip(pairs_num, count)]



def _dot_to_items(num):
    # 2**20 is 1048576
    return (num // 1048576, num % 1048576)


def unique_dot(c):
    dialog_mtx_3 = np.array([[2**20, 2**20, 0],
                             [1,     0,     2**20],
                             [0,     1,     1]])

    pairs = c.dot(dialog_mtx_3)

    pairs_num, count = np.unique(pairs, return_counts=True)
    return [(_dot_to_items(num), v) for num, v in zip(pairs_num, count)]
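To make the packing concrete, here is a small round-trip check of the 20-bit encoding used above (a sketch I added, not part of the answer): the pair (a, b) is stored as a*2**20 + b, so unpacking is a right shift plus a 20-bit mask.

```python
M = 1 << 20  # 1048576; assumes every value is below 2**20
for a, b in [(1, 1), (999, 3), (0, M - 1)]:
    packed = a * M + b
    unpacked = (packed >> 20, packed & (M - 1))
    assert unpacked == (a, b)
print('round trip ok')
```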
Happy Boy
  • _"when the range of elements in a is very large"_ actually the correct condition would be "if the range is large compared to array size". For example if you increase the number of rows to `10000` in your arrays then bincount is faster. Also, does your code work for anything other than 3 columns? – Paul Panzer Mar 11 '19 at 04:26
  • As you said, my tests don't cover all cases; "my code is quicker" was not precise. – Happy Boy Mar 11 '19 at 04:53
  • @Paul Panzer. The time cost of `np.dot` and `np.unique` increases as the number of columns increases. So `unique_dot` will be very slow with more than 100 columns; with `csc_matrix.dot` it will be quicker, but still slower than your code. – Happy Boy Mar 11 '19 at 09:07