Let's say I have the following 2D array:
import numpy as np
np.random.seed(123)
a = np.random.randint(1, 6, size=(5, 3))
which produces:
In [371]: a
Out[371]:
array([[3, 5, 3],
       [2, 4, 3],
       [4, 2, 2],
       [1, 2, 2],
       [1, 1, 2]])
Is there a more efficient (NumPy, Pandas, etc.) way to calculate the frequency of all pairs of numbers within each row than the following solution?
from collections import Counter
from itertools import combinations
def pair_freq(a, sort=False, sort_axis=-1):
    a = np.asarray(a)
    if sort:
        a = np.sort(a, axis=sort_axis)
    res = Counter()
    for row in a:
        res.update(combinations(row, 2))
    return res
res = pair_freq(a)
to produce something like this:
In [38]: res
Out[38]:
Counter({(3, 5): 1,
         (3, 3): 1,
         (5, 3): 1,
         (2, 4): 1,
         (2, 3): 1,
         (4, 3): 1,
         (4, 2): 2,
         (2, 2): 2,
         (1, 2): 4,
         (1, 1): 1})
or:
In [39]: res.most_common()
Out[39]:
[((1, 2), 4),
 ((4, 2), 2),
 ((2, 2), 2),
 ((3, 5), 1),
 ((3, 3), 1),
 ((5, 3), 1),
 ((2, 4), 1),
 ((2, 3), 1),
 ((4, 3), 1),
 ((1, 1), 1)]
PS: the result may take a different form, for example a MultiIndex Pandas DataFrame or something else.
I tried increasing the dimensionality of the a array and using np.isin() together with a list of all pair combinations, but I still couldn't get rid of the loop.
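For illustration only, here is a minimal sketch of what a loop-free formulation could look like. It does not use np.isin(); instead it swaps in np.triu_indices to pick every column pair and np.unique(..., axis=0, return_counts=True) to count the resulting rows. The name pair_freq_vectorized is hypothetical, the array is hard-coded from the output shown above, and no claim is made here about how its speed compares to the Counter version.

```python
import numpy as np

def pair_freq_vectorized(a, sort=False):
    """Hypothetical loop-free counterpart of pair_freq (a sketch, not benchmarked)."""
    a = np.asarray(a)
    if sort:
        # Sorting each row makes (a, b) and (b, a) count as the same pair
        a = np.sort(a, axis=-1)
    # Column index pairs (i, j) with i < j, in the same order as itertools.combinations
    i, j = np.triu_indices(a.shape[1], k=1)
    # Shape (n_rows * n_column_pairs, 2): one within-row pair per line
    pairs = np.stack([a[:, i], a[:, j]], axis=-1).reshape(-1, 2)
    # Unique pairs (sorted lexicographically) and how often each occurs
    uniq, counts = np.unique(pairs, axis=0, return_counts=True)
    return uniq, counts

# The example array from above, written out explicitly
a = np.array([[3, 5, 3],
              [2, 4, 3],
              [4, 2, 2],
              [1, 2, 2],
              [1, 1, 2]])

uniq, counts = pair_freq_vectorized(a)
# Convert to a plain dict to compare against the Counter result
freq = {tuple(p): int(c) for p, c in zip(uniq.tolist(), counts)}

uniq_s, counts_s = pair_freq_vectorized(a, sort=True)
freq_sorted = {tuple(p): int(c) for p, c in zip(uniq_s.tolist(), counts_s)}
```

Note that the pairs come back in lexicographic order, not in first-seen order as with Counter, so only the counts are directly comparable.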
UPDATE:
(a) Are you interested only in the frequency of combinations of 2 numbers (and not interested in frequency of combinations of 3 numbers)?
Yes, I'm interested in combinations of pairs (2 numbers) only.
(b) Do you want to consider (3,5) as distinct from (5,3) or do you want to consider them as two occurrences of the same thing?
Actually, both approaches are fine. I can always sort my array beforehand if I need to:
a = np.sort(a, axis=1)
UPDATE2:
Do you want the distinction between (a,b) and (b,a) to happen only due to the source column of a and b, or even otherwise? To understand this question, please consider three rows [[1,2,1], [3,1,2], [1,2,5]]. What do you think should be the output here? What should be the distinct 2-tuples and what should be their frequencies?
In [40]: a = np.array([[1,2,1],[3,1,2],[1,2,5]])
In [41]: a
Out[41]:
array([[1, 2, 1],
       [3, 1, 2],
       [1, 2, 5]])
I would expect the following result:
In [42]: pair_freq(a).most_common()
Out[42]:
[((1, 2), 3),
 ((1, 1), 1),
 ((2, 1), 1),
 ((3, 1), 1),
 ((3, 2), 1),
 ((1, 5), 1),
 ((2, 5), 1)]
because it's more flexible: if I want to count (a, b) and (b, a) as the same pair of elements, I could do this:
In [43]: pair_freq(a, sort=True).most_common()
Out[43]: [((1, 2), 4), ((1, 1), 1), ((1, 3), 1), ((2, 3), 1), ((1, 5), 1), ((2, 5), 1)]