Find intersecting values in multiple numpy arrays

Question

I have 100 large arrays of more than 250,000 elements each. I want to find values that these arrays have in common. I know that no value will be found in all 100 arrays, but a small number of values will be found in multiple arrays (I suspect 10-30%). I want to find which values occur with the highest frequency across these arrays. (Side point: the arrays have no duplicates.)

I know that I can loop through the arrays and eventually find them, but that will take a while. I also know about the np.intersect1d function, but that only gives values that are found in all of the arrays, whereas I'm looking for values that are only going to be in around 20 of the 100 arrays.

My best bet is to use the np.intersect1d function and loop through all possible combinations of the arrays, which would definitely take a while, but not as long as simply looping through all 250,000 x 100 values. Example:

array_1 = array([1.98, 2.33, 3.44, ..., 11.1])
array_2 = array([1.26, 1.49, 4.14, ..., 9.0])
array_3 = array([1.58, 2.33, 3.44, ..., 19.1])
array_4 = array([4.18, 2.03, 3.74, ..., 12.1])
.
.
.
array_100 = array([1.11, 2.13, 1.74, ..., 1.1])

No value appears in all 100 arrays. Is there a value that can be found in, say, 30 different arrays?

Are all the arrays the same size? Can you have one large 250k x 100 array? — Mad Physicist, Oct 24 '18 at 03:37

Mad Physicist · Accepted Answer · 2018-10-24T03:49:15.827


You can either use np.unique with the return_counts keyword, or a vanilla Python Counter.

The first option works if you can concatenate your arrays into a single 250k x 100 monolith, or even string them out one after the other:

unq, counts = np.unique(monolith, return_counts=True)
ind = np.argsort(counts)[::-1]
unq = unq[ind]
counts = counts[ind]

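This will leave you with two arrays: unq, containing all the unique values, and counts, containing the frequency with which each occurs, both sorted from most to least frequent. Since the arrays have no duplicates, each count is the number of arrays the value appears in.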
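As a minimal sketch of this first option, assume the arrays are held in a list. The name `arrays`, the toy values, and the threshold `min_arrays` are all made up for illustration. `np.concatenate` strings the arrays out one after the other, and the last line keeps only the values found in at least `min_arrays` of them:

```python
import numpy as np

# Toy stand-ins for the real arrays (hypothetical data, no duplicates within an array).
arrays = [
    np.array([1.98, 2.33, 3.44, 11.1]),
    np.array([1.26, 1.49, 4.14, 9.0]),
    np.array([1.58, 2.33, 3.44, 19.1]),
    np.array([4.18, 2.33, 3.74, 12.1]),
]

# String the arrays out end to end; they do not need to be the same length.
monolith = np.concatenate(arrays)

# Unique values and how many times each occurs in the monolith.
unq, counts = np.unique(monolith, return_counts=True)

# Sort by descending count; the stable sort keeps ties in ascending value order.
ind = np.argsort(-counts, kind="stable")
unq = unq[ind]
counts = counts[ind]

# Keep only the values found in at least `min_arrays` arrays (assumed threshold).
min_arrays = 2
common = unq[counts >= min_arrays]
print(common, counts[: len(common)])
```

Note that this relies on exact equality of the floating-point values, just as np.intersect1d does.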

If the arrays have to remain separate, use collections.Counter to accomplish the same task. In the following, I assume that you have a list containing your arrays. It would be very pointless to have a hundred individually named variables:

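from collections import Counter

c = Counter()
for arr in arrays:
    # Each array has no duplicates, so every value adds one to its count.
    c.update(arr)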

Now c.most_common() will give you the elements and their counts, ordered from most to least common. Pass a number, as in c.most_common(10), to get only the top few.
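Here is a small sketch of the Counter route on toy data. The list `arrays`, its values, and the threshold `min_arrays` are hypothetical. Calling `tolist()` before updating is my own choice, not part of the answer above: it makes the counter's keys plain Python floats rather than NumPy scalars.

```python
from collections import Counter

import numpy as np

# Toy stand-ins for the real arrays (hypothetical data, no duplicates within an array).
arrays = [
    np.array([1.98, 2.33, 3.44, 11.1]),
    np.array([1.26, 1.49, 4.14, 9.0]),
    np.array([1.58, 2.33, 3.44, 19.1]),
    np.array([4.18, 2.33, 3.74, 12.1]),
]

c = Counter()
for arr in arrays:
    # tolist() converts the elements to plain Python floats before counting.
    c.update(arr.tolist())

# The two most common values as (value, count) pairs.
top = c.most_common(2)

# All values found in at least `min_arrays` arrays (assumed threshold).
min_arrays = 2
frequent = {value: n for value, n in c.items() if n >= min_arrays}
print(top, frequent)
```

The arrays stay separate here and can have different lengths, since each one is fed to the counter on its own.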

edited Oct 24 '18 at 03:49

answered Oct 24 '18 at 03:38


Great Idea! Thank you for your response, I'll give it a try! – Danny Rabiz Oct 24 '18 at 03:42
@Danny. Updated – Mad Physicist Oct 24 '18 at 03:49
The proper way to thank is by selecting the answer by clicking on the check mark next to it. – Mad Physicist Oct 24 '18 at 04:19
