I have a very large 2D matrix (~10000 x 10000). I want to find the frequency of the most common number in each row, and then select the row where that frequency is lowest.

For example, if my 2d array is

[1, 2, 3, 3, 5]
[1, 1, 1, 2, 1]
[3, 2, 2, 1, 3]
[4, 5, 1, 2, 2]
[3, 5, 6, 7, 8]

The frequency of the most common number in each row is [2, 4, 2, 2, 1]. I want to find index 4 (the last row), because its most common number occurs only once, which is the lowest frequency among all rows.

Right now I am just using for loops, but is there a fast, vectorized approach?

Thank you in advance.

John Doyle
  • Does this answer your question? [get unique count ~and~ unique values on per row basis using numpy](https://stackoverflow.com/questions/37848468/get-unique-count-and-unique-values-on-per-row-basis-using-numpy) – ddejohn Nov 30 '22 at 05:51
  • What is the range of values in your 2D array, i.e. what is the largest number you can expect? Can there be negatives? – Mercury Nov 30 '22 at 07:32

1 Answer

For the general case, your best option is probably scipy.stats.mode, as already pointed out in a comment.

However, if your array contains only non-negative integers and the values are limited in range (say 0-10, 0-100, or 0-1000), then np.bincount is much faster. For example:

import numpy as np
from scipy.stats import mode

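def foo(a):
    # mode(...)[1] holds, for each row, the count of its most common value
    return np.argmin(mode(a, axis=1)[1])

def bar(a):
    # np.bincount needs non-negative integers; its max is the row's top frequency
    return np.argmin([np.bincount(r).max() for r in a])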

# Testing on a 10000x10000 array
a = np.random.randint(0, 1000, (10000, 10000), np.int32)

print(foo(a))
# 491

print(bar(a))
# 491

# Time required:
%timeit -n 5 -r 1 foo(a)
# 6.75 s ± 0 ns per loop (mean ± std. dev. of 1 run, 5 loops each)

%timeit -n 5 -r 1 bar(a)
# 317 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 5 loops each)
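
Since the question asks for a vectorized approach, the per-row loop in bar can in principle be replaced by a single np.bincount call. The following is only a sketch under the same assumption as bar (non-negative integers in a small, known range). The names row_mode_counts and num_values are made up for illustration, and this variant has not been timed against bar.

```python
import numpy as np

def row_mode_counts(a, num_values):
    """Count of the most common value in each row of `a`.

    Assumes `a` holds integers in [0, num_values); `num_values` is a
    hypothetical parameter the caller must supply.
    """
    a = np.asarray(a)
    n_rows = a.shape[0]
    num_values = int(num_values)
    # Shift row i into its own block [i * num_values, (i + 1) * num_values)
    # so that one flat bincount keeps the rows separate.
    offsets = np.arange(n_rows, dtype=np.int64)[:, None] * num_values
    flat = (a + offsets).ravel()
    counts = np.bincount(flat, minlength=n_rows * num_values)
    # One row of counts per input row; its max is that row's top frequency.
    return counts.reshape(n_rows, num_values).max(axis=1)

# The example array from the question
example = np.array([
    [1, 2, 3, 3, 5],
    [1, 1, 1, 2, 1],
    [3, 2, 2, 1, 3],
    [4, 5, 1, 2, 2],
    [3, 5, 6, 7, 8],
])

freq = row_mode_counts(example, num_values=example.max() + 1)
print(freq)             # [2 4 2 2 1]
print(np.argmin(freq))  # 4
```

Note that the temporary flat array is as large as the input and stored as int64 (roughly 800 MB for a 10000 x 10000 array), so memory traffic may cancel out the benefit of removing the Python loop. It is worth measuring on your own data.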
Mercury