The problem is simple: I have a vector of indices from which I want to extract a randomly chosen subset and its complement. I wrote the following code:

import numpy as np    
vec = np.arange(0,25000)
idx = np.random.choice(vec,5000)
idx_r = np.delete(vec,idx)

However, when I print the lengths of vec, idx, and idx_r, they do not add up: the sum of len(idx) and len(idx_r) is greater than len(vec). For example, the following code:

print(len(idx))
print(len(idx_r))
print(len(idx_r)+len(idx))
print(len(vec))

returns:

5000
20462
25462
25000

Python version is 3.8.1 and GCC is 9.2.0.

Yule Vaz

1 Answer

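np.random.choice takes a keyword argument, replace, whose default value is True, so the same index can be drawn more than once. np.delete removes each distinct index only once, which is why idx_r ends up longer than expected. Setting replace=False gives the desired result: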

import numpy as np

vec = np.arange(0, 25000)

idx = np.random.choice(vec, 5000, replace=False)

idx_r = np.delete(vec, idx)

print([len(item) for item in (vec, idx, idx_r)])

Out:

[25000, 5000, 20000]
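
To see where the extra elements come from, here is a small sketch that counts the distinct indices drawn with replacement. The seed and the name n_unique are only illustrative assumptions, and the exact counts will differ from the 20462 in the question:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
vec = np.arange(0, 25000)

# Sampling with replacement can pick the same index more than once.
idx = rng.choice(vec, 5000, replace=True)
n_unique = len(np.unique(idx))

# np.delete removes each distinct position once, however often it is listed.
idx_r = np.delete(vec, idx)

# len(idx_r) equals len(vec) minus the number of distinct indices, not minus 5000.
print(len(idx), n_unique, len(idx_r))
```

The gap between 5000 and n_unique is exactly the surplus seen in len(idx) + len(idx_r).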

However, numpy.random.choice with replace=False is inefficient: for backward-compatibility reasons, the legacy implementation generates a permutation of the whole input just to take a small sample. Use the newer Generator API instead, which does not have this issue:

rng = np.random.default_rng()

idx = rng.choice(vec, 5000, replace=False)
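
Note that np.delete(vec, idx) treats idx as positions, which only coincides with the values here because vec is 0..24999. A minimal sketch, assuming you want the split to work for any 1-D array, samples positions with the Generator API and builds the complement with a boolean mask (pos, mask, and the seed are illustrative names and choices, not part of the original code):

```python
import numpy as np

rng = np.random.default_rng(42)  # seed is arbitrary, for reproducibility only
vec = np.arange(0, 25000)

# Draw 5000 distinct positions into vec (no repeats).
pos = rng.choice(len(vec), 5000, replace=False)

# Mark the sampled positions, then split vec into sample and complement.
mask = np.zeros(len(vec), dtype=bool)
mask[pos] = True
idx = vec[mask]      # the sampled elements (in original order)
idx_r = vec[~mask]   # everything that was not sampled

print([len(item) for item in (vec, idx, idx_r)])
```

Because the sample and its complement come from the same mask, their lengths always sum to len(vec).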
user2357112
dmmfll
  • You're welcome. I'm just learning Numpy myself. Thanks for posting this. I didn't know about the methods you are using until now. Please mark it as the correct answer if it solved your issue. – dmmfll May 02 '20 at 21:07