Use np.random.choice() and specify a vector of probabilities corresponding to the array being chosen from:
>>> import numpy as np
>>> np.random.seed(444)
>>> data = np.random.choice(
... a=[0, 1, 2],
... size=50,
... p=[0.5, 0.3, 0.2]
... )
>>> data
array([2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 2, 1, 0, 1,
1, 1, 0, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 2, 2, 2,
1, 1, 1, 0, 0, 1])
>>> np.bincount(data) / len(data) # Proportions
array([0.44, 0.32, 0.24])
As your sample size increases, the empirical frequencies should converge towards your targets:
>>> a_lot_of_data = np.random.choice(
... a=[0, 1, 2],
... size=500_000,
... p=[0.5, 0.3, 0.2]
... )
>>> np.bincount(a_lot_of_data) / len(a_lot_of_data)
array([0.499716, 0.299602, 0.200682])
As noted by @WarrenWeckesser, if you already have a 1d NumPy array or Pandas Series, you can use that directly as the input without specifying p. The default of np.random.choice() is to sample with replacement (replace=True), so by passing your original data, the resulting distribution should approximate that of the input.
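A short sketch of that approach, using a small made-up sample (the `original` array below is hypothetical stand-in data, not from the question):

```python
import numpy as np

np.random.seed(444)

# Hypothetical existing data with empirical proportions 0.5, 0.3, 0.2
original = np.array([0, 0, 0, 1, 1, 2, 0, 1, 0, 2])

# No `p` given: each element of `original` is equally likely per draw,
# and sampling is with replacement by default (replace=True), so the
# resampled values mirror the empirical frequencies of the input.
resampled = np.random.choice(original, size=100_000)

print(np.bincount(original) / len(original))    # input proportions
print(np.bincount(resampled) / len(resampled))  # should be close to the above
```

This is essentially a bootstrap resample: with a large enough size, the resampled proportions converge to the input's empirical distribution.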