
I am trying to generate a random column of a categorical variable from an existing column, to create some synthesized data. For example, if my column has 3 values 0, 1, 2, with 0 appearing 50% of the time and 1 and 2 appearing 30% and 20% of the time, I want my new random column to have similar (but not identical) proportions.

There is a similar question on Cross Validated that has been solved using R: https://stats.stackexchange.com/questions/14158/how-to-generate-random-categorical-data. However, I would like a Python solution.

Dwarkesh23
  • Possible duplicate of [Pandas Random Weighted Choice](https://stackoverflow.com/questions/45223537/pandas-random-weighted-choice) – help-ukraine-now Aug 09 '19 at 18:34
  • Otherwise please provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/minimal-reproducible-example) – help-ukraine-now Aug 09 '19 at 18:34
  • Perhaps this could help https://stackoverflow.com/questions/55027404/generate-larger-synthetic-dataset-based-on-a-smaller-dataset-in-python – IronMan Aug 09 '19 at 18:35

1 Answer


Use `np.random.choice()` and specify a vector of probabilities corresponding to the array being chosen from:

>>> import numpy as np
>>> np.random.seed(444)
>>> data = np.random.choice(
...     a=[0, 1, 2],
...     size=50,
...     p=[0.5, 0.3, 0.2]
... )
>>> data
array([2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 2, 1, 0, 1,
       1, 1, 0, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 2, 2, 2,
       1, 1, 1, 0, 0, 1])
>>> np.bincount(data) / len(data)    # Proportions
array([0.44, 0.32, 0.24])

As your sample size increases, the empirical frequencies should converge towards your targets:

>>> a_lot_of_data = np.random.choice(
...     a=[0, 1, 2],
...     size=500_000,
...     p=[0.5, 0.3, 0.2]
... )
>>> np.bincount(a_lot_of_data) / len(a_lot_of_data)
array([0.499716, 0.299602, 0.200682])

As noted by @WarrenWeckesser in the comments, if you already have the 1d NumPy array or Pandas Series, you can pass it directly as the input `a` without specifying `p` at all. `np.random.choice()` samples with replacement by default (`replace=True`), so drawing from your original data yields a sample whose distribution approximates that of the input.
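To make that concrete, here is a minimal sketch of sampling directly from an existing column (the column name `col` and the 50/30/20 source data are assumptions for illustration; the question's actual DataFrame is not shown):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(444)

# Hypothetical existing column with roughly 50/30/20 proportions
df = pd.DataFrame({"col": rng.choice([0, 1, 2], size=1000, p=[0.5, 0.3, 0.2])})

# Sample with replacement from the column itself -- no explicit p needed,
# since the empirical distribution of "col" drives the draw
df["synthetic"] = rng.choice(df["col"].to_numpy(), size=len(df))

# The synthesized column's proportions track the original's
print(df["col"].value_counts(normalize=True).sort_index())
print(df["synthetic"].value_counts(normalize=True).sort_index())
```

`df["col"].sample(n=len(df), replace=True)` would work equally well here; the NumPy route just mirrors the answer's approach.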

Brad Solomon
  • If, instead of using `[0, 1, 2]` as the choices, you used the existing column, then you wouldn't need to specify `p`. The distribution of the random choices would naturally follow the distribution of the existing column. – Warren Weckesser Aug 09 '19 at 18:37
  • That's a very good point. I was assuming the probabilities specified should be exact, pinned at the percentages from the question. – Brad Solomon Aug 09 '19 at 18:37