numpy: sampling in several dimensions to generate microdata

Question

I have summary-level data: counts of people by age group, city, income and the industry in which they work, so four dimensions in this case.

What I would like to do is generate micro-data from these summary counts. If, say, the summary table shows 10,000 people distributed by gender, race, age and industry, I would like to have 10,000 records which, when summarized, match the original four-dimensional distribution. In short, I would like to sample from four distributions at the same time, each conditional on the values of the others.

Here's what I have:

import numpy as np
import pandas as pd

## generate mock person data
N = 500000

age = np.random.choice(['20-44','45-64','65+'], N)
ind = np.random.choice(['retail','construction','information','medical'], size=N, p=[.05,.15,.3,.5])
cty = np.random.choice(['cooltown','mountain pines'], N)
## bin edges: 0, 50k, ..., 200k, inf (list() is needed on Python 3, where range is not a list)
income = pd.cut(np.random.lognormal(mean=10, sigma=2, size=N), list(range(0, 250000, 50000)) + [np.inf])

## prep data frame
persons = pd.DataFrame({'industry':ind,'city':cty,'income':income,'age':age})

## group by the categoricals
persons_grouped = persons.groupby(['city','industry','age','income']).size()


df_persons_grouped=persons_grouped.reset_index(name='personcount')
df_persons_grouped['personcount']=df_persons_grouped.personcount.div(df_persons_grouped.personcount.sum(),axis=0)
df_persons_grouped.head()

So that is now summarized by the dimensions in question.

To re-generate the original number of records, I do the following:

newdf = df_persons_grouped.loc[np.random.choice(a=df_persons_grouped.index, size=N, p=df_persons_grouped.personcount.tolist())].groupby(['city','industry','age','income']).size()

## I expect the following to produce near-1 values, but they sometimes vary
newdf.div( persons_grouped,axis=0)
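
One plausible reason the ratios drift away from 1 is ordinary sampling noise. Each resampled cell count is binomial, so a cell with share p has a relative standard deviation of roughly sqrt((1-p)/(N*p)), which becomes large for sparsely populated cells. The sketch below illustrates this with a made-up three-cell table; the shares and the helper name `relative_sd` are assumptions for illustration and are not taken from the data above.

```python
import numpy as np

def relative_sd(p, n):
    """Approximate relative standard deviation of a binomial cell count."""
    p = np.asarray(p, dtype=float)
    return np.sqrt((1.0 - p) / (n * p))

n = 500_000
shares = np.array([0.5, 0.01, 0.00002])  # hypothetical cell shares
expected = n * shares                     # expected count per cell
rsd = relative_sd(shares, n)              # theoretical relative spread

# Simulate many resamples; the last category lumps together all other cells.
rng = np.random.default_rng(0)
pvals = np.append(shares, 1.0 - shares.sum())
draws = rng.multinomial(n, pvals, size=2000)[:, :3]
observed_rsd = draws.std(axis=0) / expected
```

Comparing `rsd` with `observed_rsd` shows the spread growing as the cell share shrinks, so ratios far from 1 in the smallest cells are to be expected rather than a sign that the resampling is wrong.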

But the bigger question is whether this approach is kosher for reproducing the "original" record-level data. I just use the counts (as shares) as probabilities, which may differ from sampling from multivariate distributions. Suggestions welcome.
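
For context, drawing row labels with the cell shares as probabilities amounts to sampling from the joint distribution over cells, so whatever dependence the table encodes is preserved in expectation and only the counts fluctuate. If the records must reproduce the table exactly, one alternative (not the approach used above) is to expand each cell deterministically. The following is a minimal sketch under that assumption; the table `summary` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical summary table: one row per cell, with an integer person count.
summary = pd.DataFrame({
    "city": ["cooltown", "cooltown", "mountain pines"],
    "industry": ["retail", "medical", "medical"],
    "personcount": [2, 0, 3],
})

# Repeat each row label `personcount` times, then select those rows.
repeated_labels = summary.index.repeat(summary["personcount"].to_numpy())
micro = summary.loc[repeated_labels, ["city", "industry"]].reset_index(drop=True)

# Re-summarizing recovers the non-zero cells of the original table.
recount = micro.groupby(["city", "industry"]).size()
```

This gives one record per counted person with no sampling noise, at the cost of producing the same microdata every time. Random sampling is still the better fit when the target number of records differs from the table total.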

That doesn't look remotely realistic to me. Do you really think that income is independent of both age and industry? — pjs, Sep 09 '15 at 21:41
@pjs, no that is kind of the point. I do think that there is dependence. That is why I am interested in drawing from joint distributions of membership in each bin. The data is of course made up. — ako, Sep 10 '15 at 00:05
@pjs, it is a little like PopGen but on the cheap. Go from summaries to individual records, respecting marginal distributions. And mainly with discrete variables. No direct n-dimensional covariance here, just (observed sample-based) bins. — ako, Sep 10 '15 at 01:55
