
I have summary-level data: counts of people by age group, city, income and the industry in which they work, so four dimensions in this case.

What I would like to do is generate micro-data from these summary counts. If, say, the summary table shows 10,000 people distributed by gender, race, age and industry, I would like to have 10,000 records which, when summarized, match the original four-dimensional distribution. In short, I would like to sample from four distributions at the same time, each conditional on the values of the others.

Here's what I have:

## generate mock person data
import numpy as np
import pandas as pd

N = 500000

age = np.random.choice(['20-44','45-64','65+'], N)
ind = np.random.choice(['retail','construction','information','medical'], size=N, p=[.05,.15,.3,.5])
cty = np.random.choice(['cooltown','mountain pines'], N)
## income bins of width 50,000 up to 200,000, plus an open-ended top bin
income = pd.cut(np.random.lognormal(mean=10, sigma=2, size=N), list(range(0, 250000, 50000)) + [np.inf])

## prep data frame
persons = pd.DataFrame({'industry':ind,'city':cty,'income':income,'age':age})

## group by the categoricals
persons_grouped = persons.groupby(['city','industry','age','income']).size()


df_persons_grouped=persons_grouped.reset_index(name='personcount')
df_persons_grouped['personcount']=df_persons_grouped.personcount.div(df_persons_grouped.personcount.sum(),axis=0)
df_persons_grouped.head()

The data is now summarized by the dimensions in question, with personcount holding each cell's share of the total rather than a raw count.

To re-generate the original number of records, I do the following:

newdf = df_persons_grouped.loc[np.random.choice(a=df_persons_grouped.index, size=N, p=df_persons_grouped.personcount.tolist())].groupby(['city','industry','age','income']).size()

## I expect the following to produce near-1 values, but they sometimes vary
newdf.div(persons_grouped, axis=0)
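Some variation in those ratios is expected from the resampling itself. Drawing N records with the cell shares as probabilities is a multinomial draw, so a cell with share p has a resampled count with mean N·p and variance N·p·(1−p), and the ratio to its expected count has standard deviation sqrt((1−p)/(N·p)). A minimal sketch of that formula follows; `relative_sd` and the example shares are hypothetical, not taken from the data above.

```python
import numpy as np

def relative_sd(p, n):
    """Standard deviation of (resampled count / expected count) for a cell
    with probability p under n multinomial draws."""
    p = np.asarray(p, dtype=float)
    return np.sqrt((1.0 - p) / (n * p))

# hypothetical cell shares: a large, a small and a very small cell
shares = np.array([0.1, 0.001, 0.00001])
print(relative_sd(shares, 500000))
```

The rarer the cell, the larger the relative spread, so sparsely populated bins (for instance the top income bins) are where ratios far from 1 would be expected.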

But the bigger question is whether this approach is kosher for reproducing the "original" record-level data. I just use the counts (as shares) as probabilities, which may be different from sampling from multivariate distributions. Suggestions welcome.
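For comparison, if the summary table holds integer counts rather than shares, one alternative that avoids sampling noise altogether is to repeat each summary row as many times as its count. The sketch below uses a small hypothetical table (`summary`, with only two dimensions) to keep it short. It is an illustration of the idea, not a claim that it is the right method for this problem.

```python
import numpy as np
import pandas as pd

# small hypothetical summary table holding raw counts (not shares)
summary = pd.DataFrame({
    'city': ['cooltown', 'cooltown', 'mountain pines'],
    'industry': ['retail', 'medical', 'retail'],
    'personcount': [3, 0, 2],
})

# repeat each row label as many times as its count, then select those rows
repeated_index = summary.index.repeat(summary['personcount'].to_numpy())
micro = (summary.loc[repeated_index]
                .drop(columns='personcount')
                .reset_index(drop=True))

# re-summarizing the micro-data recovers the non-zero counts exactly
check = micro.groupby(['city', 'industry']).size()
print(check)
```

Either way, the records carry no information beyond the joint cell counts: the weighted draw reproduces the table only on average, while the repeat reproduces it exactly.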

ako
  • That doesn't look remotely realistic to me. Do you really think that income is independent of both age and industry? – pjs Sep 09 '15 at 21:41
  • @pjs, no that is kind of the point. I do think that there is dependence. That is why I am interested in drawing from joint distributions of membership in each bin. The data is of course made up. – ako Sep 10 '15 at 00:05
  • @pjs, it is a little like PopGen but on the cheap. Go from summaries to individual records, respecting marginal distributions. And mainly with discrete variables. No direct n-dimensional covariance here, just (observed sample-based) bins. – ako Sep 10 '15 at 01:55

0 Answers