I have what is essentially allocation problem.
What I have: I have observations of small geographic areas, like census tracts. For each, I have the count of people in four different age groups. Each tract belongs to a subregion.
Now, I know the small area distribution is not entirely correct, because I know the correct distribution--at a higher level of aggregation, the subregion level, and the finer tract-level data, when summed, shows group totals that are different.
What I would like to have: I would like to adjust my tract-level, disaggregate distribution across four groups so it is consistent with a summary-level distribution across those four groups known to be correct, but retain the signals of the tract-level distribution--i.e. adjust it based on more coarse data, but not throw it out the window.
What I would like to do, then, is shift the tract-level population counts on the margins, meeting the following criteria, with the first two being most important (I realize there are tradeoffs with respect to meeting all of these):
- it should match, when aggregated, the subregional totals.
- adjustment should not change tract level population.
- the existing spatial distribution should not be materially changed, but just me marginally adjusted per the new subregional totals
- adjustments should ideally be equitable--i.e. adjustments should not just be on a few records, but be more distributed within each region.
Below follows mock data and placeholder code:
First, the small-area data:
n=1000
np.random.seed(123)
df_small_area_scale = pd.DataFrame(data={
'grp1':np.random.randint(10,250,n),
'grp2':np.random.randint(10,250,n),
'grp3':np.random.randint(10,250,n),
'grp4':np.random.randint(10,250,n),
'subregion': np.random.choice(['A', 'B', 'C', 'D', 'E'],n),
'tract_id':range(1000)}).set_index(['subregion','tract_id'])
df_small_area_scale.head()
grp1 grp2 grp3 grp4
subregion tract_id
B 0 119 85 11 19
D 1 136 100 46 239
A 2 76 26 198 109
B 3 230 180 84 222
A 4 108 101 222 244
And, aggregating this by subregion
we get this:
df_small_area_scale.groupby(level=0).sum()
grp1 grp2 grp3 grp4
subregion
A 27241 27050 27471 26215
B 26507 24696 23315 24857
C 27474 28871 28882 28743
D 26671 26163 25077 27612
E 22739 23077 23797 24473
(And let's get the target shares for each subregion in each group)
summary_area_scale_shares = summary_area_scale.stack().groupby(level=0).apply(lambda x: x/float(x.sum()))
summary_area_scale_shares.head()
subregion
A grp1 0.244444
grp2 0.266667
grp3 0.244444
grp4 0.244444
B grp1 0.255319
dtype: float64
Second, what the small area data should sum to, at the subregional level.
These are subregional "known" distributions. It is these I would like the tract-level data to be adjusted to, such that when tracts are aggregated, they match, roughly, these regional totals in each group. Specifically, grp4
in subregion A
sums to 26,215, but per the target, it should be 22,000, so tracts in subregion A should see persons re-classified from grp4
to some of the other groups.
summary_area_scale = pd.DataFrame(data={'grp1':[22000,24000,21000,25000,28000],
'grp2':[24000,22000,26000,20000,28000],
'grp3':[22000,24000,21000,25000,28000],
'grp4':[22000,24000,21000,25000,28000],
'subregion':list('ABCDE')}).set_index('subregion')
summary_area_scale
grp1 grp2 grp3 grp4
subregion
A 22000 24000 22000 22000
B 24000 22000 24000 24000
C 21000 26000 21000 21000
D 25000 20000 25000 25000
E 28000 28000 28000 28000
One idea is to sample tracts within each subregion and then move people in some proportion to the overall number of people needing to be moved from one bin to another, although I am not sure if there is a clever way of doing it meeting the criteria above.
What is causing me problems is mainly this identifying a way of reallocating people across groups to match the subregional total while maintaining the record-level totals and not completely throwing away the pre-existing spatial distribution, which I want to keep as a signal (but adjusted to a now known different overall distribution).
Any ideas as to how to, in general terms, make a detail distribution fit a more aggregate one, beyond just sampling tracts and moving x people from grp4 -> grp3
, grp2 -> grp1
and whatever the difference is between the existing and the target distributions?
Placeholder code
This function is largely a lookup at a table with regional shares in each group, pushing that distribution to each tract, so it doesn't do anything but set up the data bindings.
def some_redistribution_algorithm(df):
"""
how many persons need to be moved across groups in each subregion?
minimal solution is to just take those shifts and apply uniformly
tracts keep the same counts, but the *distribution* across bins will change slightly
Quality criteria for algorithm:
- switch population at tract level such that
- tract-level population counts maintained
- Pre- and post-adjustment spatial distribution be largely unchanged
- change is not disproportional / dramatically impacting some tracts over others
(i.e. a tract with 10 grp4 population losing 8 would lose 80%, while a tract with 100 q4 hhs would lose 8%)
"""
adjustments = summary_area_scale.xs(df.name)
size = (adjustments).apply(lambda x: abs(x)).loc['grp4'].astype(np.int64)/df.shape[0]
print "Processing %s (%s tracts), beg. pop: %s, avg pop to move (here q4) %s" %(df.name,df.shape[0],
df.sum().loc['grp4'].astype(np.int64),size)
print 'Average pop per tract:'
print df.sum()/df.shape[0]
## tract-level distribution, if all tracts had the same distribution within each subregion (placeholder)
return df_small_area_scale.xs(df.name).mul(summary_area_scale_shares.unstack().xs(df.name),axis=1)
#samplerows= np.random.choice(a=df.index, size=size)
#df.loc[samplerows,:] = df.loc[samplerows,:]#, p=df.totalshare.tolist()),:]
df_small_area_scale.groupby(level=0).apply(some_redistribution_algorithm)