Let's say I have a sample of $N$ individuals and a random variable $X$ which represents their annual income in a foreign currency. An example of $X$ could be the following:
15000
11000
9000
4000
4000
3900
3800
3600
3400
1000
900
800
700
700
400
300
300
300
200
100
Now I need to "sample" the 20 components of $X$ into 3 "ordered" sub-groups (not necessarily with the same number of components) so that they have (approximately) the same Gini coefficient.
As a reminder, the Gini coefficient is computed as follows: calculate the share of each income in the total income (e.g. $p_1 = 15000/(15000 + 11000 + \dots)$, $p_2 = 11000/(15000 + 11000 + \dots)$, ..., $p_{20} = 100/(15000 + 11000 + \dots)$), then the cumulative shares (e.g. $c_1 = 0 + p_1$, $c_2 = c_1 + p_2$, ..., $c_{20} = c_{19} + p_{20} = 1$), then the area between the cumulative curve and the diagonal ($A = (c_1 + \dots + c_{20} - 0.5)/20 - 0.5$), and therefore the Gini coefficient $G = 2A$.
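For reference, the recipe above can be written as a small Python sketch. It assumes the incomes are ordered from largest to smallest, as in the example list, and the function name `gini_descending` is just a placeholder of mine.

```python
def gini_descending(incomes):
    """Gini coefficient via the cumulative-share recipe described above."""
    xs = sorted(incomes, reverse=True)  # largest income first, as in the example
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        # Assumption: an empty or all-zero group is treated as perfectly equal.
        return 0.0
    running_share = 0.0      # c_i, the cumulative share after i incomes
    sum_of_cumulative = 0.0  # c_1 + ... + c_n
    for x in xs:
        running_share += x / total       # p_i added to c_(i-1)
        sum_of_cumulative += running_share
    area = (sum_of_cumulative - 0.5) / n - 0.5  # A in the text
    return 2 * area                              # G = 2 * A


X = [15000, 11000, 9000, 4000, 4000, 3900, 3800, 3600, 3400, 1000,
     900, 800, 700, 700, 400, 300, 300, 300, 200, 100]
print(round(gini_descending(X), 3))
```

With this formula, a group where everybody earns the same gets a Gini of 0, and a two-person group where one person earns everything gets 0.5.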
This can easily be done by brute force: divide the sample in 3, calculate the Gini coefficient for the three sub-samples, and try moving the upper and lower components from/to the middle sub-sample to see whether the differences in Gini improve or worsen. However, this is very time-consuming to do manually (in Excel, for example), especially when I have a very big data set.
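To make the brute-force idea concrete, here is a rough Python sketch. It assumes that "ordered" means each sub-group is a contiguous run of the incomes sorted from largest to smallest, so a split is fully determined by two cut points. The names `gini` and `best_three_way_split` and the `min_size` parameter are placeholders of mine, and the "spread" (largest minus smallest Gini) is only one possible way to measure how similar the three coefficients are.

```python
from itertools import combinations


def gini(values):
    """Gini coefficient from cumulative shares, incomes sorted descending."""
    xs = sorted(values, reverse=True)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0  # assumption: empty or all-zero group counts as equal
    running, cumulative_sum = 0.0, 0.0
    for x in xs:
        running += x / total
        cumulative_sum += running
    return 2 * ((cumulative_sum - 0.5) / n - 0.5)


def best_three_way_split(incomes, min_size=2):
    """Try every pair of cut points and keep the split whose three Gini
    coefficients are closest together. Returns (spread, groups, ginis),
    or None if the sample is too small for three groups of min_size."""
    xs = sorted(incomes, reverse=True)
    n = len(xs)
    best = None
    for i, j in combinations(range(1, n), 2):  # cut points with i < j
        # min_size is a hypothetical guard: one-element groups always have
        # a Gini of 0 with this formula, which is not informative.
        if i < min_size or j - i < min_size or n - j < min_size:
            continue
        groups = (xs[:i], xs[i:j], xs[j:])
        ginis = [gini(g) for g in groups]
        spread = max(ginis) - min(ginis)
        if best is None or spread < best[0]:
            best = (spread, groups, ginis)
    return best


X = [15000, 11000, 9000, 4000, 4000, 3900, 3800, 3600, 3400, 1000,
     900, 800, 700, 700, 400, 300, 300, 300, 200, 100]
result = best_three_way_split(X)
if result is not None:
    spread, groups, ginis = result
    for group, g in zip(groups, ginis):
        print(group, round(g, 3))
```

Since every pair of cut points is evaluated and each evaluation recomputes three Gini coefficients, the cost grows roughly with the cube of the sample size, so this is only a baseline and would likely be too slow for a very big data set.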
I suspect there is a more elegant solution. I'm open to both Python and R.
ADDITIONAL DETAILS
The output would be something like this: for $X$,
1     2     3
1500  3900  400
1100  3800  300
9000  3600  300
4000  3400  300
      1000  200
      900   100
      800
      700
      700
for $G$, the actual Gini coefficient of the three sub-groups:
1     2     3
0.4   0.41  0.39