
Let's say I have a sample of N individuals and a random variable X which represents their annual income in a foreign currency. An example of X could be the following:

15000
11000
9000
4000
4000
3900
3800
3600
3400
1000
900
800
700
700
400
300
300
300
200
100

Now I need to "sample" the 20 components of X into 3 "ordered" sub-groups (not necessarily with the same number of components) so that they have (approximately) the same Gini coefficient.

As a reminder for the Gini coefficient: with the incomes sorted in descending order, calculate the share of each income over the total income (e.g. p1=15000/(15000+11000+...), p2=11000/(15000+11000+...), ..., p20=100/(15000+11000+...)), then the cumulative shares (c1=p1, c2=p1+p2, ..., c20=c19+p20=1), then the area between the cumulative curve and the diagonal, A=(c1+...+c20-0.5)/20-0.5, and finally the Gini G=2*A.
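The recipe above can be sketched in Python (the question allows both languages). The helper name `gini` is just illustrative, and the data is the sample from the question:

```python
def gini(incomes):
    """Gini coefficient via the cumulative-share (trapezoid) recipe
    described above: sort descending, accumulate income shares, then
    take twice the area between the cumulative curve and the diagonal."""
    xs = sorted(incomes, reverse=True)
    n, total = len(xs), sum(xs)
    c, cum_sum = 0.0, 0.0
    for x in xs:
        c += x / total        # c_i = p_1 + ... + p_i
        cum_sum += c          # c_1 + ... + c_i so far
    a = (cum_sum - 0.5) / n - 0.5   # area between cumulative curve and diagonal
    return 2 * a

x = [15000, 11000, 9000, 4000, 4000, 3900, 3800, 3600, 3400,
     1000, 900, 800, 700, 700, 400, 300, 300, 300, 200, 100]
g = gini(x)
```

With the trapezoid correction this gives the same value as the standard rank-based Gini formula (and as `ineq::Gini` in R).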

This can easily be done by brute force: divide the sample into 3, calculate the Gini for the three samples, and try moving components between the middle sample and the upper and lower ones to see whether the differences in Gini improve or worsen. However, this is very time-consuming to do manually (in Excel, for example), especially when I have a very big data set.

I suspect there is a more elegant solution. I'm open to both Python and R.
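If the sub-groups must stay contiguous in the ordered data, one straightforwardly automatable version of the brute force is to try every pair of cut points and keep the best. A minimal sketch in Python, assuming exactly that contiguity requirement (`gini` and `best_ordered_split` are illustrative names, not from any library):

```python
from statistics import stdev

def gini(xs):
    """Rank-based Gini coefficient (same value as ineq::Gini in R)."""
    xs = sorted(xs)
    n, total = len(xs), sum(xs)
    rank_sum = sum(i * x for i, x in enumerate(xs, start=1))
    return (2 * rank_sum / total - (n + 1)) / n

def best_ordered_split(x):
    """Exhaustively try every pair of cut points (d, u) on the ordered
    data: group 1 = x[:d], group 2 = x[d:u], group 3 = x[u:] (each with
    at least 2 members).  Keep the split whose three Gini coefficients
    have the smallest standard deviation."""
    n, best = len(x), None
    for d in range(2, n - 3):
        for u in range(d + 2, n - 1):
            g = [gini(x[:d]), gini(x[d:u]), gini(x[u:])]
            s = stdev(g)
            if best is None or s < best[0]:
                best = (s, x[:d], x[d:u], x[u:])
    return best

x = [15000, 11000, 9000, 4000, 4000, 3900, 3800, 3600, 3400,
     1000, 900, 800, 700, 700, 400, 300, 300, 300, 200, 100]
s, g1, g2, g3 = best_ordered_split(x)
```

For 3 groups this is only O(n²) pairs of cut points, so the exhaustive scan stays cheap even for moderately large n.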

ADDITIONAL DETAILS: The output would be something like this. For X:

        1         2         3 
     1500      3900       400
     1100      3800       300
     9000      3600       300
     4000      3400       300
               1000       200
                900       100
                800
                700
                700

for G, the actual Gini coefficients of the three subgroups:

        1         2         3 
      0.4      0.41      0.39 
toyo10
    can you write your formulas correctly? – Onyambu Jul 03 '18 at 05:38
  • @Onyambu give me a hint please. I can't figure out what's missing. – toyo10 Jul 03 '18 at 05:56
  • This is interesting, I am giving it a go. Out of curousity, why do you want to do this? – Peter Ellis Jul 03 '18 at 05:58
  • first you have `prop.table(X)` now, do you need to group in 3or in 2? Also what is the formula for A?? `A=c1+c2+...+c20-0,5` are you summing then you subtract 0? it doesn't make sense to subtract zero.. also what is the `,5` about?? Also in the denominator `/(20)-0,5)` I do not understand what `0,5` is or maybe you mean `0.5`?? also if it is `0.5` do you divide then subtract or first subtract then divide? Can you please write your formula correctly? – Onyambu Jul 03 '18 at 06:07
  • @Onyambu sorry I misplaced `,` instead of `.`. The Italian system doesn't help... – toyo10 Jul 03 '18 at 06:14
  • @PeterEllis just an idea for clustering data. Thank you for your help, I'm gonna have a try later and accept the answer – toyo10 Jul 03 '18 at 06:18

2 Answers


OK, here's a method in R that at least automates the brute force. It tries 1,000 different random assignments of individuals to three groups and picks the one for which the Gini coefficients have the lowest standard deviation. It works well, and practically instantly, with your toy dataset.

library(ineq)

x <-c(1500, 1100, 9000, 4000, 4000, 3900, 3800, 3600, 3400,
      1000, 900, 800, 700, 700, 400, 300, 300, 300, 200, 100)

Gini(x)
# 0.534

n <- length(x)


best_sd <- 1

for(i in 1:1000){
  grouping <- sample(1:3, n, replace = TRUE)
  ginis <- tapply(x, grouping, Gini)
  s <- sd(ginis)
  if(s < best_sd){
    best_sd <- s
    best_grouping <- grouping
    best_i <- i}
}

best_sd
# 0.000891497

tapply(x, best_grouping, Gini)
#         1         2         3 
# 0.5052780 0.5042017 0.5035088 

It's not guaranteed to find the best grouping, but it obviously gets fairly close. A more elegant solution would find ways of picking and choosing which points to swap as it gets close, but that would probably slow it down computationally, and would certainly take much more developer time!

With a larger dataset of 100,000 observations it still takes only 12 seconds on my laptop, so it scales up OK.

Peter Ellis
  • As an interesting addendum to this, the Gini coefficient of the three groups will be different each time. So running it a second time they converged on 0.428 (not 0.505); then next time on 0.468. – Peter Ellis Jul 03 '18 at 06:13
  • Something strange happens when I inspect `best_grouping` to actually see the groups: `[1] 3 3 3 1 1 3 2 3 3 3 2 3 1 3 1 3 1 1 3 2`. If I understand it correctly, it means that `1500` must be in group #3, `1100` in group #3, and so on, right? If so, it seems the code is balancing the clusters by putting a very low value of `300` into the "very high values cluster" (#3) as well. Group #1 seems to include both high values and low values. That's pretty strange. Since I have already ordered the data, the clusters should follow that order. – toyo10 Jul 03 '18 at 15:09
  • I don't see why you think that's strange? There are going to be many, many ways to create clusters with similar Gini coefficients to each other; some versions will be similar to each other in other ways and similar to the main population, while other versions will make odd little enclaves that are completely different from each other (e.g. different mean and variance) while still having the same Gini coefficient. – Peter Ellis Jul 04 '18 at 05:11

It's not very polite to answer one's own question, but I think this is worth sharing. This is what I wrote in R, taking inspiration from Peter Ellis's answer above. Any comments/improvement ideas are welcome:

library(ineq)
x <-c(15000, 11000, 9000, 4000, 4000, 3900, 3800, 3600, 3400,
      1000, 900, 800, 700, 700, 400, 300, 300, 300, 200, 100)
n <- length(x)

best_sd <- 1
for(d in 2:(n-2)) for(u in (d+1):(n-1)){
  # d and u are the cut points: groups are x[1:d], x[(d+1):u], x[(u+1):n].
  # Note the parentheses: in R, 2:n-2 means (2:n)-2 and x[d+1:u] means x[d+(1:u)],
  # so the unparenthesised version indexes the wrong elements.
  g <- c(Gini(x[1:d]), Gini(x[(d+1):u]), Gini(x[(u+1):n]))
  s <- sd(g) 
  if(s < best_sd){
    best_sd <- s
    best_grouping <- c(d,u)
    best_g <- g
  }
}

best_sd
#[1] 0.005250825
best_grouping
#[1]  9 11
best_g
#[1] 0.3046409 0.3144654 0.3127660
toyo10