
I'm having difficulty calculating the Gini coefficient from binned census data, and would really appreciate any help.

My data looks a little something like this (but with 14,000 observations of 13 variables).

location <- c('A','B','C', 'D', 'E', 'F')  
no_income <- c(20, 1, 40, 79, 12, 2)
income1 <- c(13, 4, 56, 17, 9, 4)
income2 <- c(27, 39, 49, 12, 19, 0)
income3 <- c(0, 1, 4, 3, 27, 0)

df <- data.frame(location, no_income, income1, income2, income3)

So for each observation there is a location given, and then a series of columns indicating how many households in the area earn within the given income bracket (so for location A, 20 households earn $0, 13 earn income1, 27 income2, and 0 income3).

I've created an empty column to return the results to:

df$gini = 0

I've then created a numeric vector (x) containing the income amount I want to use for each income bin:

x <- c(0, 300, 1000, 2000)

I've been trying to use the gini function within the reldist package, and have written the following for loop to cycle through each row of the data, apply the gini function and return the output to a new column.

for (i in 1:nrow(df)){ 
     w <- df[i,2:5] 
     df$gini <- gini(x, w=rep(1, length=length(x)))
     }

The problem is that the output returned is currently identical for each row, which is obviously not correct. I'm relatively new to this though, and not sure what I'm doing wrong...

Sarlo

1 Answer


R vectorises many operations, so a loop is often unnecessary. Here you do need to iterate over the rows, because gini() takes one vector of incomes (and one vector of weights) at a time. You also rarely need to initialise a container for the results beforehand.

Here's a working example using apply to loop over the rows:

# setup
install.packages("reldist")
library(reldist)

# dummy data
df = data.frame(ID=letters,
    Bin1=rpois(26, 3),
    Bin2=rpois(26, 8),
    Bin3=rpois(26, 1))

inc = c(0, 300, 1000)

# new column with gini
df$gini = apply(df[, 2:4], 1, function(i){
    gini(inc, i)
})
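Applied to the sample data from the question, the same pattern might look like the sketch below. It assumes the four count columns sit in positions 2 to 5 and that x gives the representative income for each bin in the same order. The check for rows with no income at all is an assumption added here, not part of the original answer.

```r
# install.packages("reldist")  # run once if the package is missing
library(reldist)

# Sample data from the question: household counts per income bin
df <- data.frame(
  location  = c("A", "B", "C", "D", "E", "F"),
  no_income = c(20, 1, 40, 79, 12, 2),
  income1   = c(13, 4, 56, 17, 9, 4),
  income2   = c(27, 39, 49, 12, 19, 0),
  income3   = c(0, 1, 4, 3, 27, 0)
)

# Representative income for each bin, in column order
x <- c(0, 300, 1000, 2000)

# One Gini per row: the incomes are fixed, the row's counts are the weights
df$gini <- apply(df[, 2:5], 1, function(w) {
  # Assumption: return NA when a row has no income at all (0/0 otherwise)
  if (sum(w * x) == 0) return(NA_real_)
  gini(x, weights = w)
})

df[, c("location", "gini")]
```

Because apply() returns one value per row, the result can be assigned straight to the new column without creating it first.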

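Worth noting that gini() defaults the weights argument to rep(1, length=length(x)), so if equal weights are what you want you don't need to supply it. In the question's loop, w was computed but never passed to gini(): the call used equal weights on the same x every time and assigned the result to the whole gini column, which is why every row came out identical.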
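To see what using household counts as weights means, here is a base-R sketch that computes the Gini coefficient from the Lorenz curve. The helper gini_from_counts() is hypothetical and written for illustration only; it is not reldist's implementation. The check at the end shows that weighting each income by its count gives the same answer as repeating each income once per household.

```r
# Hypothetical helper (not part of reldist): Gini from the Lorenz curve,
# treating w as the number of households at each income level.
gini_from_counts <- function(income, w) {
  o <- order(income)
  income <- income[o]
  w <- w[o]
  p <- cumsum(w) / sum(w)                    # cumulative share of households
  L <- cumsum(w * income) / sum(w * income)  # cumulative share of income
  # Trapezoid area under the Lorenz curve, starting from (0, 0)
  area <- sum(diff(c(0, p)) * (c(0, L[-length(L)]) + L) / 2)
  1 - 2 * area
}

income <- c(0, 300, 1000, 2000)
counts <- c(20, 13, 27, 0)  # location A from the question

# Counts used as weights
weighted <- gini_from_counts(income, counts)

# Same data expanded to one entry per household, each with weight 1
expanded <- gini_from_counts(rep(income, times = counts),
                             rep(1, sum(counts)))

all.equal(weighted, expanded)
```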

Edit: I've added inclusion of weights, based on what I read in the manual: https://cran.r-project.org/web/packages/reldist/reldist.pdf.

MikeRSpencer
  • Thank you - this works perfectly, and also really helped me to understand what I was doing wrong and why. This is why I love this site - you get help with the immediate problem, but in a way that means you also get to continuously improve your knowledge and understanding. – Sarlo Dec 03 '15 at 09:33
  • Updating my earlier comment: Is there a way to include the weights in this? The columns in my data contain the number of people who have each income (so the weight) not the income itself... – Sarlo Dec 03 '15 at 10:25
  • Perhaps you could edit your question to include some sample data? Happy to help, it's also good for learning! – MikeRSpencer Dec 03 '15 at 15:23
  • Sample data added. Thanks again. – Sarlo Dec 04 '15 at 09:06
  • I've not tested it, but that should give you enough to get up and running. – MikeRSpencer Dec 04 '15 at 11:46
  • Thanks - that works. And I didn't know you could wrap inside a function like that, so really useful to learn that as well! – Sarlo Dec 04 '15 at 14:03
  • Wrapping a function in an apply call (if you're on a Linux machine and possibly Apple) is the easiest way to get R to run in parallel; e.g. `library(parallel)`, `mclapply(input, mc.cores=4, function(i){whatever})` – MikeRSpencer Dec 04 '15 at 15:39
  • Thanks! Very helpful to know. – Sarlo Dec 07 '15 at 07:22
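The mclapply() suggestion in the comments could be sketched as follows. The gini_counts() helper is a hypothetical base-R stand-in for the per-row work (it assumes the incomes are already sorted in ascending order), and the core count of 2 is arbitrary. mclapply() relies on forking, which is not available on Windows, so the sketch falls back to a single core there.

```r
library(parallel)

# Hypothetical stand-in for the per-row calculation:
# Gini from the Lorenz curve, with incomes sorted ascending
gini_counts <- function(income, w) {
  p <- cumsum(w) / sum(w)
  L <- cumsum(w * income) / sum(w * income)
  1 - sum(diff(c(0, p)) * (c(0, head(L, -1)) + L))
}

income <- c(0, 300, 1000, 2000)
counts <- data.frame(
  no_income = c(20, 1, 40),
  income1   = c(13, 4, 56),
  income2   = c(27, 39, 49),
  income3   = c(0, 1, 4)
)

# Forking is unavailable on Windows, where mc.cores must be 1
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One task per row; mclapply returns a list, so flatten it afterwards
res <- mclapply(seq_len(nrow(counts)), function(i) {
  gini_counts(income, unlist(counts[i, ]))
}, mc.cores = cores)

counts$gini <- unlist(res)
```

For a calculation this cheap the overhead of forking is likely to outweigh any gain; the pattern pays off when each row takes noticeable time.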