I want to calculate the Gini coefficient for a number of sets, containing in a two-column table (here called #cits
) containing a value and a set-ID. I have been experimenting with different Gini-coefficient calculations, described here (StackExchange query) and here (StackOverflow question with some good replies). Both of the examples only calculate one coefficient for one table, whereas I would like to do it with a GROUP BY
clause.
The #cits
table contains two columns, c
and cid
, being the value and set-ID respectively.
Here is my current try (incomplete):
select count(c) as numC,
sum(c) as totalC,
(select row_number() over(order by c asc, cid) id, c from #cits) as a
from #cits group by cid
selecting numC and totalC works well, of course, but the next line is giving me a headache. I can see that the syntax is wrong, but I can't figure out how to assign the row_number()
per c per cid.
EDIT:
Based on the suggestions, I used partition
, like so:
select cid,sumC = sum(a.id * a.c)
into #srep
from (
select cid,row_number() over (partition by cid order by c asc) id,
c
from #cits
) as a
group by a.cluster_id1
select count(c) as numC,
sum(c) as totalC, b.sumC
into #gtmp
from #cits a
join #srep b
on a.cid = b.cid
group by a.cid,b.sumC
select
gini = 2 * sumC / (totalC * numC) - (numC - 1) / numC
from #gtmp
This almost works. I get a result, but it is >1, which is unexpected, as the Gini-coefficient should be between 0 and 1. As stated in the comments, I would have preferred a one-query solution as well, but it is not a major issue at all.