SQL syntax for complex GROUP BY with OVER statement: calculating Gini coefficient for multiple sets

Question

I want to calculate the Gini coefficient for a number of sets, containing in a two-column table (here called #cits) containing a value and a set-ID. I have been experimenting with different Gini-coefficient calculations, described here (StackExchange query) and here (StackOverflow question with some good replies). Both of the examples only calculate one coefficient for one table, whereas I would like to do it with a GROUP BY clause.

The #cits table contains two columns, c and cid, being the value and set-ID respectively.

Here is my current try (incomplete):

select count(c) as numC, 
sum(c) as totalC, 
(select row_number() over(order by c asc, cid) id, c from #cits) as a 
from #cits group by cid

selecting numC and totalC works well, of course, but the next line is giving me a headache. I can see that the syntax is wrong, but I can't figure out how to assign the row_number() per c per cid.

EDIT: Based on the suggestions, I used partition, like so:

select cid,sumC = sum(a.id * a.c) 
into #srep 
from (
   select cid,row_number() over (partition by cid order by c asc) id, 
   c 
   from #cits
) as a 
group by a.cluster_id1 


select count(c) as numC, 
    sum(c) as totalC, b.sumC 
into #gtmp 
from #cits a
    join #srep b 
        on a.cid = b.cid 
group by a.cid,b.sumC


select 
    gini = 2 * sumC / (totalC * numC) - (numC - 1) / numC 
from #gtmp

This almost works. I get a result, but it is >1, which is unexpected, as the Gini-coefficient should be between 0 and 1. As stated in the comments, I would have preferred a one-query solution as well, but it is not a major issue at all.

I think you are looking for the partition part of row_number, which is your "grouping". `row_number() over (partition by c, cid order by c asc, cid)` — S3S, Nov 22 '16 at 21:06
It seems like you are right - I have been looking into partition since asking this question here, and think I am close to a solution - although it required me to use two queries... I wanted one, for the beauty of it ;) — ipoga, Nov 22 '16 at 21:09
I would need to see some sample data and expected results to aid in understanding this question.... Do you have multiple rows in #Cits for each cid and C? — xQbert, Nov 22 '16 at 21:14
@scsimon It is not a big problem, please see my edit. If you want to, I would be very happy to see how you would do that, though, as a learning experience. — ipoga, Nov 22 '16 at 21:25
I have a working solution in Impala for Gini Coefficient Calculation. Let me know if that helps. — Indranil Gayen, Jan 29 '17 at 08:22

score 1 · Answer 1 · answered Nov 22 '16 at 21:15

You can "partition" the data so row numbering would start over for each ID... but I'm not sure this is what you're after..

I'm assuming you want the CID displayed as you are grouping by it.

select count(c) as numC
     , sum(c) as totalC
     , row_number() over(partition by cID order by c asc) as a
     , cid 
from #cits group by cid

Note you don't need the subquery.

Yeah this isn't likely right.

output

NumC TotalC A CID
24   383    1  1
15   232    1  2

This was a great help in getting on the right track, please see my edit. — ipoga, Nov 22 '16 at 21:24

score 0 · Answer 2 · answered Nov 22 '16 at 21:18

If I'm understanding correctly, you need numC and totalC for each C in a cid set, as well as the position of the c inside of that set. This should get you what you need:

select
    rn.cid,
    rn.c,
    row_number() over (partition by rn.cid order by rn.c) as id,
    agg.numC,
    agg.totalC
from #cits rn
    left outer join
    (
        select
            cid,
            count(c) as numC,
            sum(c) as totalC
        from #cits
        group by cid
    ) agg
        on rn.cid = agg.cid

note that I don't have any data to run this on, so it might need to be tweaked — DForck42, Nov 22 '16 at 21:18

SQL syntax for complex GROUP BY with OVER statement: calculating Gini coefficient for multiple sets

2 Answers2