
One way to calculate the Gini coefficient of a sample is via the relative mean difference (RMD), which is twice the Gini coefficient. The RMD depends on the mean difference (MD), which is given by:

MD = (1/n^2) * Σ_i Σ_j |y_i - y_j|

So I need to calculate the difference between each pair of elements in the sample (yi - yj). It took me quite a while to figure out a way to do it, but I want to know if there is a function that does this for you.

At first I tried this, but I bet it's very slow on big data sets (by the way, s is the sample):

In [124]:

%%timeit
from itertools import permutations
k = 0
for i, j in list(permutations(s,2)):
    k += abs(i-j)
MD = k/float(len(s)**2)
G = MD / float(mean(s))
G = G/2
G
10000 loops, best of 3: 78 us per loop

Then I tried the following, which is less understandable but quicker:

In [126]:
%%timeit
m = abs(s - s.reshape(len(s), 1))
MD = np.sum(m)/float((len(s)**2))
G = MD / float(mean(s))
G = G/2
G
10000 loops, best of 3: 46.8 us per loop
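For reference, the broadcasting trick above can be wrapped in a small reusable function; this is my own sketch (the function name is mine), using `np.subtract.outer` to build the same pairwise-difference matrix explicitly:

```python
import numpy as np

def mean_difference(s):
    """Mean absolute difference via broadcasting: O(n^2) time and memory."""
    s = np.asarray(s, dtype=float)
    # np.subtract.outer(s, s) builds the full n x n matrix of y_i - y_j
    m = np.abs(np.subtract.outer(s, s))
    return m.sum() / len(s) ** 2

print(mean_difference([2, 3, 2, 34]))  # → 12.125
```

The n×n intermediate matrix is what eventually causes memory trouble on large samples, as discussed in the comments below.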

Is there something efficient but easy to generalize? For example, what if I want to sum over three indices?

This is the sample I was using:

sample = array([5487574374,     686306,    5092789,   17264231,   41733014,
         60870152,   82204091,  227787612,  264942911,  716909668,
        679759369, 1336605253,  788028471,  331434695,  146295398,
         88673463,  224589748,  128576176,  346121028])

gini(sample)
Out[155]:
0.2692307692307692
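The `gini` function called above is not shown in the question; a minimal version consistent with the second snippet might look like this (my own reconstruction, so it need not reproduce the OP's exact output):

```python
import numpy as np

def gini(s):
    """Gini coefficient as half the relative mean difference
    (assumed to match the OP's unshown helper)."""
    s = np.asarray(s, dtype=float)
    # Mean absolute difference over all ordered pairs, O(n^2) memory
    md = np.abs(s - s.reshape(-1, 1)).sum() / len(s) ** 2
    return md / s.mean() / 2.0
```

For a perfectly equal sample the result is 0; for nonnegative data it stays below 1.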

Thanks!

  • Is your question "I want an efficient way to calculate MD given the formula for MD" – Rusty Rob Dec 27 '12 at 00:23
  • Not really. MD is just an example to motivate my question. It's more like "I want an efficient and easy to understand way to sum over indices in general". – r_31415 Dec 27 '12 at 00:26
  • @RobertSmith Can you provide a sample so that I can compare the result of my code with yours? – satoru Dec 27 '12 at 00:31
  • Sure. Give me a few seconds. – r_31415 Dec 27 '12 at 00:32
  • I added a sample. See my update. – r_31415 Dec 27 '12 at 00:38
  • Is there a formulation that can use `(y_i - y_j)**2` for the kernel? If so you can use fast convolution in the frequency domain. Depending on the size of `s` and the particular FFT implementation this could be much faster, or no faster at all. – mtrw Dec 27 '12 at 00:54
  • Do you have some reference about it? I'm not sure I understand what you mean. – r_31415 Dec 27 '12 at 01:11
  • FWIW I find that using `k=2*scipy.spatial.distance.pdist(s.reshape((len(s),1)),metric='cityblock').sum()` is several times faster than your second approach for longer arrays but slower for your `sample`. – DSM Dec 27 '12 at 01:13
  • Followup: of course, since it actually constructs an array, it'll hit memory problems. :-/ – DSM Dec 27 '12 at 01:19
  • I'm not sure that approach is more understandable but it's definitely interesting. It also works for any number of indices. – r_31415 Dec 27 '12 at 01:23
  • @DSM the OP's `s - s.reshape(len(s), 1)` is also constructing an array. – Jaime Dec 27 '12 at 05:34
  • @Jaime: yeah, but I can't see how to salvage it, whereas the first -- and admittedly slow -- OP solution could (if we lose the `list`) not use any extra space beyond `s` itself. – DSM Dec 27 '12 at 05:37
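DSM's `pdist` suggestion in the comments can be spelled out as a runnable sketch (my wording; `pdist` returns each unordered pair once, hence the factor of 2):

```python
import numpy as np
from scipy.spatial.distance import pdist

def mean_difference_pdist(s):
    """Mean absolute difference via scipy's pairwise distances."""
    s = np.asarray(s, dtype=float)
    # pdist yields |y_i - y_j| for each unordered pair i < j;
    # doubling the sum counts ordered pairs, matching the n^2 denominator
    k = 2 * pdist(s.reshape(-1, 1), metric='cityblock').sum()
    return k / len(s) ** 2
```

As noted above, this still materializes an array of all pairwise distances, so it shares the memory limits of the broadcasting approach.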

1 Answer


For the MD example you give, sorting can be exploited: you can achieve O(N log N) instead of O(N^2).

y = [2,3,2,34]

def slow(y):
    tot = 0
    for i in range(len(y)):
        for j in range(len(y)):
            if i != j:
                tot += abs(y[i] - y[j])
    return float(tot)/len(y)**2

print(slow(y))

def fast(y):
    sorted_y = sorted(y)
    tot = 0
    for i, yi in enumerate(sorted_y):
        smaller = i                 # pairs where yi is the larger element
        bigger = len(y) - i - 1     # pairs where yi is the smaller element
        tot += smaller * yi - bigger * yi
    return float(2 * tot) / len(y) ** 2

print(fast(y))

Often you will have to use dynamic programming or other techniques to make these faster; I'm not sure there is a "one method fits all" solution.
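The sorting idea in `fast` can also be vectorized with NumPy (my own sketch, not part of the original answer), keeping the O(n log n) cost while dropping the Python loop: for sorted y, the sum over pairs i < j of (y_j - y_i) equals the sum of (2i - n + 1) * y_i.

```python
import numpy as np

def fast_md(y):
    """Mean absolute difference in O(n log n) via sorting."""
    y = np.sort(np.asarray(y, dtype=float))
    n = len(y)
    i = np.arange(n)
    # For sorted y: sum_{i<j} (y_j - y_i) = sum_i (2i - n + 1) * y_i,
    # since y_i is the larger element in i pairs and the smaller in n-1-i
    tot = np.sum((2 * i - n + 1) * y)
    return 2 * tot / n ** 2

print(fast_md([2, 3, 2, 34]))  # → 12.125
```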

  • Actually, your fast function is 208 times slower than In[126] (until the point of calculating MD). Most likely, a good solution needs to avoid any loops. – r_31415 Dec 27 '12 at 01:08
  • Hmm, how large is the sample? with my testing, large samples are a lot faster with my fast() method. In fact reshape() can't even run on samples larger than 10^4 it seems, whereas mine runs in less than a second on samples of size 10^6. For small samples your method may be faster due to the overhead of function calls etc. – Rusty Rob Dec 27 '12 at 02:03
  • Right. For small samples, the second code I showed is quicker, but it becomes unbearably slow with larger samples, unlike yours, which is fairly quick because you're using the trick of realizing that the original sum can be expressed as a sum of multiplications. Do you know how to extend this result to consider more indices? – r_31415 Dec 27 '12 at 03:22
  • I'll have a think. Probably by using inclusion-exclusion or binary or ternary search. – Rusty Rob Dec 27 '12 at 04:03
  • So if it were yi + yj - yk, you could sort, then go through for each k, and keep a rolling interval for the min yi and max yj that gives a sum greater than yk; that shows how much yk contributes. Similar for higher dimensions, I suppose. Also you may be interested to read this answer, which is very weakly related: http://stackoverflow.com/questions/12946497/sub-on2-algorithm-for-counting-nested-intervals – Rusty Rob Dec 28 '12 at 05:01