
I am trying to apply the formula:

Formula

I am unclear why this does not work:

def gini_node(node):
    count = sum(node)
    gini = functools.reduce(lambda p,c: p + (1 - (c/count)**2), node)
    print(count, gini)
    print(1 - (node[0]/count)**2, 1 - (node[1]/count)**2)
    return gini

Evaluating gini([[175, 330], [220, 120]]) prints:

505 175.57298304087834
0.8799137339476522 0.5729830408783452
340 220.87543252595157
0.5813148788927336 0.8754325259515571

Note that the second print statement prints the figures I want to sum for the example input. The return value (the second value printed by the first print statement) should be a number between 0 and 1.

What is wrong with my reduce?

Full function I am trying to write is:

import functools

def gini_node(node):
    count = sum(node)
    gini = functools.reduce(lambda p,c: p + (1 - (c/count)**2), node)
    print(count, gini)
    print(1 - (node[0]/count)**2, 1 - (node[1]/count)**2)
    return gini

def gini (groups):
    counts = [ sum(node) for node in groups ]
    count = sum(counts)
    proportions = [ n/count for n in counts ]

    return sum([ gini_node(node) * proportion for node, proportion in zip(groups, proportions)])

# test
print(gini([[175, 330], [220, 120]]))
asynts
roberto tomás

1 Answer


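The way reduce works is that it calls the function with exactly two arguments at a time: the value accumulated so far and the next element of the container (see https://docs.python.org/3/library/functools.html#functools.reduce). When no initializer is given, the first element itself is used as the starting accumulated value, and the function is then applied repeatedly across the rest of the list.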

gini = functools.reduce(lambda p,c: p + (1 - (c/count)**2), node)

For the first node, (175, 330), this lambda receives 175 as p and 330 as c and returns 175.57298304087834: the first element is added as-is and never passes through 1 - (x/count)**2. Instead, we want:

gini = functools.reduce(lambda p,c: (1 - (p/count)**2) + (1 - (c/count)**2), node)
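To see which arguments the lambda actually receives, it can help to wrap it in a small tracing helper. This is only a sketch: traced_reduce is a hypothetical name introduced here for illustration, not part of functools.

```python
import functools

def traced_reduce(func, items):
    """Run functools.reduce and record every (accumulator, element) pair passed to func.

    Hypothetical helper for illustration only; no initializer is supplied,
    so the first element of items becomes the starting accumulator.
    """
    calls = []

    def wrapper(acc, item):
        calls.append((acc, item))
        return func(acc, item)

    return functools.reduce(wrapper, items), calls

node = [175, 330]
count = sum(node)

# The lambda from the question: only c is transformed, p is added unchanged.
result, calls = traced_reduce(lambda p, c: p + (1 - (c / count) ** 2), node)
print(calls)   # [(175, 330)]
print(result)  # about 175.573, the value shown in the question's output
```

The single recorded call shows that 175 arrives as the accumulator p without ever being transformed, which is why the result lands far outside the 0 to 1 range.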


I have added some print statements; let's look at their output.

import functools

def gini_node(node):
    count = sum(node)
    gini = functools.reduce(lambda p,c: (1 - (p/count)**2) + (1 - (c/count)**2), node)
    print(count, gini)
    print(1 - (node[0]/count)**2, 1 - (node[1]/count)**2)
    return gini

def gini (groups):
    counts = [ sum(node) for node in groups ]
    count = sum(counts)
    proportions = [ n/count for n in counts ]
    print(count, counts, proportions)  # added: total count, per-node counts, proportions
    gini_indexes = [ gini_node(node) * proportion for node, proportion in zip(groups, proportions)]
    print(gini_indexes)  # added: per-node values weighted by proportion
    return sum(gini_indexes)

# test
print(gini([[175, 330], [220, 120]]))

rahul@RNA-HP:~$ python3 so.py
845 [505, 340] [0.5976331360946746, 0.40236686390532544]
505 1.4528967748259973 # The second number is the sum of the two numbers on the next line
0.8799137339476522 0.5729830408783452
340 1.4567474048442905 # Same here
0.5813148788927336 0.8754325259515571
# The first entry of the list below is 1.45289677... * 0.597633...,
# i.e. the node's sum multiplied by its proportion.
[0.868299255961099, 0.5861468847894187]
# The value passed to the final print statement is the sum of the list above, i.e. the weighted values of all nodes added together
1.4544461407505178

A simpler approach, which also works when a node has more than two elements:

gini = sum([(1 - (p/count)**2) for p in node])

For two-element nodes this gives the same result as the reduce() call above.
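If reduce is still preferred, one way to make it handle nodes of any length is to pass an initializer of 0, so that every element, including the first, goes through the lambda as c. The sketch below compares that with the sum() version; the function names sum_of_terms and reduce_of_terms and the three-element node are made up for illustration.

```python
import functools

def sum_of_terms(node):
    """Sum of 1 - (p/count)**2 over the node, using a generator expression."""
    count = sum(node)
    return sum(1 - (p / count) ** 2 for p in node)

def reduce_of_terms(node):
    """Same quantity via reduce; the initializer 0 is the starting accumulator."""
    count = sum(node)
    return functools.reduce(lambda acc, c: acc + (1 - (c / count) ** 2), node, 0)

# Illustrative three-element node (not from the question).
node = [175, 330, 95]
print(sum_of_terms(node), reduce_of_terms(node))  # the two values agree up to rounding
```

With the initializer in place, the accumulator only ever holds already-transformed terms, so the length of the node no longer matters.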

Rahul
  • Rahul, this was very good of you, but I'm afraid you are wrong. A Gini coefficient is always between 0 and 1. There are groups ("country a", "array b", etc.) and classes within each group ("people with short hair", "characteristic b", etc.). Each group's Gini index must sum to one, and must be multiplied by the proportion before being summed, so that the overall value is also between 0 and 1. Zero represents absolute equality. – roberto tomás May 15 '19 at 11:17
  • What I am calculating is elegantly described here: http://shlegeris.com/gini and fully described here (but I don't understand tables in R): http://www.learnbymarketing.com/481/decision-tree-flavors-gini-info-gain/ A different, and far more intuitive, way to calculate it is here: https://planspace.org/2013/06/21/how-to-calculate-gini-coefficient-from-raw-data-in-python/ But for now I should implement this formula. – roberto tomás May 15 '19 at 11:18
  • However, it didn't take long *at all* to get what I needed from your last code block! Thank you!!! `gini = 1 - sum([ (p/count)**2 for p in node ])` – roberto tomás May 15 '19 at 11:30
  • So how come the summation here goes above 1? Is it because of negative correlation? – Rahul May 15 '19 at 12:51
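Putting the asker's final formula from the comments into the full function gives something like the sketch below. It assumes each node is a list of class counts, as in the question. It also shows why the answer's sum exceeds 1: adding 1 - p**2 once per class yields k - sum(p**2) for k classes, whereas the asker's formula subtracts the squared proportions from 1 only once.

```python
def gini_node(node):
    """Gini impurity of one node: 1 minus the sum of squared class proportions."""
    count = sum(node)
    return 1 - sum((p / count) ** 2 for p in node)

def gini(groups):
    """Impurity of each node, weighted by the node's share of all samples."""
    total = sum(sum(node) for node in groups)
    return sum(gini_node(node) * sum(node) / total for node in groups)

# test, using the input from the question
print(gini([[175, 330], [220, 120]]))  # roughly 0.4544
```

For this input the result is exactly 1 less than the 1.4544... printed by the answer's version, since each two-class node contributes one extra 1 there and the proportions sum to 1.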