
I have a DataFrame like this:

    Interesting  genre_1      probabilities
1   no           Empty        0.251306
2   yes          Empty        0.042043
3   no           Alternative  5.871099
4   yes          Alternative  5.723896
5   no           Blues        0.027028
6   yes          Blues        0.120248
7   no           Children's   0.207213
8   yes          Children's   0.426679
9   no           Classical    0.306316
10  yes          Classical    1.044135

I would like to compute the Gini index for each genre from its pair of probabilities (one for each value of the Interesting column), and then store that value in a new pandas column.

This is the function to get the Gini index:

#Gini Function
#a and b are the quantities of each class
def gini(a,b):
    a1 = (a/(a+b))**2
    b1 = (b/(a+b))**2
    return 1 - (a1 + b1) 

EDIT: Sorry, I had an error in my desired final DataFrame. Whether a row is interesting or not determines which probability is prob(A) and which is prob(B), but the Gini score is the same either way, because it measures how much impurity we get when classifying a song as interesting or not. If the probabilities are close to 50/50, the Gini score reaches its maximum (0.5), because choosing interesting or not interesting is then equally likely to be a mistake.

So for the first two rows, the Gini index will be:

a=no; b=Empty -> gini(0.251306, 0.042043)= 0.245559831601612
a=yes; b=Empty -> gini(0.042043, 0.251306)= 0.245559831601612
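
As a quick check of the two values above, here is a minimal sketch that evaluates the same formula in both argument orders. It restates gini() with a docstring; the variable names no_empty and yes_empty are only illustrative.

```python
def gini(a, b):
    """Gini impurity of a two-class split with class quantities a and b."""
    total = a + b
    return 1 - ((a / total) ** 2 + (b / total) ** 2)


# Probabilities of the two "Empty" rows from the table above.
no_empty, yes_empty = 0.251306, 0.042043

# The formula is symmetric in a and b, so both orders give the same score.
print(gini(no_empty, yes_empty))
print(gini(yes_empty, no_empty))
```

Because the two squared shares are simply added, swapping a and b cannot change the result, which is why both rows of a genre end up with the same score.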

Then I would like to get something like:

    Interesting  genre_1      probabilities  GINI INDEX
1   no           Empty        0.251306       0.245559831601612
2   yes          Empty        0.042043       0.245559831601612
3   no           Alternative  5.871099       0.4999194135183881
4   yes          Alternative  5.723896       0.4999194135183881
5   no           Blues        0.027028       ..
6   yes          Blues        0.120248
7   no           Children's   0.207213
8   yes          Children's   0.426679
9   no           Classical    0.306316       ..
10  yes          Classical    1.044135       ..
Javiss
  • Is the Gini index dependent on the values of `a` and `b`? For example, if `a=yes; b=Empty`, will the Gini index still be 0.5? – SKPS Feb 11 '20 at 01:45
  • I'm not clear on what the inputs of `gini` are for each row. For instance, why was 2.513 repeated on the first row and different values on the second. – busybear Feb 11 '20 at 01:52
  • I think this is similar to a rolling mean. The next probability is the b value. If there is no a value (first row), then you use the first value for both a and b. The question is confusing because the Interesting column does not factor into the calculation of the Gini coefficient: at no point do its values (yes or no) translate into arguments that are usable in the calculation, which obviously takes only numeric inputs. – J.Doe Feb 11 '20 at 02:15
  • @SKPS you are right, take a look at my edit – Javiss Feb 11 '20 at 09:22
  • @busybear thanks for noticing, I had a mistake in my final output and in the input of the GINI index – Javiss Feb 11 '20 at 09:23
  • @J.Doe yeah, it was very confusing. Can you check my edit? I think it is clearer now. There is no need to "create a rolling GINI", but thanks, that was an amazing approach! – Javiss Feb 11 '20 at 09:24

2 Answers


I am not sure how the Interesting column plays into all of this, but I highly recommend that you make the new column by using numpy.where(). The syntax would be something like:

import numpy as np
df['GINI INDEX'] = np.where(__condition__,__what to do if true__,__what to do if false__)
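
One way the numpy.where() template could be filled in is sketched below. It assumes the rows alternate no/yes within each genre, as in the question's table, so the partner probability of a 'no' row sits on the next row and that of a 'yes' row on the previous one. The small inline DataFrame and the names p, partner and total are illustrative only.

```python
import numpy as np
import pandas as pd

# First four rows of the question's table, rebuilt inline for the example.
df = pd.DataFrame({
    "Interesting": ["no", "yes", "no", "yes"],
    "genre_1": ["Empty", "Empty", "Alternative", "Alternative"],
    "probabilities": [0.251306, 0.042043, 5.871099, 5.723896],
})

p = df["probabilities"]

# Assumption: rows alternate no/yes within a genre, so the partner of a
# "no" row is the next row and the partner of a "yes" row is the previous one.
partner = np.where(df["Interesting"] == "no", p.shift(-1), p.shift(1))

# Vectorised form of gini(a, b) = 1 - ((a/(a+b))**2 + (b/(a+b))**2).
total = p + partner
df["GINI INDEX"] = 1 - ((p / total) ** 2 + (partner / total) ** 2)

print(df)
```

If the rows are not strictly paired in this order, the shifted values would come from the wrong genre, so this sketch only fits data laid out like the question's table.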
Lucas H

OK, I think I know what you mean. The code below ignores whether the Interesting value is 'yes' or 'no'. What you want, though, is to calculate the Gini coefficient in one of two ways for each row, depending on that row's Interesting value. If Interesting is 'no', the result is 0.5, because a == b. If Interesting is 'yes', you need a = probability[i] and b = probability[i+1]. So skip this section and go to the updated code below.

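import pandas as pd

# Whitespace-separated file; sep=r'\s+' replaces delim_whitespace=True,
# which recent pandas releases deprecate.
df = pd.read_csv('df.txt', sep=r'\s+')

# Use a NumPy array so that [i] is positional even if the index starts at 1.
probs = df['probabilities'].to_numpy()


def ROLLING_GINI(probabilities):
    # First row has no predecessor: use its own value for both a and b (gives 0.5).
    a1 = (probabilities[0]/(probabilities[0]+probabilities[0]))**2
    b1 = (probabilities[0]/(probabilities[0]+probabilities[0]))**2
    res = 1 - (a1 + b1)
    yield res

    # Every following row pairs probability i (a) with probability i+1 (b).
    for i in range(len(probabilities)-1):
        a1 = (probabilities[i]/(probabilities[i]+probabilities[i+1]))**2
        b1 = (probabilities[i+1]/(probabilities[i]+probabilities[i+1]))**2
        res = 1 - (a1 + b1)
        yield res


df['GINI'] = list(ROLLING_GINI(probs))

print(df)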
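
For comparison, here is a minimal sketch of a per-genre variant that does not depend on row order at all. It assumes each genre has exactly one 'no' and one 'yes' row, as in the question's table, and uses pandas' groupby(...).transform(...) to broadcast one score to every row of a genre. The helper name gini_of_group and the inline DataFrame are hypothetical, added only for illustration.

```python
import pandas as pd


def gini_of_group(probs):
    """Gini impurity of one genre: 1 minus the sum of squared shares."""
    shares = probs / probs.sum()
    return 1 - (shares ** 2).sum()


# First six rows of the question's table, rebuilt inline for the example.
df = pd.DataFrame({
    "Interesting": ["no", "yes", "no", "yes", "no", "yes"],
    "genre_1": ["Empty", "Empty", "Alternative", "Alternative", "Blues", "Blues"],
    "probabilities": [0.251306, 0.042043, 5.871099, 5.723896, 0.027028, 0.120248],
})

# transform() repeats the scalar returned for each genre on all of its rows.
df["GINI INDEX"] = df.groupby("genre_1")["probabilities"].transform(gini_of_group)

print(df)
```

With two rows per genre this reduces to the question's gini(a, b), so both rows of a genre receive the same value regardless of which one is 'yes'.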

This is where the real trouble starts. If I understand your idea correctly, you cannot calculate the last Gini value with this DataFrame. The last Interesting value is 'yes', so I have to use a = probability[i] and b = probability[i+1]. But the DataFrame has only 10 rows, and for the last row that would require a probability from a nonexistent row 11. So for your idea to work, the last Interesting value must be 'no'; otherwise you will always get an index error.

Here's the code anyway:

import pandas as pd

# sep=r'\s+' replaces delim_whitespace=True, which recent pandas releases deprecate.
df = pd.read_csv('df.txt', sep=r'\s+')


def ROLLING_GINI(dataframe):
    # NumPy arrays make [i] positional even if the index starts at 1.
    probabilities = dataframe['probabilities'].to_numpy()
    how_to_calculate = dataframe['Interesting'].to_numpy()

    # Stops one row early because the 'yes' branch needs row i+1.
    for i in range(len(dataframe)-1):

        if how_to_calculate[i] == 'yes':
            a1 = (probabilities[i]/(probabilities[i]+probabilities[i+1]))**2
            b1 = (probabilities[i+1]/(probabilities[i]+probabilities[i+1]))**2
            res = 1 - (a1 + b1)
            yield res

        elif how_to_calculate[i] == 'no':
            # a == b, so the result is always 0.5.
            a1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            b1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            res = 1 - (a1 + b1)
            yield res


GINI = list(ROLLING_GINI(df))

print('All GINI coefficients: %s'%GINI)
print('Length of all calculatable GINI coefficients: %s'%len(GINI))
print('Number of rows in the dataframe: %s'%len(df))
print('The last Interesting value is: %s'%df.iloc[-1,0])

EDIT NUMBER THREE (sorry for the late realization):

So it does work if I apply the indexing correctly. The problem was that I was using the next probability when I should have used the previous one. So it's a = probabilities[i-1] and b = probabilities[i].

import pandas as pd

# sep=r'\s+' replaces delim_whitespace=True, which recent pandas releases deprecate.
df = pd.read_csv('df.txt', sep=r'\s+')


def ROLLING_GINI(dataframe):
    # NumPy arrays make [i] positional even if the index starts at 1.
    probabilities = dataframe['probabilities'].to_numpy()
    how_to_calculate = dataframe['Interesting'].to_numpy()

    for i in range(len(dataframe)):

        if how_to_calculate[i] == 'yes':
            # Pairs row i with the previous row; assumes the first row is 'no'.
            a1 = (probabilities[i-1]/(probabilities[i-1]+probabilities[i]))**2
            b1 = (probabilities[i]/(probabilities[i-1]+probabilities[i]))**2
            res = 1 - (a1 + b1)
            yield res

        elif how_to_calculate[i] == 'no':
            # a == b, so the result is always 0.5.
            a1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            b1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            res = 1 - (a1 + b1)
            yield res


GINI = list(ROLLING_GINI(df))

print('All GINI coefficients: %s'%GINI)
print('Length of all calculatable GINI coefficients: %s'%len(GINI))
print('Number of rows in the dataframe: %s'%len(df))
print('The last Interesting value is: %s'%df.iloc[-1,0])
J.Doe
  • I think your first solution works on my last edit. I just needed to take the "even indices" (as the first two rows of each genre will have the same value) – Javiss Feb 11 '20 at 09:39
  • Nice. Btw, if your question was answered by one of the users, you should accept the answer by clicking on the green tick on the left ^^ – J.Doe Feb 12 '20 at 09:51