How can I improve this Python code to calculate Information Gain from Gini impurity?

Question

The following code is intended to calculate information gain for a dataset, using Gini impurity. I believed the code was functional and would succeed in all cases, but it fails several hidden test cases on Sololearn.

My submission is below, but here is a link to the same at Sololearn: https://code.sololearn.com/cQEDIvXRgL3e

The pedantic version of my code, with editable inputs and exhaustive outputs, is at: https://code.sololearn.com/cO755SFZAUJ0

Is there an error or oversight in this code that I'm missing? There must be something wrong with it as it's failing in the hidden test cases, but I have no idea what that could be.

From what I can see in the visible test cases, Sololearn sends sets of 1s and 0s with an even number of elements, which the code converts into lists using the lines below. In my test version these lines are swapped for empty lists, which I populate with 1s and 0s before running it. I've tried sets of both odd and even lengths, with the resulting splits being of equal or unequal length, and it doesn't seem to adversely affect the results.

s = [int(x) for x in input().split()]
a = [int(x) for x in input().split()]
b = [int(x) for x in input().split()]

#Function to get counts for set and splits, to be used in later formulae.
def setCount(n):
    return len(n)

Cs = setCount(s)
Ca = setCount(a)
Cb = setCount(b)

#Function to get sums of "True" values in each, for later formulae.
def tSum(x):
    sum = 0
    for n in x:
        if n == 1:
            sum += 1
    return sum

Ss = tSum(s)
Sa = tSum(a)
Sb = tSum(b)

#Function to get percentage of "True" values in each, for later formulae.
def getp(x, n):
    p = x/n
    return p

Ps = (getp(Ss, Cs))
Pa = (getp(Sa, Ca))
Pb = (getp(Sb, Cb))

#Function to get Gini impurity for each, to be used in final formula.
def gimp(p):
    return 2 * p * (1-p)

Hs = (gimp(Ps))
Ha = (gimp(Pa))
Hb = (gimp(Pb))

#Final formula, intended to output information gain to five decimal places.
infoGain = round((Hs - (Sa/Ss) * Ha - (Sb/Ss) * Hb),5)

print(infoGain)

score 0 · Answer 1 · answered Jul 08 '22 at 21:14

This question was answered on Sololearn by Tibor Santa, a mentor there. Their code, which passed the test cases, is much more direct. It is pasted below, and can be found on Sololearn at: https://code.sololearn.com/cUoaMq6bzxP8/

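The long and short of it is that the two versions weight the impurity of each split differently. My final formula multiplies Ha and Hb by Sa/Ss and Sb/Ss, the count of 1s in each split relative to the count of 1s in the full set. The code that passes weights them by len(A)/len(S) and len(B)/len(S), each split's share of all elements. The two weights coincide only when a split has the same proportion of 1s as the full set, so some inputs give identical results while others do not. That difference, rather than rounding to five decimals, is why my results diverged from the expected values. My version also divides by zero when the full set contains no 1s, and it is unnecessarily verbose.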
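To see how the weighting term affects the result, here is a small comparison. It is a sketch, not either author's code: one function weights each split by its size (len), as in the code below, and the other by its count of 1s (sum), as in the final formula of the question. The function names and the sample lists are made up for illustration and do not come from the Sololearn tests.

```python
def gini(p):
    # Gini impurity of a binary label set whose fraction of 1s is p
    return 2 * p * (1 - p)

def fraction_of_ones(labels):
    # Proportion of 1s in a non-empty list of 0/1 values
    return sum(labels) / len(labels)

def gain_weighted_by_size(s, a, b):
    # Weights are subset sizes: len(a)/len(s) and len(b)/len(s)
    return (gini(fraction_of_ones(s))
            - len(a) / len(s) * gini(fraction_of_ones(a))
            - len(b) / len(s) * gini(fraction_of_ones(b)))

def gain_weighted_by_ones(s, a, b):
    # Weights are counts of 1s: sum(a)/sum(s) and sum(b)/sum(s)
    return (gini(fraction_of_ones(s))
            - sum(a) / sum(s) * gini(fraction_of_ones(a))
            - sum(b) / sum(s) * gini(fraction_of_ones(b)))

# Hypothetical sample data: s is the full set, a and b are its two splits
s = [1, 1, 1, 0, 0, 0]
a = [1, 1, 1, 0]
b = [0, 0]

print(round(gain_weighted_by_size(s, a, b), 5))  # 0.25
print(round(gain_weighted_by_ones(s, a, b), 5))  # 0.125
```

On this input the two weightings disagree well before the fifth decimal place, so the gap is not a rounding effect.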

The code that solved the test cases:

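# Read the full set and the two splits, one per line, as in the question.
S = [int(x) for x in input().split()]
A = [int(x) for x in input().split()]
B = [int(x) for x in input().split()]

def gini(p):
    return 2 * p * (1 - p)

def p(data):
    # Proportion of 1s in a list of 0/1 values.
    return sum(data) / len(data)

giniS = gini(p(S))
# Each split's impurity is weighted by its share of the elements in S.
deltaA = gini(p(A)) * len(A) / len(S)
deltaB = gini(p(B)) * len(B) / len(S)
gain = giniS - deltaA - deltaB

print(round(gain, 5))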
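The same calculation can be packaged as a reusable function. The version below is a minimal sketch under two assumptions of my own that do not come from the Sololearn task: an empty split is treated as having zero impurity, and an empty parent set raises an error. The names `information_gain`, `parent`, `left` and `right` are hypothetical.

```python
def gini(labels):
    # Gini impurity of a list of 0/1 labels.
    # Assumption: an empty list is treated as pure (impurity 0.0).
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def information_gain(parent, left, right):
    # Parent impurity minus the size-weighted impurity of the two splits
    n = len(parent)
    if n == 0:
        raise ValueError("parent set must not be empty")
    weighted = sum(len(part) / n * gini(part) for part in (left, right))
    return gini(parent) - weighted

# Hypothetical example: a split that separates the classes perfectly
parent = [1, 0, 1, 0, 1, 0]
left = [1, 1, 1]
right = [0, 0, 0]
print(round(information_gain(parent, left, right), 5))  # 0.5
```

Guarding against empty lists matters because `sum(data) / len(data)` raises `ZeroDivisionError` when `data` is empty. Whether the hidden tests ever supply an empty split is not known.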
