The following code is intended to calculate the information gain of a split, using Gini impurity. I believed the code was correct and would work in all cases, but it fails several hidden test cases on Sololearn.
My submission is below, but here is a link to the same at Sololearn: https://code.sololearn.com/cQEDIvXRgL3e
The pedantic version of my code, with editable inputs and exhaustive outputs, is at: https://code.sololearn.com/cO755SFZAUJ0
Is there an error or oversight in this code that I'm missing? Something must be wrong, since it fails the hidden test cases, but I have no idea what that could be.
From what I can see in the visible test cases, Sololearn sends sets of 1s and 0s containing an even number of elements, and the code converts them into lists using the lines below. In my test version these lines are replaced with empty lists, which I populate with 1s and 0s before running it. I've tried sets of both odd and even lengths, with the resulting splits being of equal or unequal length, and it doesn't seem to adversely affect the results.
s = [int(x) for x in input().split()]
a = [int(x) for x in input().split()]
b = [int(x) for x in input().split()]
# Function to get counts for set and splits, to be used in later formulae.
def setCount(n):
    return len(n)
Cs = setCount(s)
Ca = setCount(a)
Cb = setCount(b)
# Function to get sums of "True" values in each, for later formulae.
def tSum(x):
    total = 0  # named "total" to avoid shadowing the built-in sum()
    for n in x:
        if n == 1:
            total += 1
    return total
Ss = tSum(s)
Sa = tSum(a)
Sb = tSum(b)
# Function to get proportion of "True" values in each, for later formulae.
def getp(x, n):
    p = x/n
    return p
Ps = (getp(Ss, Cs))
Pa = (getp(Sa, Ca))
Pb = (getp(Sb, Cb))
# Function to get Gini impurity for each, to be used in final formula.
def gimp(p):
    return 2 * p * (1-p)
Hs = (gimp(Ps))
Ha = (gimp(Pa))
Hb = (gimp(Pb))
#Final formula, intended to output information gain to five decimal places.
infoGain = round((Hs - (Sa/Ss) * Ha - (Sb/Ss) * Hb),5)
print(infoGain)
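
For comparison, here is a minimal sketch of information gain as it is usually defined, where each split's impurity is weighted by its share of the elements (len(a)/len(s)) rather than by its share of the 1s (Sa/Ss), which is what the final formula above does. It also treats an empty split as having zero impurity so that no division by zero occurs. Both points are assumptions about what the hidden test cases might exercise, not something confirmed by Sololearn, and the names gini and info_gain are hypothetical helpers introduced only for this illustration.

```python
def gini(labels):
    """Gini impurity 2*p*(1-p) for a list of 0/1 labels.

    Assumption: an empty list is treated as perfectly pure (impurity 0.0).
    """
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # proportion of 1s; assumes labels are only 0 or 1
    return 2 * p * (1 - p)


def info_gain(s, a, b):
    """Information gain of splitting s into a and b, weighted by element counts."""
    n = len(s)
    if n == 0:
        return 0.0  # assumption: nothing to gain from an empty set
    return gini(s) - (len(a) / n) * gini(a) - (len(b) / n) * gini(b)


# Hard-coded sample data instead of input(), so the sketch runs on its own.
s = [1, 0, 1, 0, 1, 0, 1, 0]
a = [1, 1, 1, 0]
b = [0, 0, 0, 1]
print(round(info_gain(s, a, b), 5))  # prints 0.125
```

With these sample lists the count-weighted formula gives 0.125, while the sum-weighted formula in my submission gives 0.5 - (3/4)(0.375) - (1/4)(0.375) = 0.125 as well, so the two only diverge when a split's share of 1s differs from its share of elements, for example a = [1, 1, 0, 0, 0] and b = [1, 1, 0].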