I am attempting to perform classification on a large feature matrix (~2500 × ~4000) and have a confidence file accompanying the training data.
I am attempting to use the confidence values as the class_weight parameter of a classifier and am having trouble understanding the dictionary format that class_weight requires.
I've been looking for solutions to an error caused by using a dictionary in the format {0:1, 1:0.66, 2:0.66, 3:1 ...}, but have recently learned that the parameter requires the form [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] [https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier]
I suppose I don't understand the format of [{a:b, c:d}, ...]. I believe d is the weight, but I am unsure of the rest of the structure or how to get there from my CSV file.
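To make the structure concrete, here is a minimal sketch of how I currently read the linked docs: in {a: b, c: d}, a and c would be class labels and b and d their weights, and the list form would hold one such dict per output column of a multi-output y. The toy labels and the helper `keys_not_in_labels` are hypothetical, added only for illustration, and I have not confirmed that LinearSVC accepts the list form at all.

```python
# Hypothetical toy labels; the real y has one label per row of X.
y = [0, 0, 1, 1]

# Single-output form: one dict mapping class label -> weight.
class_weight_single = {0: 1.0, 1: 5.0}

# Multi-output form quoted from the docs: one such dict per output column of y.
class_weight_multi = [{0: 1, 1: 1}, {0: 1, 1: 5}]

# A per-row dict like mine uses row indices as keys, not class labels.
per_row = {0: 0.66, 1: 1.0, 2: 0.66, 3: 0.66}


def keys_not_in_labels(weights, labels):
    """Return the dict keys that are not class labels occurring in `labels`."""
    present = set(labels)
    return sorted(k for k in weights if k not in present)


print(keys_not_in_labels(class_weight_single, y))  # []
print(keys_not_in_labels(per_row, y))  # [2, 3]
```

If this reading is right, a dict keyed by row index would contain keys such as 2 and 3 that are not class labels when y only holds 0 and 1.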
What I have so far:
>>> import csv
>>> with open('confidence.csv') as csvfile:
...     reader = csv.DictReader(csvfile, delimiter=",")
...     # Map zero-based row index -> confidence (float, so 0.66 is not truncated to 0)
...     confidence_dict = {int(row['ID']) - 1: float(row['confidence']) for row in reader}
...
>>> print(confidence_dict)
{0: 0.66, 1: 1, 2: 0.66, 3: 0.66, 4: 1, ...}
>>> print(X)
v0 v1 v2 v3 ...
0 1.413 0.874 0.506 1.790
1 0.253 0.253 0.486 1.864
2 1.863 0.174 0.018 1.789
3 0.253 0.213 0.486 1.834
...
>>> print(y)
0 0
1 0
2 1
3 1
...
>>> from sklearn.svm import LinearSVC
>>> linearSVC = LinearSVC(random_state=0, tol=1e-6, class_weight=confidence_dict)
>>> linearSVC.fit(X, y)
ValueError: Class label 2 not present.
This error (of the general form "Class label {} not present.") is raised when I pass the class weights in the current dictionary form. It does not occur if no class weight is given.
There is limited information about this topic online, so I thought I would try to write a clear post and hopefully get a grasp of how to implement this.
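One direction that might fit, since the confidences are per row rather than per class, is to turn them into a list of per-sample weights aligned with the rows of X. The sketch below is only an assumption about what I need: the inline CSV text is made up (only the column names 'ID' and 'confidence' come from my file), and `load_sample_weights` is a hypothetical helper.

```python
import csv
import io

# Hypothetical stand-in for confidence.csv; the values are made up.
csv_text = "ID,confidence\n1,0.66\n2,1\n3,0.66\n4,1\n"


def load_sample_weights(csvfile, n_rows):
    """Return a list with one weight per row of X, ordered by row index."""
    weights = [1.0] * n_rows  # rows missing from the file keep weight 1.0
    for row in csv.DictReader(csvfile, delimiter=","):
        # IDs in the file are 1-based, row indices of X are 0-based.
        weights[int(row["ID"]) - 1] = float(row["confidence"])
    return weights


sample_weights = load_sample_weights(io.StringIO(csv_text), n_rows=4)
print(sample_weights)  # [0.66, 1.0, 0.66, 1.0]
```

Such a list could presumably be passed as linearSVC.fit(X, y, sample_weight=sample_weights) with class_weight left unset, since fit is documented to take a sample_weight argument, but I have not verified that this is the intended use of the confidence values.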