
I am attempting to perform classification on a large (~2500*~4000) feature matrix and have a confidence document accompanying the training data.

I am attempting to use the confidence values as the class_weight parameter of a classifier and am having trouble understanding the dictionary format that class_weight requires. I've been looking for solutions to an error caused by using a dictionary in the format {0:1, 1:0.66, 2:0.66, 3:1 ...}, but have recently learned that the parameter requires the form [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] [https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier]

I suppose I don't understand the format [{a: b, c: d}, ...]. I believe d is the weight, but I am unsure of the rest of the structure or how to get there from my CSV file.

What I have so far:


>>> import csv
>>> with open('confidence.csv') as csvfile:
...     reader = csv.DictReader(csvfile, delimiter=",")
...     # Key by zero-based row index; keep the confidence as a float,
...     # since int(float(...)) would truncate 0.66 to 0.
...     confidence_dict = {int(row['ID']) - 1: float(row['confidence']) for row in reader}

>>> print(confidence_dict)
{0: 0.66, 1: 1.0, 2: 0.66, 3: 0.66, 4: 1.0, ...}

>>> print(X)
    v0    v1    v2    v3     ...
0   1.413 0.874 0.506 1.790
1   0.253 0.253 0.486 1.864 
2   1.863 0.174 0.018 1.789
3   0.253 0.213 0.486 1.834
...

>>> print(y)
0   0
1   0
2   1
3   1
...

>>> from sklearn.svm import LinearSVC
>>> linearSVC = LinearSVC(random_state=0, tol=1e-6, class_weight=confidence_dict)
>>> linearSVC.fit(X, y)

Fitting with the class weights in the current dictionary form raises "Class label {} not present." This does not occur if no class weight is passed.

ValueError: Class label 2 not present.

There is limited information about this topic online so I thought I would try to make a clear post and hopefully get a grasp of how to implement this.

Cameron L
  • I'm an inexperienced poster, so any advice or feedback on that is also appreciated – Cameron L May 20 '19 at 03:01
  • I'm pretty sure you are providing more class weights in `confidence_dict` than are actually present in `y`. The error says there's no class label "2" present in your `y` vector. What values are in your `y` vector? – JimmyOnThePage May 20 '19 at 04:11
  • The lengths of `X`, `y`, and `confidence_dict` are all the same. But you did put me on to something. Where I was copying a DataFrame, it was becoming a `pandas.core.series.Series`. I just had to wrap it in pd.DataFrame() to get the desired `pandas.core.frame.DataFrame` type. Now I have the same issue, but with `ValueError: Class label 3 not present.` – Cameron L May 20 '19 at 05:22
  • What I'm saying is that the number of unique values in `y` is not the same as the number of class: weight pairs you are providing in `confidence_dict`. The classifier can't attach a class weight to a label it doesn't see in the results, which is your `y` vector – JimmyOnThePage May 20 '19 at 05:24
  • Forgive me if I'm not understanding correctly. The number of values in y (a 1-dimensional DataFrame of 1s and 0s) is equal to the number of pairs in `{0: 0.66, 1: 1, 2: 0.66, 3: 0.66, 4: 1, ..., len(y):0.66}`. y[0] corresponds to 0:0.66 – Cameron L May 20 '19 at 05:38
  • Ah, I see. You are misunderstanding what the class weight parameter is for. Class weight is meant to handle imbalanced classes, meaning you have more of a certain class than others. If your `y` vector is only 1s and 0s, you have only 2 classes. You can only provide one weight per class – JimmyOnThePage May 20 '19 at 05:43
  • Ah damnit, do you know of any feature that would do what I set out for? Applying the confidence of the output of the training data per row. – Cameron L May 20 '19 at 06:10
  • Is the confidence the confidence of a previous prediction, or something else? Do you want to use it as another feature for prediction? If so, just add it to your `X` matrix – JimmyOnThePage May 20 '19 at 06:18
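
On the question of applying a confidence to each training row: one option not raised in this thread is the sample_weight argument of fit, which LinearSVC accepts as one weight per row. The following is a minimal sketch, assuming the confidences line up row for row with X and y. The arrays are made-up stand-ins, not the data from confidence.csv.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Made-up stand-ins for X, y and the per-row confidences
# (assumptions for illustration, not the question's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = (X[:, 0] > 0).astype(int)  # two classes: 0 and 1
confidence = np.where(np.arange(40) % 3 == 0, 0.66, 1.0)  # one weight per row

clf = LinearSVC(random_state=0)
# sample_weight takes one weight per training row, whereas
# class_weight takes one weight per class label.
clf.fit(X, y, sample_weight=confidence)

print(clf.classes_)
```

Whether weighting rows by confidence suits the problem depends on what the confidence values mean, which the thread leaves open.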

1 Answer


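After some further research and guidance from JimmyOnThePage in the comments, I have learned that I was mistaken in thinking class_weight could take one entry per training row, in the form

{0: x0, 1: x1, 2: x2, ..., n: xn}

where x is the confidence of that row's label. The keys of a class_weight dictionary are class labels, not row indices, so for labels of 0 and 1 it needs to be a dictionary in the form

{0: w0, 1: w1}

where w0 is the weight of outcome 0 and w1 the weight of outcome 1. The list form shown in the RandomForestClassifier documentation, [{0: y0, 1: z0}, {0: y1, 1: z1}, ...], is for multi-output problems: it holds one dictionary per column of y, where y is the weight of outcome 0 and z the weight of outcome 1 in that column.

This is why ValueError: Class label 2 not present. occurs. Key 2 of my dictionary is read as a class label, and y contains only the labels 0 and 1.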
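
One way to see where the error comes from is to compare the dictionary keys with the labels that actually occur in y. The sketch below is plain Python. `unknown_labels` is a hypothetical helper, not part of scikit-learn, and the values are shortened stand-ins for the data in the question.

```python
def unknown_labels(class_weight, y):
    """Return the class_weight keys that never occur as a label in y."""
    present = set(y)
    return sorted(key for key in class_weight if key not in present)


# Shortened stand-ins for the question's data (illustrative values only).
y = [0, 0, 1, 1, 0]
per_row = {0: 0.66, 1: 1, 2: 0.66, 3: 0.66, 4: 1}  # keyed by row index
per_class = {0: 1.0, 1: 0.66}                      # keyed by class label

print(unknown_labels(per_row, y))    # [2, 3, 4]
print(unknown_labels(per_class, y))  # []
```

Keys 2, 3 and 4 are row indices rather than labels, which is consistent with the error naming label 2 first.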

Cameron L