
I have a multilabel classification problem. I used the following code, but the validation accuracy jumps to 99% within the first epochs. My whole input is 1245x1024: each row is one spectrum, so there are 1245 example spectra, each of shape (1x1024). The outputs are my labels. I have 245 different classes (here, elements), and one spectrum contains one or two elements. The output for one prediction has shape (1x245).

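    # Imports were not shown in the original post; tensorflow.keras is assumed here.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout

    x = pd.read_csv('spectrum_max2proSpund245.csv')
    y1 = pd.read_csv('nuclides_max2proSpund245.csv', delimiter=';')
    # y (used below) is the multi-hot label matrix obtained by applying
    # MultiLabelBinarizer to y1; that step is not shown in the post.

    num_features = 1024
    model = Sequential()
    model.add(Dense(230, kernel_initializer='normal', input_shape=(num_features,),
                    activation='tanh'))
    model.add(Dropout(0.25))
    model.add(Dense(245, kernel_initializer='normal', activation='sigmoid'))
    model.summary()

    X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=42, test_size=0.2)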

That's how I compile my model:

    model.compile(loss='binary_crossentropy', optimizer='Adam', metrics=['accuracy'])

And that's how I fit and evaluate it:

    model.fit(X_train, y_train, epochs=20, validation_data=(X_test, y_test), batch_size=60)

    model.evaluate(X_test, y_test, batch_size=60)

Model evaluation and model fitting report the same results, but when I check my predictions manually, only about 19% of them are correct. What is wrong in my code?

Bienle
  • If it is a multiclass problem, you have to use the `categorical_crossentropy` loss. The labels also need to be converted into the categorical format; see the `to_categorical()` function in Keras for this. – Stergios Feb 27 '20 at 12:32
  • @Stergios No, it's multilabel :). Multiclass is when one input gets only one label, but here one input has one or two labels/elements. I have 245 different classes/elements. I applied the MultiLabelBinarizer to y1, so for an input with two elements I get something like [0 0 1 0 1 ... 0], which contains two 1s. I was trying to build my neural network like on this page: https://www.analyticsvidhya.com/blog/2019/04/build-first-multi-label-image-classification-model-python/ The difference from my network is the input; on that page the inputs are images. – Bienle Feb 27 '20 at 12:41
  • The problem is due to the way accuracy is calculated in Keras for multilabel problems. Have a look at this thread (it does not contain a solution, though): https://stackoverflow.com/questions/50686217/keras-how-is-accuracy-calculated-for-multi-label-classification – Stergios Feb 27 '20 at 18:12
  • Thanks @Stergios, maybe that's right, but I don't know how to use this information to solve my problem; as you said, it 'does not contain a solution though' :) – Bienle Feb 28 '20 at 13:54
  • Maybe this is helpful: https://stackoverflow.com/questions/53037451/keras-custom-metrics-for-multi-label-classfication – Stergios Feb 28 '20 at 14:18
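
To see why the reported accuracy can sit near 99% while few predictions are fully correct, here is a minimal pure-Python sketch. It assumes that the `accuracy` reported under `binary_crossentropy` is averaged over individual label entries (as the linked thread discusses), and it uses made-up labels and a hypothetical all-zeros predictor. `make_labels`, `elementwise_accuracy` and `exact_match_accuracy` are illustrative helper names, not Keras functions.

```python
import random

NUM_CLASSES = 245  # number of elements/labels, as in the question


def make_labels(n_samples, seed=0):
    """Build made-up multi-hot label rows with one or two active classes each."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        row = [0] * NUM_CLASSES
        for idx in rng.sample(range(NUM_CLASSES), rng.choice([1, 2])):
            row[idx] = 1
        rows.append(row)
    return rows


def elementwise_accuracy(y_true, y_pred):
    """Fraction of individual 0/1 entries that match (per-label accuracy)."""
    total = sum(len(row) for row in y_true)
    hits = sum(t == p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return hits / total


def exact_match_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted correctly."""
    return sum(rt == rp for rt, rp in zip(y_true, y_pred)) / len(y_true)


y_true = make_labels(1000)
# Hypothetical degenerate model: predicts "no element present" everywhere.
y_pred = [[0] * NUM_CLASSES for _ in y_true]

print(f"element-wise accuracy: {elementwise_accuracy(y_true, y_pred):.4f}")
print(f"exact-match accuracy:  {exact_match_accuracy(y_true, y_pred):.4f}")
```

With one or two positives out of 245 labels, the all-zeros predictor matches at least 243 of 245 entries per sample (about 99.2%) yet never gets a whole sample right. An exact-match count is therefore closer to what a manual check of the predictions measures.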

1 Answer


Each class should have roughly the same number of cases in it. If one class has many more cases than the others, it will dominate training. In your training dataset, list all classes and how many records you have for each one. If there are any imbalances, you can try to even them out by adding or removing data or by combining classes.
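
A minimal sketch of the counting step described above, assuming each label row is a list of one or two element names. The sample rows and nuclide names are made up for illustration, and `count_class_frequencies` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def count_class_frequencies(label_rows):
    """Count how many records mention each class (a record may carry several)."""
    counts = Counter()
    for labels in label_rows:
        counts.update(set(labels))  # set() avoids double-counting within one record
    return counts


# Made-up label rows: each record has one or two element names.
label_rows = [
    ["Cs-137"],
    ["Co-60", "Cs-137"],
    ["Cs-137"],
    ["Am-241", "Co-60"],
    ["Cs-137", "K-40"],
]

counts = count_class_frequencies(label_rows)
for name, n in counts.most_common():
    print(f"{name}: {n}")

# Ratio between the most and least frequent class as a rough imbalance indicator.
most, least = max(counts.values()), min(counts.values())
print(f"imbalance ratio: {most / least:.1f}")
```

Sorting the counts this way shows at a glance which classes dominate, without inspecting the records by hand.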

Tdoggo
  • Thanks @Tdoggo, but is it normal to have the same number of cases for each class? I know there is a huge imbalance in my training dataset. I thought it would be enough just to increase the size of the training dataset. So do you think it's overfitting because I get an accuracy of 99%? – Bienle Feb 28 '20 at 13:42
  • When you have class imbalance, your accuracy becomes worthless. Let's say you have a dataset of 5,000 images of cats and 10 images of dogs. During testing, your model could classify every image as a cat and still be over 99% accurate. How many categories do you have? And how many records are in each category? One thing you can do, if you have enough data, is remove records from the largest categories until they match the smaller ones. – Tdoggo Feb 29 '20 at 14:56
  • I have 245 categories. I don't know how many records I have in each category, because I use a ready-made script that gives me 5000 random examples (spectra with one nuclide, i.e. one category, and spectra with two nuclides, i.e. two categories); I can only change the number of random examples. But I understand your explanation :). It's hard, though, to look through 5000 examples manually to find the categories that have many examples and then remove them. – Bienle Mar 02 '20 at 16:58
  • Do you have access to the full dataset? – Tdoggo Mar 03 '20 at 06:57
  • Yes. The ready-made code gives me one CSV file with my inputs (e.g. 2000x1024, meaning 2000 spectra with 1024 channels each) and one CSV file with my outputs (my elements/labels; each line contains one or two elements). – Bienle Mar 03 '20 at 17:25
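
The cats-and-dogs example from the comments can be checked numerically. This is a small sketch using the numbers quoted there (5,000 cats, 10 dogs) and a hypothetical classifier that always answers "cat"; `recall` is an illustrative helper, not a library function.

```python
def recall(y_true, y_pred, label):
    """Fraction of the true `label` records that were predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


y_true = ["cat"] * 5000 + ["dog"] * 10
y_pred = ["cat"] * len(y_true)  # hypothetical model that always predicts "cat"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy:   {accuracy:.3f}")
print(f"cat recall: {recall(y_true, y_pred, 'cat'):.3f}")
print(f"dog recall: {recall(y_true, y_pred, 'dog'):.3f}")
```

Overall accuracy stays above 99% even though no dog is ever recognised, which is why a per-class metric such as recall is more informative on imbalanced data.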