Deep Learning: Multiclass Classification with same amount of labels between the training dataset and test dataset

Question

I'm writing code for multiclass classification. My custom datasets have 7 columns (6 features and 1 label). The training dataset has 2 label values (1 and 2), and the testing dataset has 3 label values (1, 2, and 3). The aim is to see how well the model predicts the label '3'. For now I'm trying an MLP; the code is as follows:

import tensorflow as tf
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding
from keras import optimizers
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels
from keras.models import load_model
from sklearn.externals import joblib
from joblib import dump, load
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
#from keras.layers import Dense, Embedding, LSTM, GRU
#from keras.layers.embeddings import Embedding


#Load the test dataset
df1 = pd.read_csv("/home/user/Desktop/FinalTestSet.csv")
test = df1

le = LabelEncoder()

test['Average_packets_per_flow'] = le.fit_transform(test['Average_packets_per_flow'])
test['Average_PktSize_per_flow'] = le.fit_transform(test['Average_PktSize_per_flow'])
test['Avg_pkts_per_sec'] = le.fit_transform(test['Avg_pkts_per_sec'])
test['Avg_bytes_per_sec'] = le.fit_transform(test['Avg_bytes_per_sec'])
test['N_pkts_per_flow'] = le.fit_transform(test['N_pkts_per_flow'])
test['N_pkts_size_per_flow'] = le.fit_transform(test['N_pkts_size_per_flow'])

#Select the x and y columns from dataset
xtest_Val = test.iloc[:,0:6].values
Ytest = test.iloc[:,6].values
#print Ytest

#MinMax Scaler
scaler = MinMaxScaler(feature_range=(-1, 1))
Xtest = scaler.fit_transform(xtest_Val)

#print Xtest

#Load the train dataset
df2 = pd.read_csv("/home/user/Desktop/FinalTrainingSet.csv")
train = df2

le = LabelEncoder()

test['Average_packets_per_flow'] = le.fit_transform(test['Average_packets_per_flow'])
test['Average_PktSize_per_flow'] = le.fit_transform(test['Average_PktSize_per_flow'])
test['Avg_pkts_per_sec'] = le.fit_transform(test['Avg_pkts_per_sec'])
test['Avg_bytes_per_sec'] = le.fit_transform(test['Avg_bytes_per_sec'])
test['N_pkts_per_flow'] = le.fit_transform(test['N_pkts_per_flow'])
test['N_pkts_size_per_flow'] = le.fit_transform(test['N_pkts_size_per_flow'])

#Select the x and y columns from dataset
xtrain_Val = train.iloc[:,0:6].values
Ytrain = train.iloc[:,6].values
#print Ytrain

#MinMax Scaler
scaler = MinMaxScaler(feature_range=(-1, 1))

# Fit the model
Xtrain = scaler.fit_transform(xtrain_Val)


#Reshape data for CNN
Xtrain = Xtrain.reshape((Xtrain.shape[0], 1, 6, 1))
print(Xtrain)
#Xtest = Xtest.reshape((Xtest.shape[0], 1, 6, 1))
#print Xtrain.shape

max_length=70
EMBEDDING_DIM=100
vocab_size=100
num_labels=2

#Define model
def init_model():
    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=Xtrain.shape[0]))
    model.add(Flatten())
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='relu'))  
    model.add(Flatten())
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='softmax'))
    model.add(Flatten())

#adam optimizer
    adam = optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)

    model.compile(optimizer = adam, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

print('Train...')
model = init_model()

#To avoid overfitting
callbacks = [EarlyStopping('val_loss', patience=3)]
hist = model.fit(Xtrain, Ytrain, epochs=50, batch_size=50, validation_split=0.20, callbacks=callbacks, verbose=1)

#Evaluate model and print results
score, acc = model.evaluate(Xtest, Ytest, batch_size=50)
print('Test score:', score)
print('Test accuracy:', acc)

However, I'm getting the following error:

ValueError: Input 0 is incompatible with layer flatten_1: expected min_ndim=3, found ndim=2

I tried removing the Flatten layers, but then I get a different error:

ValueError: Error when checking input: expected dense_1_input to have shape (424686,) but got array with shape (6,)

424686 is the number of rows in the dataset and 6 is the number of features.

I'd appreciate any suggestions. Thank you.

Based on Omarfoq's suggestion, I now use three labels for both the training and testing datasets. The code and the error remain unchanged.

Can anyone please suggest a solution? Thank you.

score 1 · Answer 1 · answered Jan 13 '20 at 15:57

I would say that what you are trying to do is not logical: your model will never predict class "3" if that class doesn't exist in the training set. As posed, the task doesn't make sense. Try to reformulate your problem.
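To make the point above concrete, here is a minimal NumPy sketch (not the asker's Keras model). It assumes a classifier whose final layer has one softmax output per training label, and uses random numbers as stand-in logits. Whatever the inputs are, the argmax over two outputs can only select one of the two training labels.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical stand-in for the final-layer outputs of a 2-class model.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 2))

train_labels = np.array([1, 2])          # the only labels seen in training
probs = softmax(logits)                  # shape (1000, 2)
predicted = train_labels[probs.argmax(axis=1)]

# Label 3 has no output unit, so it can never be selected.
print(np.unique(predicted))
```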

hola


Can you please elaborate on why it isn't possible? Let's say my training dataset has two types of pictures, 'cat' and 'bird', and I want to distinguish the 'dog' pictures that are only available in the test dataset. Shouldn't the model be able to tell that the 'dog' pictures are neither 'cat' nor 'bird' pictures? – Naina Kulkarni Jan 13 '20 at 16:18
Well, in practice there is a way to do this: add some random images to your dataset and ask your model to classify images as "cat", "bird", or "out_of_class". But if you create a model that only classifies images as "cat" or "bird", this won't work properly. As a really dirty workaround, you can give the label "dog" to images that get roughly equal probability for the two classes, but don't expect this to give you good results. – hola Jan 13 '20 at 16:27
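The "dirty" workaround in this comment could be sketched roughly as follows, using NumPy only. The function name `predict_with_rejection`, the `threshold` value, and the probability rows are hypothetical and chosen for illustration. Rows where neither known class is confident are mapped to "out_of_class".

```python
import numpy as np

def predict_with_rejection(probs, class_names, threshold=0.9,
                           unknown="out_of_class"):
    """Return the most likely class, or `unknown` when the top probability
    falls below `threshold` (i.e. the known classes are nearly tied)."""
    probs = np.asarray(probs, dtype=float)
    best = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= threshold
    return [class_names[i] if ok else unknown
            for i, ok in zip(best, confident)]

# Hypothetical softmax outputs of a 2-class ("cat" vs "bird") model.
probs = np.array([[0.97, 0.03],
                  [0.52, 0.48],
                  [0.08, 0.92]])

labels = predict_with_rejection(probs, ["cat", "bird"], threshold=0.9)
print(labels)  # ['cat', 'out_of_class', 'bird']
```

As the comment warns, this is unreliable: a softmax classifier can be very confident on inputs unlike anything it was trained on, so the threshold would need to be tuned on held-out data.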
Thank you for the explanation. Is it correct to say that the error I got is because I use a different number of labels for the training and testing datasets? – Naina Kulkarni Jan 14 '20 at 14:32
Definitely yes. The problem you are having is because you are using an output of size 2, so your model outputs a vector of dimension 2, while in the second step you expect it to give you a vector of size 3. This is what causes the error you have. – hola Jan 14 '20 at 14:37
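The size mismatch described in this comment can be illustrated with a small NumPy sketch. The `one_hot` helper and the label arrays are hypothetical. Note that both tracebacks quoted in the question concern the input side of the network, so an output-width mismatch like this one would be a separate problem, not necessarily the cause of those errors.

```python
import numpy as np

def one_hot(labels, classes):
    """One-hot encode `labels` against a fixed, sorted list of `classes`."""
    classes = np.asarray(classes)
    labels = np.asarray(labels)
    if not np.isin(labels, classes).all():
        raise ValueError("label outside the declared class list")
    index = np.searchsorted(classes, labels)   # position of each label
    return np.eye(len(classes), dtype="float32")[index]

y_train = np.array([1, 2, 2, 1])   # two label values
y_test = np.array([1, 3, 2])       # three label values

# Encoding each set against its own labels gives different widths.
print(one_hot(y_train, np.unique(y_train)).shape)  # (4, 2)
print(one_hot(y_test, np.unique(y_test)).shape)    # (3, 3)

# Encoding both against one shared class list keeps the widths equal.
shared_classes = [1, 2, 3]
train_encoded = one_hot(y_train, shared_classes)
test_encoded = one_hot(y_test, shared_classes)
print(train_encoded.shape, test_encoded.shape)     # (4, 3) (3, 3)
```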
I tried with the same number of labels in both datasets, and yet the same error is reproduced. Can you please suggest an alternative solution? – Naina Kulkarni Jan 15 '20 at 09:39
Can you please update the question by adding the new code and the new error you have? – hola Jan 15 '20 at 09:40
I have added the new description, but basically the code and error remain the same. Kindly look at the issue. – Naina Kulkarni Jan 15 '20 at 14:11
`model.add(Dense(64, activation='relu', input_dim=Xtrain.shape[0]))` change this to `model.add(Dense(64, activation='relu', input_shape=(1, 6, 1)))` – hola Jan 15 '20 at 15:23
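For reference, both tracebacks in the question point at the input side of the network. `Xtrain.shape[0]` is the number of rows (424686), whereas a Dense layer needs the number of features per row (6, i.e. `Xtrain.shape[1]` before any reshape). `Flatten` also complains because a plain MLP already works on 2-D input. Separately, the second block of `LabelEncoder` calls in the question operates on `test`, not `train`, which looks like a copy-paste slip.

What follows is a minimal sketch of one way to restructure the model, not a verified fix for the asker's data. It assumes TensorFlow 2.x with `tensorflow.keras`, input kept as a 2-D array of shape (rows, 6) with no 4-D reshape, three classes, and integer labels shifted to start at 0 so that `sparse_categorical_crossentropy` can be used. The random arrays are synthetic stand-ins for the CSV files.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_FEATURES = 6   # feature columns per row (Xtrain.shape[1], not shape[0])
NUM_CLASSES = 3    # labels 1, 2, 3 (assumed, per the updated question)

def init_model(num_features=NUM_FEATURES, num_classes=NUM_CLASSES):
    """Plain MLP: 2-D input, no Flatten, one softmax unit per class."""
    model = keras.Sequential([
        keras.Input(shape=(num_features,)),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss="sparse_categorical_crossentropy",  # integer labels 0..K-1
        metrics=["accuracy"],
    )
    return model

# Synthetic stand-in data (assumption: features already scaled to [-1, 1]).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(200, NUM_FEATURES)).astype("float32")
y_demo = rng.integers(1, NUM_CLASSES + 1, size=200)   # labels in {1, 2, 3}
y_zero_based = y_demo - 1                             # shift to {0, 1, 2}

model = init_model()
model.fit(X_demo, y_zero_based, epochs=2, batch_size=50,
          validation_split=0.2, verbose=0)

probs = model.predict(X_demo[:5], verbose=0)   # one probability per class
predicted_labels = probs.argmax(axis=1) + 1    # back to the 1-based labels
```

With real data, it would also be usual to fit the `MinMaxScaler` on the training set only and reuse it with `transform` on the test set, so both sets are scaled the same way.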
