I have a machine learning model that i feel is still getting false positives. It can largely detect attacks that i produce separately from the training / test set, maybe at a 80% rate? But for me that is not enough. I also tried to drop columns with high correlation. My biggest problem is my understanding of whether to use one-hot-encoding or not. I can switch between both one hot and sparse and i don't notice a difference at all in my dataset.
The dataset is like this:
column 1 - column 2 - column 3 - etc, all containing stuff like packet properties, and then at the end, the class. So, class 1, class 2 or class 3. Any one row can only belong to one class, it can't be two attack types, it has to distinguish between all attack types and then assign this particular row the best attack type match! This is different from one-hot where it's meant that, if i understand right, a row can belong to multiple attack types. I however notice that nobody ever uses sparse_categorical_crossentropy when it comes to even the iris dataset which is very similar to mine, as it has more than 3 classes.
I can paste my code here and if somebody knows where i am going wrong! :Z
label_encoder = preprocessing.LabelEncoder()
y = ConcatenateAttackList['Label']
encoded_y = label_encoder.fit_transform(y)
y = np_utils.to_categorical(encoded_y)
x = ConcatenateAttackList.drop(['Label', ], axis = 1).astype(float)
sc = MinMaxScaler()
print('x_train, y_train, fitting and transforming.')
x = sc.fit_transform(x)
x,y = oversample.fit_resample(x,y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42,stratify=y,
shuffle=True)
len(x_train)
len(y_train)
X = pd.DataFrame(x_train)
print('x_train, y_train, fitted and transformed.')
with tf.device("CPU"):
train = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(4*256).batch(256)
validate = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(256)
model = Sequential()
print('Model initialized.')
model.add(Dense(64,input_dim=len(X.columns),activation='relu')) # input layer
model.add(tf.keras.layers.BatchNormalization())
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(Dense(6, activation='softmax'))
print('Nodes added to layers.')
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics='categorical_accuracy')
print('Compiled.')
callback=tf.keras.callbacks.EarlyStopping(monitor='val_loss', mode='auto',
patience=50, min_delta=0, restore_best_weights=True, verbose=2)
print('EarlyStopping CallBack executed.')
print('Beginning fitting...')
model_hist = model.fit(x_train, y_train, epochs=231, batch_size=256, verbose=1,
callbacks=[callback],validation_data=validate)
print('Fitting completed.')
model.save("sets/mymodel5.h5")
dump(sc, 'sets/scaler_transformTCPDCV5.joblib')
print('Model saved.')
# loss history
plt.plot(model_hist.history['loss'], label="Training Loss")
plt.plot(model_hist.history['val_loss'], label="Validation Loss")
plt.legend()
#------------PREDICTION
tester = pd.read_csv('AttackTestFile.csv', sep=r'\s*,\s*', engine='python')
ColumnsForWindowsCIC = pd.read_csv('ColumnsForWindowsCIC.csv')
tester.columns = ColumnsForWindowsCIC.columns
tester = deleteRedudancy(tester)
x = tester.drop(['Label', ], axis = 1)
fit_new_input = sc.transform(x)
predict_y=model.predict(fit_new_input)
predict_y
classes_y=np.argmax(predict_y,axis=1)
classes_y
predict = label_encoder.inverse_transform(classes_y)
predict