I am using sklearn's train_test_split and MLPClassifier for human activity recognition. Below is my dataset in a pandas DataFrame:
a_x a_y a_z g_x g_y g_z activity
0 3.058150 5.524902 -7.415221 0.001280 -0.022299 -0.009420 sit
1 3.065333 5.524902 -7.422403 -0.003514 -0.023764 -0.007289 sit
2 3.065333 5.524902 -7.422403 -0.003514 -0.023764 -0.007289 sit
3 3.064734 5.534479 -7.406840 -0.016830 -0.025628 -0.003294 sit
4 3.074910 5.548246 -7.408038 -0.023488 -0.025495 -0.001963 sit
... ... ... ... ... ... ... ...
246886 8.102990 -1.226492 -4.559391 -0.511287 0.081455 0.109515 run
246887 8.120349 -1.218711 -4.595306 -0.516480 0.089179 0.110047 run
246888 8.126933 -1.209732 -4.619848 -0.521940 0.096636 0.109382 run
246889 8.140102 -1.199556 -4.622840 -0.526467 0.102761 0.108183 run
246890 8.142496 -1.199556 -4.648580 -0.530728 0.109818 0.108050 run
1469469 rows × 7 columns
I am using the 6 numerical columns (x, y, z from the accelerometer and gyroscope) to predict activity (run, sit, walk). My code looks like:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

mlp = MLPClassifier(hidden_layer_sizes=(10,), activation='relu', solver='adam',
                    learning_rate='adaptive', early_stopping=True, learning_rate_init=0.001)

X = HAR.drop(columns='activity').to_numpy()
y = HAR['activity'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.10)
mlp.fit(X_train, y_train)

predictions_train = mlp.predict(X_train)
predictions_test = mlp.predict(X_test)

print("Fitting of train data for size (10,): \n", classification_report(y_train, predictions_train))
print("Fitting of test data for size (10,): \n", classification_report(y_test, predictions_test))
Output is:
Fitting of train data for size (10,):
               precision    recall  f1-score   support

         run        1.00      1.00      1.00     49265
         sit        1.00      1.00      1.00     49120
        walk        1.00      1.00      1.00     48561

    accuracy                            1.00    146946
   macro avg        1.00      1.00      1.00    146946
weighted avg        1.00      1.00      1.00    146946

Fitting of test data for size (10,):
               precision    recall  f1-score   support

         run        1.00      1.00      1.00    441437
         sit        1.00      1.00      1.00    442540
        walk        1.00      1.00      1.00    438546

    accuracy                            1.00   1322523
   macro avg        1.00      1.00      1.00   1322523
weighted avg        1.00      1.00      1.00   1322523
I am relatively new to ML, but I think I understand the concept of overfitting, so I imagine that is what is happening here. What I don't understand is how the model can be overfitting when it is only being trained on 10% of the dataset. Also, presumably the classification report should always be perfect for the X_train data, since that is the data the model was trained on, correct?
No matter what I do, it always produces a perfect classification_report for the X_test data, no matter how little data I train on (train_size=0.10 here, but I've also tried 0.25, 0.33, 0.5, etc.). I even removed the gyroscope data and trained only on the accelerometer data, and it still gave a perfect 1.00 for every precision, recall, and F1 score.
When I arbitrarily slice the original dataset in half and use the resulting arrays as train and test data, the predictions for X_test are not perfect, but every time I use sklearn's train_test_split it returns a perfect classification report. So I assume I am doing something wrong with how I am using train_test_split? The comparison I mean is sketched below.
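For reference, this is roughly the comparison, as a minimal sketch continuing from the X, y, and mlp defined above. It assumes the rows of HAR are grouped by activity, as the DataFrame printout above (sit at the top, run at the bottom) suggests:

# "Arbitrary" slice in half: first half of the rows for training, second half for testing.
# If the rows are grouped by activity (assumption), the two halves contain different activity mixes.
half = len(X) // 2
X_train, X_test = X[:half], X[half:]
y_train, y_test = y[:half], y[half:]
mlp.fit(X_train, y_train)
print(classification_report(y_test, mlp.predict(X_test)))   # NOT perfect

# train_test_split with its default shuffle=True: rows are sampled at random.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.10)
mlp.fit(X_train, y_train)
print(classification_report(y_test, mlp.predict(X_test)))   # perfect 1.00s, as shown above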