
I'm trying to train and test a decision tree classifier, and I'm still new to decision trees. My CSV file has 150 rows and two columns, and I want to split it into 100 rows for training and 50 for testing. I've tried using scikit-learn, but I still don't understand how to do it.

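from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier

# train_x, train_Y, test_x, test_Y are the splits I don't know how to create
classifier = DecisionTreeClassifier(random_state=17)
classifier.fit(train_x, train_Y)
pred_y = classifier.predict(test_x)
print(classification_report(test_Y, pred_y))
print(accuracy_score(test_Y, pred_y))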

Can anyone help me with how to do this? I appreciate any help.

fera fani

1 Answer


You need to perform a train-test-split.

Since you have 150 samples in total and want 50 of them in your test set, you can set test_size to the integer 50. The remaining 100 samples then form the training set.

You might want to set random_state for reproducibility. Generally, it's also good advice to leave shuffle=True (the default) enabled. If your data is time-correlated, set shuffle=False instead to prevent data leakage. You can find detailed examples in this book.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=50, random_state=42)
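
To connect the split to the classifier code from the question, here is a minimal end-to-end sketch. It assumes the two CSV columns are one feature and one label; since the real file isn't available, the data below is synthetic, and the names feature and label as well as the labelling rule are made up for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the CSV: 150 rows, one feature column, one label column.
# (Assumption: with a real file you would load these two columns instead.)
rng = np.random.default_rng(0)
feature = rng.uniform(0.0, 10.0, size=150)
label = (feature > 5.0).astype(int)  # made-up rule so the tree has something to learn

X = feature.reshape(-1, 1)  # scikit-learn expects a 2-D array of features
y = label

# 50 samples go to the test set, the remaining 100 to the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=42)

# Fit on the training set only, then evaluate on the unseen test set.
classifier = DecisionTreeClassifier(random_state=17)
classifier.fit(X_train, y_train)
pred_y = classifier.predict(X_test)

print(classification_report(y_test, pred_y))
accuracy = accuracy_score(y_test, pred_y)
print(accuracy)
```

With your own data, X would be the feature column of your file (kept two-dimensional) and y the label column; the rest stays the same.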
KarelZe
  • So to get training and testing datasets, you need to split the data? – fera fani Jan 04 '23 at 07:18
  • Yes, split your master dataframe into two subsets. Use the training set to fit your model and the test set to evaluate your model on unseen samples. You could have an additional validation set, if you perform hyperparameter tuning, but the number of samples will get really small. – KarelZe Jan 04 '23 at 08:24
  • OK, thanks for the explanation – fera fani Jan 04 '23 at 09:06