Random Forest Train Test Split Accuracy

Question

I am working through a random forest model for the first time and have come across an issue with my accuracy quantification.

Currently, I split the dataset (30% as test size), fit the model, predict y values with it, and score the model on the predicted test values. But I am getting 100% accuracy, and I am wondering whether that is because of the parameters I set for the model or because I made a syntax error along the way.

Split the dataset into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=1)

Create and fit the model

# Import the model we are using
from sklearn.ensemble import RandomForestRegressor

# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000,
                           random_state = 42,
                           min_samples_split = 10,
                           max_features = "sqrt",
                           bootstrap = True)

# Train the model on training data
rf.fit(X_train, y_train)

Predict on test set and calculate accuracy

y_pred = rf.predict(X_test)

print("Accuracy:", round((rf.score(X_test, y_pred)*100),2), "%")

>> 100.0%

I am definitely learning as I go, but have had some formal training. I am really just thrilled about modeling, and want to figure out what mistakes I am making as I continue learning this process.

If you are just looking for accuracy, you can directly go with the `accuracy_score()` function from scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html - Ashwin Geet D'Sa, Apr 15 '21 at 22:14

score 0 · Answer 1 · edited May 11 '21 at 21:44

You are almost there! The score() method takes X_test and y_test, not your predictions. The logic behind score() is roughly:

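# simplified logic behind score() for a regressor
from sklearn.metrics import r2_score

def score(model, X, y):
  y_predicted = model.predict(X)      # score() makes its own predictions from X
  value = r2_score(y, y_predicted)    # regressors use R^2; classifiers use accuracy
  return value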

The above logic is just to show how the score works.

To get the score in your code:

rf.score(X_test, y_test)

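You will get the R^2 score (see the scikit-learn docs). Do you see now why you got 100%? Passing y_pred as the second argument makes score() compare the model's predictions with themselves, so R^2 is always 1.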
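To see the difference between the two calls side by side, here is a minimal sketch. It assumes synthetic data from `make_regression` as a stand-in for your X and y (your real dataset is not shown in the question), and it uses fewer trees than your model only to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's X and y (an assumption, not the real data)
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Same settings as in the question, but with 50 trees to keep the example fast
rf = RandomForestRegressor(
    n_estimators=50,
    random_state=42,
    min_samples_split=10,
    max_features="sqrt",
    bootstrap=True,
)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

# Predictions scored against themselves: a perfect match by construction
self_score = rf.score(X_test, y_pred)

# Predictions scored against the held-out targets: the number you actually want
real_score = rf.score(X_test, y_test)

print("score(X_test, y_pred):", self_score)
print("score(X_test, y_test):", real_score)
print("r2_score(y_test, y_pred):", r2_score(y_test, y_pred))
```

The first value is 1.0 no matter how good or bad the model is, while the second and third agree with each other and reflect the real fit on the test set.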

If you want other metrics, compute the predictions yourself and pass them to the regression metrics: https://scikit-learn.org/stable/modules/classes.html#regression-metrics
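As a rough illustration of a few of those regression metrics, the sketch below uses small hypothetical arrays in place of `y_test` and `rf.predict(X_test)`. In your own script you would pass the real arrays instead.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical values standing in for y_test and rf.predict(X_test)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_hat)   # average absolute error
mse = mean_squared_error(y_true, y_hat)    # average squared error
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_hat)               # what rf.score() reports

print("MAE: ", mae)
print("MSE: ", mse)
print("RMSE:", rmse)
print("R^2: ", r2)
```

Note the argument order: the true values come first and the predictions second.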

You can also use AutoML as a learning aid (for yourself, not for the model). Run AutoML to create baseline models; it will compute many metrics for you. Then write your own script and compare the results.
