I am using a Random Forest classifier for classification, and I get different results every time I run it. My code is as follows.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

input_file = 'sample.csv'

df1 = pd.read_csv(input_file)
df2 = pd.read_csv(input_file)
X = df1.drop(['lable'], axis=1)  # Features
y = df2['lable']  # Labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

As suggested in other answers, I added the parameters n_estimators and random_state. However, the results still change between runs.

I have attached the csv file here:

I am happy to provide more details if needed.

Venkatachalam
EmJ

1 Answer

You need to set the random state for the train-test splitting as well.

The following code would give you reproducible results. The recommended approach is not to change the random_state value for improving performance.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

df1 = pd.read_csv('sample.csv')

X = df1.drop(['lable'], axis=1)  # Features
y = df1['lable']  # Labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Output:

Accuracy: 0.6777777777777778
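
Since sample.csv is not available here, the effect can be checked with a minimal sketch on synthetic data. The `make_classification` dataset and the helper name `run_once` are assumptions made for illustration and are not part of the original code. With both seeds fixed, repeated runs should give the same accuracy; with only the classifier seeded, the split changes on every call, so the score usually changes too.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for sample.csv (assumption: the real data is not available here).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)


def run_once(split_seed):
    """Train and score one forest; split_seed=None leaves the split unseeded."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=split_seed
    )
    clf = RandomForestClassifier(random_state=42, class_weight="balanced")
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))


# Both seeds fixed: every run sees the same split and builds the same forest.
seeded = [run_once(split_seed=5) for _ in range(3)]
print("seeded:  ", seeded)

# Only the classifier is seeded: the rows in the test set differ between runs.
unseeded = [run_once(split_seed=None) for _ in range(3)]
print("unseeded:", unseeded)
```

The three seeded scores should be identical, while the unseeded ones will typically differ, which is the behaviour described in the question.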

Venkatachalam
  • Thanks a lot for the great answer. Could you please tell me what you mean by `The recommended approach is not to change the random_state value for improving performance.`? I did not get it. Looking forward to hearing from you. Thank you :) – EmJ Mar 28 '19 at 04:26
  • Once you fix a value for random_state, don't change it during your modelling process (when you tune other parameters like `n_estimators`, `max_depth`, etc.) – Venkatachalam Mar 28 '19 at 04:31
  • Also, kindly take some time to review https://stackoverflow.com/help/someone-answers – Venkatachalam Mar 28 '19 at 04:36
  • Thanks a lot. Could you please tell me why we have two `random_state` values, 5 and 42? Don't they need to be the same? Moreover, is there a way to identify optimal `random_state` values, or can we assign them randomly? Thank you once again :) – EmJ Mar 28 '19 at 04:40
  • Glad I could help! You can use different values for these two random_state parameters – Venkatachalam Mar 28 '19 at 04:59
  • You should not try to find the optimal random_state. Read [here](https://stats.stackexchange.com/questions/263999/is-random-state-a-parameter-to-tune). Just assign it some arbitrary value! – Venkatachalam Mar 28 '19 at 05:00
  • Please let me know if you know an answer for this: https://stackoverflow.com/questions/55466081/how-to-calculate-feature-importance-in-cross-validatoin-in-sklearn Thank you :) – EmJ Apr 02 '19 at 05:10
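
The point in the comments about not tuning random_state can be illustrated with a rough sketch, again on synthetic data (the dataset, the seed range, and the variable names are assumptions for illustration only). The split seed only decides which rows land in the test set, so the differences in accuracy across seeds reflect sampling noise rather than a better or worse model. Reporting the mean and spread is more informative than picking the seed with the highest score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for sample.csv (assumption: the real data is not available here).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

scores = []
for split_seed in range(5):
    # Only the split seed varies; the classifier seed stays fixed at 42.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=split_seed
    )
    clf = RandomForestClassifier(random_state=42, class_weight="balanced")
    clf.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, clf.predict(X_test)))

print("accuracy per split seed:", scores)
print("mean:", np.mean(scores), "std:", np.std(scores))
```

Choosing the seed with the highest score here would amount to selecting a favourable test set, which is why the seed is fixed once and then left alone while other parameters are tuned.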