I am creating a LightGBM model for prediction using Python. Initially, I split the data using sklearn.model_selection.train_test_split, which resulted in a lower mean absolute error (MAE). Later, I did the split another way, by dividing the dataframe into two separate dataframes, df_train and df_test. With this approach, the MAE is significantly higher than with the earlier approach. Is the use of sklearn.model_selection.train_test_split mandatory in LightGBM, or can the data be split in any way? If it is not mandatory, the results should be somewhat similar; in my case, they are very different. Looking for suggestions/help.
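Roughly, the two approaches look like this (a simplified sketch with toy data; my real dataframe and column names differ):

import pandas as pd
from sklearn.model_selection import train_test_split

# toy data standing in for the real dataframe (assumption)
df = pd.DataFrame({'feature': range(100), 'target': range(100)})
X, y = df[['feature']], df['target']

# Approach 1: train_test_split shuffles the rows before splitting by default
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Approach 2: manual, ordered split -- the first 80% of rows become the
# training set, so any ordering in the data ends up entirely on one side
cut = int(len(df) * 0.8)
df_train, df_test = df.iloc[:cut], df.iloc[cut:]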

CSK

1 Answer


To always get the same split from sklearn.model_selection.train_test_split, you have to fix the random_state:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

From the documentation:

random_state : int, RandomState instance or None, optional (default=None)

    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Otherwise you cannot reproduce the same result.
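You can check this with toy data, for example:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# same random_state -> identical split on every call
a, _, _, _ = train_test_split(X, y, test_size=0.33, random_state=42)
b, _, _, _ = train_test_split(X, y, test_size=0.33, random_state=42)
print((a == b).all())  # True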

If you have the feeling the split does not fit your dataframe, you should use cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html . That way you avoid over- or underfitting to one specific train/test split.
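For example, a 5-fold cross-validation sketch with LightGBM's scikit-learn wrapper (LGBMRegressor and the toy data are assumptions; MAE is used as the scoring metric since that is what you compare):

import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score

# toy regression data standing in for your dataset (assumption)
X, y = make_regression(n_samples=500, n_features=10, random_state=42)

model = lgb.LGBMRegressor(random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# MAE averaged over 5 different train/test splits
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_absolute_error')
print(-scores.mean())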

PV8
  • Thanks for the information. However, my question is not about splitting the data the same way every time. My question is whether it is possible to use methods other than train_test_split to split the data for LightGBM. – CSK Aug 20 '19 at 12:49
  • 1
    I have literaly no idea of what are lightGBM but like all machine learning algortihm, they are sensitive of training and validation data distribution. You can use whatever method you want, the only difference you have seen in you'r result are caused by the different training set! – akhetos Aug 20 '19 at 12:54
  • 2
    btw, keep using traint_test_split and use the argument **shuffle=True** which will shuffle you'r data and lower the chance to have data from only one class in ur training set (which probably happened when you made you'r split manually) – akhetos Aug 20 '19 at 12:55
  • 1
    If you have an issue with one specific outcome, use cross-validation and run several train/test splits to avoid a specific outcome because of the split – PV8 Aug 20 '19 at 12:55
  • Thanks for the suggestions. – CSK Aug 20 '19 at 15:06