While evaluating an XGBoost model, I found that the transaction_id column, which is just a sequence of numbers from 1 to the length of the dataframe, has a higher feature importance than all the other columns. I also have a column of random values, which has zero feature importance. Does a random train-test split without removing the transaction_id column result in data leakage? There are multiple transaction_ids for a single person in the dataframe.
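To make the setup concrete, here is a minimal sketch of a split that keeps all of one person's transactions on the same side, which is one way to check whether a row-wise random split is what makes transaction_id look important. The `person_id` column, the toy data, and the helper `group_train_test_split` are assumptions for illustration and are not from my real dataframe.

```python
import random

def group_train_test_split(rows, group_key, test_fraction=0.25, seed=0):
    """Split rows so that each group lands entirely in train or entirely in test."""
    # Collect the distinct groups and shuffle them reproducibly.
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)

    # Hold out a fraction of the groups (not of the rows).
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])

    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

# Hypothetical toy data: 6 people with 4 transactions each,
# transaction_id running from 1 to the number of rows.
rows = [{"transaction_id": i + 1, "person_id": i // 4} for i in range(24)]

train, test = group_train_test_split(rows, "person_id", test_fraction=0.25, seed=0)
train_people = {r["person_id"] for r in train}
test_people = {r["person_id"] for r in test}

# No person appears on both sides of the split.
print(train_people.isdisjoint(test_people))
```

With pandas and scikit-learn, `GroupShuffleSplit` from `sklearn.model_selection` does the same kind of split when given the person identifier as `groups`. If transaction_id stays important after a grouped split and after being dropped from the features for comparison, the row order itself probably correlates with the target.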
What do you think could be the answer? – AlexK Jun 13 '22 at 04:31
I'm not sure. I think it could also mean that as the number of transactions increases (as tracked by the index-like variable), the chance of fraud increases. – Divyanshu Chauhan Jun 15 '22 at 15:37