
I have a sample time-series dataset of shape (23, 14291), which is a pivot table of 24-hour counts for some users. I'm trying to filter out the columns/features that don't have a time-series nature, so that I'm left with meaningful features. I have already tried the PCA method to keep features with a high amount of data variance, and a correlation matrix to exclude highly correlated columns/features.
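
A minimal sketch of the kind of filtering I mean, assuming `df3` is the (23, 14291) pivot table; the 0.95 thresholds are arbitrary values for illustration, not from my actual experiments:

import numpy as np
from sklearn.decomposition import PCA

# correlation filter: drop one column out of each highly correlated pair
corr = df3.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df3_uncorrelated = df3.drop(columns=to_drop)

# PCA: keep just enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(df3.values)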

Now I wanted to experiment with feature importance based on this post, using some regressors, but it was unsuccessful.

I have tried the following:

from sklearn.model_selection import train_test_split

# keep chronological order for the time series: shuffle=False
trainingSet, testSet = train_test_split(df3,
                                        test_size=0.2,
                                        random_state=42,
                                        shuffle=False)

import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from xgboost import XGBRegressor, plot_importance

# use one column as the label and the remaining columns as features,
# so that the target does not leak into the feature matrix
target_col = df3.columns[1]
X_train = trainingSet.drop(columns=[target_col]).values
y_train = trainingSet[target_col].values

X_test = testSet.drop(columns=[target_col]).values
y_test = testSet[target_col].values.astype('float32')

dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test, y_test)

# "reg:linear" is deprecated; "reg:squarederror" is the current name
params = {"objective": "reg:squarederror", 'colsample_bytree': 0.3,
          'learning_rate': 0.1, 'max_depth': 5, 'alpha': 10}
num_round = 2
model_xgb_1user = xgb.train(params, dtrain, num_round)

pred_test_xgb_1user = model_xgb_1user.predict(dtest)

#from sklearn.multioutput import MultiOutputRegressor
#xgb = MultiOutputRegressor(XGBRegressor(n_estimators=100)).fit(trainingSet, testSet)
#xgb = XGBRegressor(n_estimators=100)
#xgb.fit(trainingSet, testSet)

# fit a sklearn-style regressor so feature_importances_ is available
# (plain `xgb` above is the module, which has no feature_importances_)
reg = XGBRegressor(n_estimators=100).fit(X_train, y_train)
feature_names = df3.columns.drop(target_col)
sorted_idx = reg.feature_importances_.argsort()
plt.barh(feature_names[sorted_idx], reg.feature_importances_[sorted_idx])
plt.xlabel("Xgboost Feature Importance")

# F-score importances from the booster trained with xgb.train above
pd.DataFrame(list(model_xgb_1user.get_fscore().items()),
             columns=['feature', 'importance']).sort_values('importance', ascending=False)

I'm not sure how to handle this with regressors when there is no label. I also read the post Xgboost Feature Importance Computed in 3 Ways with Python, but I couldn't manage to pass a time-series-based dataset through it for feature importance.
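
The closest framing I can think of is to treat one column as the label and the remaining columns as features at the same timestamps. A minimal sketch under that assumption (the choice of column is arbitrary, for illustration only):

import pandas as pd
from xgboost import XGBRegressor

# treat one user/column as the label, all remaining columns as features
target_col = df3.columns[0]                 # arbitrary choice for illustration
X = df3.drop(columns=[target_col])
y = df3[target_col]

reg = XGBRegressor(n_estimators=100, random_state=42).fit(X, y)

# rank the remaining columns by how much they help predict the chosen one
importances = pd.Series(reg.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(20))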

  • How was it unsuccessful? Did you receive a result, or did you face an error? If you did not face an error, you should have obtained sorted feature importances, which might be easier to interpret in unsorted format - removing the argsort from the plotting. – paloman Feb 28 '22 at 14:55
  • @paloman I couldn't manage to adapt the data, since I used [`MultiOutputRegressor()`](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html) to predict all columns/features, like `MultiOutputRegressor(XGBRegressor(n_estimators=100)).fit(trainingSet, testSet)`, but it turns out that despite this trick, *feature importance* can't be obtained easily, so I commented those lines out (see the sketch after these comments). Then I tried it on a single column/user, as you can see in the code above, unsuccessfully. – Mario Feb 28 '22 at 18:00
  • My aim is to take advantage of *feature importance* with XGBRegressor to filter out unnecessary columns that don't have a meaningful, time-series-friendly nature. Alternatively, I want to compare the results with the PCA & correlation-matrix results. – Mario Feb 28 '22 at 18:01
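
A minimal sketch of the `MultiOutputRegressor` attempt described in the comments, assuming `X` is a feature matrix and `Y` a multi-column target built from `df3` (both are placeholders here); pulling per-output importances from the fitted sub-estimators is an assumption about how it could be done, not something confirmed in the thread:

import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

# MultiOutputRegressor fits one estimator per target column; the fitted
# sub-estimators are exposed via `estimators_`, each with its own importances
multi = MultiOutputRegressor(XGBRegressor(n_estimators=100)).fit(X, Y)
per_target = np.vstack([est.feature_importances_ for est in multi.estimators_])
mean_importance = per_target.mean(axis=0)  # average importance across all targets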
