How do I select the best features when the dataset has no target variable and feature importance can change over time?

Question

My dataset has 200 features and 500 rows. From these I must select the best 30 features to use in a sales prediction model instead of all 200, but the feature importance can change over time. The interesting thing is that the dataset doesn't have a target variable.

How do I select the best features out of all the features when the dataset has no target variable, given that the prediction model's feature importance can change over time?

If the feature importance can change over time, how do I select the best features?

P.S.: I tried a Pearson correlation matrix, but I want to select the K best features for model training. I also tried a chi-squared test to select the best features, but it ended with errors because I couldn't provide a target variable.

score 1 · Answer 1 · answered Oct 19 '22 at 10:53


It really depends. If I understand you correctly, you want to train a new model every x time steps, selecting new features each time?

This wouldn't really make sense in the long run, because models trained on different features will produce hugely different results.

Also, I don't understand the part about not having a target variable. If you don't have a target to predict, what exactly are you trying to achieve with the ML model?

I would advise you to take a step back and really think about what you want to achieve and how you want to achieve it.

answered Oct 19 '22 at 10:53

Niko


This is a sales prediction model, but before preparing and selecting an algorithm for the model I need to do the feature engineering task, and there was no identifiable target column in the data. But I need to predict sales. What if the feature importance is seasonal? – sunone5 Oct 19 '22 at 11:10

Again, what do you mean by "there was no identifiable target column"? The target variable is what you want the ML model to predict. If you need to predict sales, sales would be your target column. For training you need a target column to calculate the loss against. If your data is, e.g., price, margin and sales, you need to input all three columns into the model; if you just have price and margin, you can't predict sales because you don't have that data. – Niko Oct 19 '22 at 11:36
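To illustrate the point about needing a target, here is a minimal sketch of "select the K best features" by ranking each feature on its absolute Pearson correlation with a sales column. The toy column names and values are hypothetical and not taken from the asker's dataset; the helper functions are illustrative, written in plain Python so the dependence on the target is visible. Without the `sales` list there is nothing to rank against, which is also why a chi-squared test fails when no target is supplied.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def top_k_by_target_correlation(features, target, k):
    """Return the k feature names with the largest |correlation| to the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy data: "sales" plays the role of the target column.
features = {
    "price":  [10, 12, 14, 16, 18],
    "margin": [3, 1, 4, 1, 5],
    "noise":  [7, 7, 8, 7, 7],
}
sales = [100, 90, 80, 70, 60]

selected = top_k_by_target_correlation(features, sales, k=2)
print(selected)
```

A correlation ranking like this only captures linear, one-feature-at-a-time relationships, so treat it as a first filter rather than a final selection.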
Hi, this repo has "demo_dataset.csv". If you could investigate the dataset, it would be highly appreciated. https://github.com/sunone5/feature_selection – sunone5 Oct 20 '22 at 05:10
@https://stackoverflow.com/users/17217956/niko - my ultimate target is to do feature engineering just before the sale prediction, but this dataset is weird for the prediction. – sunone5 Oct 20 '22 at 05:12
I think you have a severe misunderstanding of how an ML model works and how it uses data to make predictions. If you train an ML model on data with features x, y, z and a target a, you can't switch out the features used for training when making a prediction. Feature engineering can only be done on the data before model training. After training is done, an ML model takes the same structure of inputs to make a prediction. What you want to achieve is not possible. I would advise you to find out which features are most important for sales prediction and use these to train a model. – Niko Oct 20 '22 at 07:01
I can't agree. I may have severe misunderstandings about the dataset, but not about how an ML model works. I'm trying to understand this dataset. I think you didn't read my previous comments properly. "Feature engineering can only be done on the data before model training." That's what I am trying to do; if you read my previous comments and the original question, you'll see it. Were you able to investigate the dataset? Anyway, thanks for your valuable time and comments. – sunone5 Oct 20 '22 at 07:37
Well, if you are going to do feature engineering before training, and feature importance changes over time, and you train a new model on new features for each new timeframe, you will end up with completely different models whose results you can't compare accurately. Also, you still don't know what you want to predict. You simply can't train an ML model without a target to predict. Also: "my ultimate target is to do feature engineering just before the sale prediction" really sounds like you want to train model -> feature engineer -> predict. – Niko Oct 20 '22 at 07:46
Yes, completely different models whose results may not be accurately comparable: I agree on this point. But regardless of comparability, I'm focusing on understanding this dataset even though it doesn't have a target variable defined in the data itself, as you can see from the column headers. # Feature 1 date_time object # Feature 22 to 40 date_time object # Feature 21 is Country/Region – sunone5 Oct 20 '22 at 08:06
How do you want to understand the data if you don't know what the features are? I looked at your data, but it's not something you can really understand if you don't know what the data represents. For example, the first features are easy to figure out: D,Company_ID,Company_Name,Firstname,Surname,Address,Postcode,Phone. After that you have a block of data, a block of timestamped data and another block of data. The problem is that, with the exception of the named features and the timestamped block, the data has already been scaled/normalized. – Niko Oct 20 '22 at 09:10
So you basically have around 160 normalized/scaled features where you don't know what they mean or represent. You have no way of figuring out what these data points are or represent. All you can do is look at the relationships between the features. This, however, will be severely skewed, because how will you ever know what exactly you are comparing? You could calculate the correlation over the dataset and the change of correlation between the features over time. But what will that result achieve for you? – Niko Oct 20 '22 at 09:15
Another problem is that, since the data is already normalized/scaled and you don't know how it was scaled/normalized, you will never figure out the real values behind it. So the only basis you have is, again, the relationships between the features. But if you don't know the target, you also won't know which feature affects the target the most, because you are missing that data. If, e.g., you find out that feature 145 is hugely correlated with feature 87, what does that result yield for you? – Niko Oct 20 '22 at 09:18
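The "change of correlation between the features over time" mentioned above can be sketched as follows. This is only an illustration under assumptions: the rows are taken to be already sorted by timestamp, the split point is chosen arbitrarily, and the column names `f1` to `f3` and their values are made up. As the comments point out, a large shift shows that a relationship between two features changed, not that either feature matters for sales.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def correlation_shift(columns, split):
    """For each feature pair, |corr over rows[split:] - corr over rows[:split]|."""
    shifts = {}
    for a, b in combinations(sorted(columns), 2):
        early = pearson(columns[a][:split], columns[b][:split])
        late = pearson(columns[a][split:], columns[b][split:])
        shifts[(a, b)] = abs(late - early)
    return shifts

# Hypothetical columns; rows are assumed to be in time order.
columns = {
    "f1": [1, 2, 3, 4, 1, 2, 3, 4],
    "f2": [2, 4, 6, 8, 8, 6, 4, 2],  # tracks f1 in the early window, mirrors it later
    "f3": [1, 2, 3, 4, 2, 4, 6, 8],  # tracks f1 in both windows
}

shifts = correlation_shift(columns, split=4)
for pair, shift in sorted(shifts.items(), key=lambda item: -item[1]):
    print(pair, round(shift, 3))
```

With real data, more than two windows (for example one per month or season) would be needed before reading anything seasonal into the shifts.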
I deliberately dropped the following features: Company_ID, Company_Name, Firstname, Surname, Address, Postcode, Phone, feature21, since I assumed these features are not relevant to sales prediction. feature1 is datetime, feature2 to feature20 are normalized/scaled features, feature22 to feature40 are datetime again, and feature41 to feature200 are normalized/scaled features. Yes, I did calculate the correlation over the dataset; you can see it with the source code in the repo. The Pearson correlation matrix showed me there are no highly correlated features to remove from the model to be trained. – sunone5 Oct 20 '22 at 09:45
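For reference, the correlation-matrix check described in the last comment amounts to something like the sketch below: greedily keep a feature only if it is not highly correlated with any feature already kept. The 0.9 threshold and the toy values are assumptions for illustration, not numbers from demo_dataset.csv. When no pair exceeds the threshold, as reported above, every feature is kept, so this step can remove redundancy but cannot by itself rank 200 features down to 30.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def prune_correlated(columns, threshold=0.9):
    """Keep each feature unless |corr| with an already kept feature exceeds threshold."""
    kept = []
    for name in columns:  # dicts preserve insertion order
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical scaled columns; feature3 is an exact multiple of feature2.
columns = {
    "feature2": [0.1, 0.4, 0.3, 0.9, 0.5],
    "feature3": [0.2, 0.8, 0.6, 1.8, 1.0],
    "feature4": [0.9, 0.1, 0.8, 0.2, 0.7],
}

kept = prune_correlated(columns, threshold=0.9)
print(kept)
```

The result depends on column order, since the first feature of a correlated pair is the one that survives.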
