How do you apply hypothesis testing to the features of an ML model? Let's say, for example, that I am doing a regression task and I want to cut some features (once I have trained my model) to improve performance. How do I apply hypothesis testing to decide whether a feature is useful or not? I am a bit confused about what my null hypothesis would be, what level of significance to use, and how to run the experiment to get the p-value of a feature (I have heard that a level of significance of 0.15 is a good threshold, but I am not sure).

For example, I am doing a regression task to predict the cost of my factory from the production of three machines (A, B, C). I fit a linear regression to the data and find that the p-value of machine A is greater than my level of significance; hence it is not statistically significant, and I decide to discard that feature from my model.

I have taken this example from a YouTube video; the link is below.

The relevant bit runs from minute 4:00 to 7:00: https://www.youtube.com/watch?v=HgfHefwK7VQ

I have tried reading about it, but I haven't been able to understand how he decided that level of significance and how he applied hypothesis testing in this case.

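The data looks something like this:

    import pandas as pd

    # Every column must have the same number of rows; the original 'C' list had
    # a sixth value (901), dropped here so the DataFrame can be built.
    d = {'Cost': [44439, 43936, 44464, 41533, 46343],
         'A': [515, 929, 800, 979, 1165],
         'B': [541, 710, 675, 1147, 939],
         'C': [928, 711, 824, 758, 635]}

    df = pd.DataFrame(data=d)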

After the model has been fit, the weights are as follows:

Bias weight: 35102, Machine A: 2.066, Machine B: 4.17, Machine C: 4.79

Now, the issue is that the p-value for Machine A is 0.23, which was considered too high, and therefore this feature was excluded from the predictive model.
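For reference, here is a rough sketch of the test that usually sits behind a per-feature p-value in linear regression. I am assuming the video uses the standard t-test on each coefficient, which I have not confirmed. For each weight the null hypothesis is that the true coefficient is zero (the feature adds nothing once the others are in the model), and the p-value is the two-sided tail probability of the t-statistic `coefficient / standard_error` with `n - k` degrees of freedom. The data below is synthetic (made up so that A has no real effect), and `ols_p_values` is a hypothetical helper name, not something from the video or any library.

```python
import numpy as np
from scipy import stats


def ols_p_values(X, y):
    """Fit y = b0 + X @ b by least squares.

    Returns (coefficients, two-sided p-values), intercept first.
    Each p-value tests the null hypothesis that its coefficient is zero.
    """
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])           # add the bias column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)   # least-squares weights
    residuals = y - design @ beta
    dof = n - design.shape[1]                           # n minus number of weights
    sigma2 = residuals @ residuals / dof                # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)     # covariance of the weights
    std_err = np.sqrt(np.diag(cov))
    t_stats = beta / std_err
    p_values = 2 * stats.t.sf(np.abs(t_stats), dof)     # two-sided tail probability
    return beta, p_values


# Synthetic data (an assumption for illustration): A has zero true effect.
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(500, 1200, size=(n, 3))                 # columns: A, B, C
y = 35000 + 0.0 * X[:, 0] + 4.0 * X[:, 1] + 5.0 * X[:, 2] + rng.normal(0, 500, n)

beta, p_values = ols_p_values(X, y)
for name, b, p in zip(["bias", "A", "B", "C"], beta, p_values):
    print(f"{name}: coef={b:.3f}, p={p:.4f}")
```

A feature is then "not significant" at level alpha when its p-value is above alpha. The choice of alpha (0.05, 0.15, ...) is a convention rather than something the test produces. As far as I know, the `pvalues` attribute of a fitted statsmodels `OLS` result reports the same quantities.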

  • This sounds like more of a statistics question than a programming question. The good folks over at [Cross Validated](https://stats.stackexchange.com/) should be able to help you out. – A. S. K. Jul 04 '19 at 21:13
  • This is a great question, and well-suited for stats.stackexchange.com. Bear in mind that statistical significance is not practical significance and therefore statistical significance is a misleading indicator for what variables to choose; for large enough sample size, any nonzero association will result in a variable being marked as statistically significant. My advice is to build all 2^n models with subsets of n variables, and rank them by cross validation score. – Robert Dodier Jul 05 '19 at 16:20
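
The all-subsets idea from the last comment could be sketched roughly like this. This is only an illustration under my own assumptions: plain least squares, 5-fold cross-validation scored by mean squared error, and synthetic data in which A has no real effect. `cv_mse` and `rank_subsets` are hypothetical helper names.

```python
from itertools import combinations

import numpy as np


def cv_mse(X, y, n_folds=5, seed=0):
    """Mean squared error of a least-squares fit, averaged over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for test_idx in np.array_split(idx, n_folds):
        train_idx = np.setdiff1d(idx, test_idx)
        train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        beta, *_ = np.linalg.lstsq(train, y[train_idx], rcond=None)
        errors.append(np.mean((y[test_idx] - test @ beta) ** 2))
    return float(np.mean(errors))


def rank_subsets(X, y, names):
    """Score every subset of features (2^n models), best (lowest MSE) first."""
    results = []
    for k in range(len(names) + 1):
        for subset in combinations(range(len(names)), k):
            cols = np.array(subset, dtype=int)           # empty subset = bias only
            score = cv_mse(X[:, cols], y)
            results.append((score, tuple(names[i] for i in subset)))
    return sorted(results)


# Synthetic data (an assumption for illustration): A has zero true effect.
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(500, 1200, size=(n, 3))                  # columns: A, B, C
y = 35000 + 0.0 * X[:, 0] + 4.0 * X[:, 1] + 5.0 * X[:, 2] + rng.normal(0, 500, n)

ranking = rank_subsets(X, y, ["A", "B", "C"])
for score, subset in ranking:
    print(f"{subset}: CV MSE = {score:.0f}")
```

With three features this is only 8 models, but the count doubles with every added feature, so the exhaustive search is only practical for a small number of features.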
