
Has anyone used optimization methods on top of fitted sklearn models?

What I'd like to do is fit a model on training data and then, using that model, find the combination of input feature values for which the model predicts the largest value.

Here is some simplified example code:

import pandas as pd

df = pd.DataFrame({
    'temperature': [10, 15, 30, 20, 25, 30],
    'working_hours': [10, 12, 12, 10, 30, 15],
    'sales': [4, 7, 6, 7.3, 10, 8]
})

from sklearn.ensemble import RandomForestRegressor

# Fit a random forest on the two features to predict sales
model = RandomForestRegressor()
X = df.drop(['sales'], axis=1)
y = df['sales']
model.fit(X, y)

Our baseline is a simple loop that predicts over every combination of the two variables:

import numpy as np

results = pd.DataFrame(columns=['temperature', 'working_hours', 'sales_predicted'])
for temp in np.arange(1, 100.01, 1):
    for work_hours in np.arange(1, 60.01, 1):
        # Predict sales for this single (temperature, working_hours) combination
        prediction = model.predict(np.array([temp, work_hours]).reshape(1, -1))
        results = pd.concat([
            results,
            pd.DataFrame({
                'temperature': temp,
                'working_hours': work_hours,
                'sales_predicted': prediction
            })
        ])

print(results.sort_values(by='sales_predicted', ascending=False))

Using this approach it's difficult or impossible to:

* make it fast (it's brute force)
* implement constraints that involve a dependency between two or more variables
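For concreteness, by a constraint coupling two variables we mean something like the purely hypothetical requirement that working_hours may not exceed temperature / 2. In the brute-force loop, the only option is to filter every grid point, along these lines (a sketch only, with a made-up constraint):

# Illustrative sketch: skip grid points violating a hypothetical joint
# constraint (working_hours <= temperature / 2) before predicting.
feasible = []
for temp in np.arange(1, 100.01, 1):
    for work_hours in np.arange(1, 60.01, 1):
        if work_hours > temp / 2:
            continue  # combination violates the made-up constraint
        pred = model.predict(np.array([[temp, work_hours]]))[0]
        feasible.append({'temperature': temp,
                         'working_hours': work_hours,
                         'sales_predicted': pred})

feasible = pd.DataFrame(feasible)
print(feasible.sort_values(by='sales_predicted', ascending=False).head())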

We tried the PuLP and Pyomo libraries, but neither allows passing model.predict as the objective function; we get an error like:

TypeError: float() argument must be a string or a number, not 'LpVariable'

Does anyone have an idea how we can get rid of the loop and use some other approach?

michlimes
  • The keywords are black-box optimization or gradient-free optimization. There is too much to say and it's not really something for Stack Overflow. All your candidates are unsuited for this, as their assumptions do not hold (random examples, not mapped to specific candidates: differentiable, continuous, convex). In black-box optimization there are lots of approaches, Bayesian, surrogate losses and all that stuff... but to be honest: in your use case grid search, random search or some bandit-based random search is very competitive. Hyperparameter tuning would be one more keyword to google. – sascha Nov 22 '19 at 21:59
  • Above focuses on the "vs. brute-force" question. If all you want is constraint-based filtering of your grid-search (loops), this is either easy to filter out (for simple things) in the loop or you might go for sat-solving / constraint-programming techniques, where at least the former has lots of theory in terms of uniform solution-sampling. But this rapidly goes towards research stuff. – sascha Nov 22 '19 at 22:02
  • Did you find any solutions? I have the same problem – Saeed Nov 02 '21 at 19:41
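As a minimal sketch of the gradient-free / black-box direction mentioned in the comments above, one could wrap model.predict in a scalar objective and hand it to a derivative-free optimizer such as scipy.optimize.differential_evolution. The objective name and the bounds below are assumptions that simply mirror the grid ranges used earlier:

from scipy.optimize import differential_evolution
import numpy as np

def negative_predicted_sales(x):
    # x = [temperature, working_hours]; negate because the optimizer minimizes
    return -model.predict(np.array(x).reshape(1, -1))[0]

bounds = [(1, 100), (1, 60)]  # assumed search ranges, matching the grid above
result = differential_evolution(negative_predicted_sales, bounds, seed=0)

print('best temperature, working_hours:', result.x)
print('predicted sales:', -result.fun)

Note that a random forest's prediction surface is piecewise constant, so population-based or random-search methods tend to cope with it better than gradient-based solvers would.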

1 Answer


When people talk about optimizing fitted sklearn models, they usually mean maximizing accuracy or other performance metrics. But if what you want is to maximize the predicted value itself, you can definitely make your code more efficient, as shown below.

You are collecting all the predictions in a big results dataframe and then sorting it. Instead, you can track the running maximum of your target variable (sales_predicted) on the fly with a simple if check. Just change your loop to this:

max_sales_predicted = 0

for temp in np.arange(1, 100.01, 1):
    for work_hours in np.arange(1, 60.01, 1):
        # predict() returns a length-1 array; take the scalar value
        sales_predicted = model.predict(np.array([temp, work_hours]).reshape(1, -1))[0]
        if sales_predicted > max_sales_predicted:
            max_sales_predicted = sales_predicted
            desired_temp = temp
            desired_work_hours = work_hours

This way you only update the running maximum when a combination produces a prediction that exceeds it, and do nothing otherwise.

The result of my code is the same as yours, i.e. a max_sales_predicted value of 9.2. Also, desired_temp and desired_work_hours now give you the combination that produces that maximum. Hope this helps.
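If the double loop itself is too slow, a further sketch (assuming the same model and the same grid ranges as above) is to build the grid once and predict it in a single vectorized call:

import numpy as np
import pandas as pd

# Build the full grid once and predict it in a single call
temps = np.arange(1, 100.01, 1)
hours = np.arange(1, 60.01, 1)
grid = pd.DataFrame(
    [(t, h) for t in temps for h in hours],
    columns=['temperature', 'working_hours']
)

grid['sales_predicted'] = model.predict(grid[['temperature', 'working_hours']])
best = grid.loc[grid['sales_predicted'].idxmax()]
print(best)

A single predict call over the ~6,000 grid rows is typically much faster than 6,000 separate calls, since the per-call overhead dominates for tiny inputs.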

FatihAkici