3

I'm trying to optimize a column of data in a Pandas dataframe. I've looked through past posts but couldn't find one that addressed the issue of optimizing values in a column in a dataframe. This is my first post and relatively new to coding so apologizes upfront. Below is the code I'm using

from pandas import DataFrame
import numpy as np
from pulp import *

heading = [184, 153, 140, 122, 119]
df = DataFrame (heading, columns=['heading'])
df['speed'] = 50 
df['ratio'] = df.speed/df.heading
conditions = [
    (df['ratio'] < 0.1),
    (df['ratio'] >= 0.1 ) & (df['ratio'] < 0.2),
    (df['ratio'] >= 0.2 ) & (df['ratio'] < 0.3),
    (df['ratio'] >= 0.3 ) & (df['ratio'] < 0.4),
    (df['ratio'] > 0.4 )]
choices = [3, 1, 8, 5, 2]
df['choice'] = np.select(conditions, choices)
df['final_column'] = df.choice * df.heading

print(np.sum(df.final_column))

I use np.select to search through 'conditions' and return the appropriate 'choices'. This is functioning like a vlookup I use in excel.

I'm trying to get PuLP or any other appropriate optimization tool or maybe even just a loop to find the optimal values for df.speed (which I start with temporary value of 50) to maximize the sum of values in the 'final_column.' Below is the code I've tried but its not working.

prob = LpProblem("Optimal Values",LpMaximize)
speed_vars = LpVariable("Variable",df.speed,lowBound=0,cat='Integer')
prob += lpSum(df.new_column_final)
prob.solve()

Below is the error I'm getting:

speed_vars = LpVariable("Variable",df.speed,lowBound=0,cat='Integer') TypeError: init() got multiple values for argument 'lowBound'

Thanks so much for your help. Any help would be appreciated!

FlettnerRocks
  • 49
  • 1
  • 4
  • what do you mean by 'optimize a column of data'? – gtomer Sep 15 '20 at 10:42
  • sorry - i mean i'd like the data/value in each row of the the column to be the best/optimal choice to make the row the highest value and the sum of all of the rows to be the highest possible value. – FlettnerRocks Sep 16 '20 at 09:26
  • Can you provide a [mcve] - basically a toy problem with all of the data required in the code in your question where the answer is obvious. This makes it easier for people to answer and you can expand/apply the method to your larger/more complex problem. – kabdulla Sep 18 '20 at 06:18
  • To add, please show us a sample of current data and desired results which can better illustrate your issue. Since human language is not always precise, numbers tend to be more helpful than words in text. Finally, do not assume we now your domain like PuLP. – Parfait Sep 19 '20 at 10:52
  • thanks kabdulla and Parfait - simplified and edited accordingly – FlettnerRocks Sep 19 '20 at 15:31

1 Answers1

3

First of all the specific error message you are getting: TypeError: __init__() got multiple values for argument 'lowBound'

In python when calling a function you can pass arguments either by 'position' - which means the order in which you pass the arguments tells the function what each of them is - or by naming them. If you look up the documentation for the pulp.LpVariable method you'll see the second position argument is 'lowbound' which you then also pass as a named argument - hence the error message.

I think you might also be slighly misunderstanding how a dataframe works. It is not like excel where you set a 'formula' in a column and it stays updated to that formula as other elements on that row change. You can assign values to columns but if the input data change - the cell would only be updated if that bit of code was run again.

In terms of solving your problem - I'm not convinced I've understood what you're trying to do but I've understood the following.

  • We want to select values of df['speed'] to maximise the sum-product of heading and choices columns
  • The value of the choices column depends on the ratio of speed to heading (as per the given 5 ranges)
  • Heading column is fixed

By inspection the optimum will be achieved by setting all of the speeds so that the ratios are in the [0.2 - 0.3] range, and where they fall in that range doesn't matter. Code to do this in PuLP within pandas dataframes below. It relised on using binary variables to keep track of which range the ratios fall in.

The syntax is a little awkward though - I'd recommend doing the optimisation completely outside of dataframes and just loading results in at the end - using the LpVariable.dicts method to create arrays of variables instead.

from pandas import DataFrame
import numpy as np
from pulp import *

headings = [184.0, 153.0, 140.0, 122.0, 119.0]
df = DataFrame (headings, columns=['heading'])
df['speed'] = 50
max_speed = 500.0
max_ratio = max_speed / np.min(headings)
df['ratio'] = df.speed/df.heading
conditions_lb = [0,   0.1, 0.2, 0.3, 0.4]
conditions_ub = [0.1, 0.2, 0.3, 0.4, max_speed / np.min(headings)]
choices = [3, 1, 8, 5, 2]
n_range = len(choices)
n_rows = len(df)

# Create primary ratio variables - one for each variable:
df['speed_vars'] = [LpVariable("speed_"+str(j)) for j in range(n_rows)]

# Create auxilary variables - binaries to control
# which bit of range each speed is in
df['aux_vars'] = [[LpVariable("aux_"+str(i)+"_"+str(j), cat='Binary')
                   for i in range(n_range)]
                   for j in range(n_rows)]

# Declare problem
prob = LpProblem("max_pd_column",LpMaximize)

# Define objective function
prob += lpSum([df['aux_vars'][j][i]*choices[i]*headings[j] for i in range(n_range)
               for j in range(n_rows)])

# Constrain only one range to be selected for each row
for j in range(n_rows):
    prob += lpSum([df['aux_vars'][j][i] for i in range(n_range)]) == 1

# Constrain the value of the speed by the ratio range selected
for j in range(n_rows):
    for i in range(n_range):
        prob += df['speed_vars'][j]*(1.0/df['heading'][j]) <= \
                        conditions_ub[i] + (1-df['aux_vars'][j][i])*max_ratio
        prob += df['speed_vars'][j]*(1.0/df['heading'][j]) >= \
                        conditions_lb[i]*df['aux_vars'][j][i]

# Solve problem and print results
prob.solve()

# Dislay the optimums of each var in problem
for v in prob.variables ():
    print (v.name, "=", v.varValue)

# Set values in dataframe and print:
df['speed_opt'] = [df['speed_vars'][j].varValue for j in range(n_rows)]
df['ratio_opt'] = df.speed_opt/df.heading
print(df)

The last bit of which prints out:

   heading speed_vars                                    b  spd_opt  rat_opt
0    184.0    speed_0  [b_0_0, b_1_0, b_2_0, b_3_0, b_4_0]     36.8      0.2
1    153.0    speed_1  [b_0_1, b_1_1, b_2_1, b_3_1, b_4_1]     30.6      0.2
2    140.0    speed_2  [b_0_2, b_1_2, b_2_2, b_3_2, b_4_2]     28.0      0.2
3    122.0    speed_3  [b_0_3, b_1_3, b_2_3, b_3_3, b_4_3]     24.4      0.2
4    119.0    speed_4  [b_0_4, b_1_4, b_2_4, b_3_4, b_4_4]     23.8      0.2
halfer
  • 19,824
  • 17
  • 99
  • 186
kabdulla
  • 5,199
  • 3
  • 17
  • 30