What does OR-Tools' CP-SAT favor the first variable?

Question

I have a function that solves the Stigler Diet problem using the CP-SAT solver (instead of the linear solver) and minimizes error (instead of minimizing the quantity of food).

Initially, I was trying to use AddLinearConstraint to specify the minimum and maximum allowable amount of each nutrient, but the solver returned that this problem was "Infeasible". So, I just used the minimum bound and printed which nutrient was out of bounds.

For some reason, the solver chooses a solution whose first variable is far larger than the other variables, no matter what variable (nutrient) is first. For example, the allowable amount of Calcium is between 1,300 mg and 3,000 mg, but it choses a solution of 365,136 mg. If I change the first variable to Carbohydrates, then the new solution's Carbohydrate value is similarly out of proportion, while the other variables (including Calcium) are within the allowable bounds.

Why does the solver favor the first variable? If I can understand this, then I think I should be able to figure out how to get all variables within the bounds.

Below is the essential part of my program. Full working code is here: https://github.com/TravisDart/nutritionally-complete-foods

# "nutritional_requirements" and "foods" are passed in from CSV files after some preprocessing.
def solve_it(nutritional_requirements, foods):
    model = cp_model.CpModel()

    quantity_of_food = [
        model.NewIntVar(0, MAX_NUMBER * NUMBER_SCALE, food[2]) for food in foods
    ]
    error_for_quantity = [
        model.NewIntVar(0, MAX_NUMBER * NUMBER_SCALE, f"Abs {food[2]}") for food in foods
    ]

    for i, nutrient in enumerate(nutritional_requirements):
        model.Add(
            sum([food[i + FOOD_OFFSET][0] * quantity_of_food[j] for j, food in enumerate(foods)]) > nutrient[1][0]
        )
        model.AddAbsEquality(
            target=error_for_quantity[i],
            expr=sum([food[i + FOOD_OFFSET][0] * quantity_of_food[j] for j, food in enumerate(foods)]) - nutrient[1][0],
        )

    model.Minimize(sum(error_for_quantity))

    solver = cp_model.CpSolver()
    # The solution printer displays the nutrient that is out of bounds.
    solution_printer = VarArraySolutionPrinter(quantity_of_food, nutritional_requirements, foods)

    status = solver.Solve(model, solution_printer)
    outcomes = [
        "UNKNOWN",
        "MODEL_INVALID",
        "FEASIBLE",
        "INFEASIBLE",
        "OPTIMAL",
    ]
    print(outcomes[status])

Please print a working example. – Laurent Perron Aug 24 '23 at 07:40 — Laurent Perron, Aug 24 '23 at 07:40
I have added a link to the working example. – Travis Aug 25 '23 at 05:15 — Travis, Aug 25 '23 at 05:15

Laurent Perron · Answer 1 · 2023-08-27T20:00:54.323

there are 2 issues:

Data is inconsistent

for Copper:

daily recommended value is :

['Copper', (890000, 'mcg'), (8000000, 'mcg')]

Food nutrient values are:

name = Acorn stew (Apache) intake=(0, 'mg')
name = Agave, cooked (Southwest) intake=(1, 'mg')
name = Agave, dried (Southwest) intake=(2, 'mg')
name = Agave, raw (Southwest) intake=(1, 'mg')
name = Blackberries, wild, raw (Alaska Native) intake=(2, 'mg')
name = Blueberries, wild, frozen (Alaska Native) intake=(0, None)
name = Blueberries, wild, raw (Alaska Native) intake=(0, 'mg')
name = Bread, blue corn, somiviki (Hopi) intake=(1, 'mg')
name = Bread, kneel down (Navajo) intake=(1, 'mg')
...

so a factor of 1000 in units.

There are other uses of mcg:

Copper,890 mcg,8000 mcg,mcg
Fluoride,3000 mcg,10000 mcg,mcg
Folate,400 mcg,800 mcg,mcg 
Manganese,2200 mcg,9000 mcg,mcg
Selenium,55 mcg,400 mcg,mcg
Vitamin A,900 mcg,2800 mcg,mcg 
Vitamin B12,2.4 mcg,,mcg
Vitamin D,15 mcg,100 mcg,mcg
Vitamin K,75 mcg,,mcg

If fixed the scaling of units:

def real_data():
    # Nutrient requirements.
    nutrients = []
    units = []
    with open("Daily Recommended Values.csv") as csvfile:
        csvwreader = csv.reader(csvfile)
        next(csvwreader)  # Skip the header
        for row in csvwreader:
            parsed_row = [
                row[0],
                (
                    int(float(row[1].split(" ")[0]) * NUMBER_SCALE) if row[1] else 0,
                    row[1].split(" ")[1] if len(row[1].split(" ")) > 1 else None,
                ),
                (
                    int(float(row[2].split(" ")[0]) * NUMBER_SCALE) if row[2] else MAX_NUMBER * NUMBER_SCALE,
                    row[2].split(" ")[1] if len(row[2].split(" ")) > 1 else None,
                ),
            ]
            if parsed_row[2] is not None:
                try:
                    assert parsed_row[1] < parsed_row[2]
                except AssertionError as e:
                    print(parsed_row)
                    raise
            nutrients += [parsed_row]
            units.append(row[3].split(" ")[0])

    # Food nutrition information
    foods = []
    with open("Foods.csv") as csvfile:
        # creating a csv writer object
        csvwreader = csv.reader(csvfile)
        food_header = next(csvwreader)
        for row in csvwreader:
            # This is actually not necessary, as the input file is in the correct format already.
            parsed_row = [
                *row[:4],
                *[
                    (
                        int(float(x.split(" ")[0]) * NUMBER_SCALE),
                        x.split(" ")[1] if len(x.split(" ")) > 1 else None,
                    )
                    for x in row[4:]
                ],
            ]

            # Checks units and fix quantities if needed:
            for i, required in enumerate(nutrients):
                unit = units[i]
                if not row[4 + i]:
                    continue
                data = row[4 + i].split(" ")
                if len(data) == 1:
                    continue
                food_unit = data[1]
                if unit != food_unit:
                    if unit == 'mcg' and food_unit == 'µg':
                        continue
                    if unit == 'mcg' and food_unit == "mg":
                        print(parsed_row[4 + i])
                        parsed_row[4 + i] = (parsed_row[4 + i][0] * 1000, parsed_row[4 + i][1])
                        print(parsed_row[4 + i])
                        continue
                    print(f'mismatch: nutrient={nutrients[i][0]} unit=\'{unit}\' food_unit=\'{food_unit}\' row={row[4 + i]}')

            foods.append(parsed_row)

The callback code is wrong as it does not match the quantity values with the right index.

Here is the correct code:

class VarArraySolutionPrinter(cp_model.CpSolverSolutionCallback):
    def __init__(self, variables, nutritional_requirements, foods):
        cp_model.CpSolverSolutionCallback.__init__(self)
        self.__variables = variables
        self.__nutritional_requirements = nutritional_requirements
        self.__foods = foods

    def on_solution_callback(self):
        for i, nutrient in enumerate(self.__nutritional_requirements):
            the_sum = sum(
                [
                    food[i + FOOD_OFFSET][0] * self.Value(self.__variables[j])
                    for j, food in enumerate(self.__foods)
                ]
            )
            if the_sum > nutrient[2][0]:  # If a nutrient is greater than the max value.
                nutrient_quantity = [
                    nutrient[0],
                    nutrient[1][0],
                    the_sum,
                    nutrient[2][0],
                ]
                print(
                    f"A solution has been found, but the value for the nutrient {nutrient[0]} is too large. "
                    f"It should be between {nutrient[1][0]} and {nutrient[2][0]}, but it's {the_sum}."
                )

And implemented the error minimization:

NUMBER_SCALE = 1_000  # CP-SAT only does integers, so scale to use 3 decimal places.
FOOD_OFFSET = 4  # The first 4 columns of the food data are labels.
MAX_NUMBER = 50000000000
MAX_WEIGHT_IN_GRAMS = 10000


def solve_it(nutritional_requirements, foods):
    model = cp_model.CpModel()

    quantity_of_food = [
        model.NewIntVar(0, MAX_WEIGHT_IN_GRAMS, food[2]) for food in foods
    ]
    error_for_quantity = [
        model.NewIntVar(0, MAX_NUMBER, f"error {food[2]}") for food in foods
    ]

    for i, nutrient in enumerate(nutritional_requirements):
        nutrient_intake = sum(food[i + FOOD_OFFSET][0] * quantity_of_food[j] for j, food in enumerate(foods))
        model.AddLinearConstraint(nutrient_intake, nutrient[1][0], nutrient[2][0])
        model.Add(error_for_quantity[i] == nutrient_intake - nutrient[1][0])  # No need for abs().

    model.Minimize(sum(error_for_quantity))

    solver = cp_model.CpSolver()
    solver.parameters.log_search_progress = True
    solver.parameters.num_workers = 24
    # The solution callback displays the nutrients that are out of bounds.
    solution_checker = VarArraySolutionPrinter(quantity_of_food, nutritional_requirements, foods)
    solver.Solve(model, solution_checker)

Side node:

you should really use pandas instead of reading the csv by yourself. Scaling up is trivial with pandas.

Bah, I was sure I'd double checked that.... In addition to these two issues you mentioned (the unit mismatch for Copper and the solution printer's incorrect calculation of the error quantity), the food-value columns were in a different order than the nutritional-requirement rows. Fixing all this, the solver behaves as expected. Thanks for your help. — Travis, Aug 29 '23 at 04:41

VonC · Answer 2 · 2023-08-27T20:10:26.400

I see you have used constraints to express that certain conditions must be met by any solution the solver produces.

But you have enforced only lower bounds for each nutrient with this line:

model.Add(
    sum([food[i + FOOD_OFFSET][0] * quantity_of_food[j] for j, food in enumerate(foods)]) > nutrient[1][0]
)

That constraint make sure that the total quantity of each nutrient must be greater than some lower bound nutrient[1][0]. However, there are no upper bounds set on each nutrient, meaning the solver is free to select values that exceed any reasonable level: that could explain the problem you have observed.

To include both lower and upper bounds, you can use two separate constraints for each nutrient. That way, you can specify that the quantity of each nutrient must be both greater than or equal to a lower bound and less than or equal to an upper bound.

# Lower bound constraint
model.Add(
    sum([food[i + FOOD_OFFSET][0] * quantity_of_food[j] for j, food in enumerate(foods)]) >= nutrient[1][0]
)

# Upper bound constraint
model.Add(
    sum([food[i + FOOD_OFFSET][0] * quantity_of_food[j] for j, food in enumerate(foods)]) <= nutrient[1][1]
)

This assumes nutrient[1][0] and nutrient[1][1] represent the lower and upper bounds for each nutrient, respectively.

By setting both of these constraints, you are effectively limiting the range of each nutrient to between nutrient[1][0] and nutrient[1][1]. That should help prevent the solver from choosing nutrient values that are excessively high.

The ultimate goal is to bound it above, but as I mentioned, when I use AddLinearConstraint (or two constraints, as you suggest), the solver says the problem is "infeasible". This is because the first variable is too big.

While each constraint might seem reasonable on its own, they could collectively create an impossible scenario. Infeasibility might not be caused by a single constraint but rather by the interaction of multiple constraints.

The objective of minimizing the error term (sum(error_for_quantity)) may also interfere with the feasibility of the solution. It might be driving one of the variables (the first one, in your case) to an extreme value.

Numerical instability can also result from the scale of your variables or coefficients, leading to erroneous infeasibility conclusions.

To debug this issue, use verbose logging options to trace the solver's decisions. This might provide hints as to why the first variable is being pushed to a large value.

And try to temporarily remove or relax some of the constraints to see if the problem remains infeasible. This can help identify which constraint(s) are causing the issue.

The ultimate goal is to bound it above, but as I mentioned, when I use `AddLinearConstraint` (or two constraints, as you suggest), the solver says the problem is "infeasible". This is because the first variable is too big. My goal with this SO question is to understand why the solver is doing that so maybe I can get it to stop. — Travis, Aug 27 '23 at 01:40

What does OR-Tools' CP-SAT favor the first variable?

2 Answers2

Data is inconsistent

The callback code is wrong as it does not match the quantity values with the right index.