I am currently working on a project where I have a large DataFrame (500,000 rows) in which each row is a polygon representing a geographical area. The columns of the DataFrame are landcover classes (34 classes), and the cell values are the area covered by each landcover class within that polygon, in square kilometers.
My objective is to subsample this DataFrame based on target requirements for landcover classes. Specifically, I want to select a subset of polygons that collectively meet certain target coverage requirements for each landcover class. The target requirements are specified as the desired area coverage for each landcover class.
A colleague hinted that this could be interpreted as an optimisation problem with an objective function. However, I have not found a solution along those lines yet and instead tried a different, slow, iterative approach (see below).
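Just to make that hint concrete, I believe the formulation would look roughly like the sketch below, shown on the same toy numbers as the example introduced further down (this uses PuLP purely as an illustration; the variable names are mine, and I have not managed to apply anything like this to the real 500,000-row data):

import pulp

# Toy data: one list of class areas per polygon, same numbers as the example below
areas = {
    'Polygon A': [10, 15, 20],
    'Polygon B': [5, 8, 12],
    'Polygon C': [7, 4, 9],
    'Polygon D': [3, 6, 14],
}
targets = [15, 20, 25]  # desired coverage per landcover class

prob = pulp.LpProblem("polygon_subsample", pulp.LpMinimize)

# One binary decision variable per polygon: 1 = keep it in the subsample
keep = {name: pulp.LpVariable(f"keep_{i}", cat="Binary")
        for i, name in enumerate(areas)}

# One auxiliary variable per class holding the absolute deviation from its target
dev = [pulp.LpVariable(f"dev_{j}", lowBound=0) for j in range(len(targets))]

# Objective: minimise the summed deviation over all classes
prob += pulp.lpSum(dev)

for j, target in enumerate(targets):
    coverage = pulp.lpSum(areas[name][j] * keep[name] for name in areas)
    # dev[j] >= |coverage - target|, expressed as two linear constraints
    prob += coverage - target <= dev[j]
    prob += target - coverage <= dev[j]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([name for name in areas if keep[name].value() > 0.5])  # on this toy data: ['Polygon A', 'Polygon C']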
To give you a better understanding, here is a minimal reproducible example of my DataFrame structure with only 4 polygons and 3 classes:
import pandas as pd
# Create a sample DataFrame
data = {
    'Polygon': ['Polygon A', 'Polygon B', 'Polygon C', 'Polygon D'],
    'Landcover 1': [10, 5, 7, 3],
    'Landcover 2': [15, 8, 4, 6],
    'Landcover 3': [20, 12, 9, 14]
}
df = pd.DataFrame(data)
For instance, let's say I have the following target requirements for landcover classes:
target_requirements = {
    'Landcover 1': 15,
    'Landcover 2': 20,
    'Landcover 3': 25
}
Based on these target requirements, I would like to subsample the DataFrame by selecting a subset of polygons that collectively meets or closely approximates the target area coverage for each landcover class. In this example, polygons A and C form a good subsample, as their landcover coverages summed together come close to the requirements I set.
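Just to make the arithmetic explicit, summing A and C gives (17, 19, 29) against the targets (15, 20, 25):

subset = df[df['Polygon'].isin(['Polygon A', 'Polygon C'])]
print(subset[['Landcover 1', 'Landcover 2', 'Landcover 3']].sum())
# Landcover 1    17
# Landcover 2    19
# Landcover 3    29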
My [extended] code so far
Here is what I have coded so far. You will see some extra steps implemented here:
- Weights: used to guide the selection of polygons based on the current deficits and surpluses per class (a small illustration follows after this list).
- Random sampling of the top 0.5%: based on the weights, I keep the top 0.5% of polygons and randomly pick one from this selection.
- Tolerance: I set a tolerance for the discrepancy between the cumulative areas of the current subsample and the target requirements.
- Progress bar: aesthetic.
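To illustrate the weighting on the toy DataFrame above, this is roughly what the first iteration computes (a standalone sketch, not part of the actual script; the intermediate variables reuse the names from the code below):

landcover_columns = ['Landcover 1', 'Landcover 2', 'Landcover 3']
target_coverages = pd.Series(target_requirements)
cumulative_coverages = pd.Series(0, index=landcover_columns)  # nothing selected yet

remaining_diff = target_coverages - cumulative_coverages
deficit = remaining_diff.clip(lower=0)        # classes still missing area
surplus = remaining_diff.clip(upper=0) * 0.1  # classes already overshot (dampened penalty)
normalized_weights = deficit / deficit.sum()

# Each polygon's weight reflects how much it contributes to the classes still in deficit
weights = df[landcover_columns].mul(normalized_weights) + surplus
weight_sum = weights.sum(axis=1)
print(weight_sum)  # Polygon A gets the highest weight on this first iteration

The full script: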
import numpy as np
import pandas as pd
from tqdm import tqdm
def select_polygons(row, cumulative_coverages, landcover_columns, target_coverages):
    selected_polygon = row[landcover_columns]
    # Wrap the selected polygon as a one-row DataFrame so it can be appended to the subsample
    subsample = selected_polygon.to_frame().T
    cumulative_coverages += selected_polygon.values
    return cumulative_coverages, subsample
df_data = ...            # Your DataFrame with polygons and landcover classes
landcover_columns = ...  # List of landcover columns in the DataFrame
target_coverages = ...   # Dictionary of target coverages for each landcover class
total_coverages = df_data[landcover_columns].sum()
target_coverages = pd.Series(target_coverages, landcover_columns)
df_data = df_data.sample(frac=1).dropna().reset_index(drop=True)
# Set parameters for convergence
max_iterations = 30000
convergence_threshold = 0.1
top_percentage = 0.005
# Initialize variables
subsample = pd.DataFrame(columns=landcover_columns)
cumulative_coverages = pd.Series(0, index=landcover_columns)
# Initialize tqdm progress bar
progress_bar = tqdm(total=max_iterations)
# Iterate until the cumulative coverage matches or is close to the target coverage
for iteration in range(max_iterations):
    remaining_diff = target_coverages - cumulative_coverages
    deficit = remaining_diff.clip(lower=0)
    surplus = remaining_diff.clip(upper=0) * 0.1
    deficit_sum = deficit.sum()
    normalized_weights = deficit / deficit_sum
    # Calculate the combined weights for deficit and surplus for the entire dataset
    weights = df_data[landcover_columns].mul(normalized_weights) + surplus
    # Calculate the weight sum for each polygon
    weight_sum = weights.sum(axis=1)
    # Select the top 0.5% of polygons based on weight sum
    top_percentile = int(len(df_data) * top_percentage)
    top_indices = weight_sum.nlargest(top_percentile).index
    selected_polygon_index = np.random.choice(top_indices)
    selected_polygon = df_data.loc[selected_polygon_index]
    cumulative_coverages, subsample_iteration = select_polygons(
        selected_polygon, cumulative_coverages, landcover_columns, target_coverages
    )
    # Add the selected polygon to the subsample (DataFrame.append was removed in pandas 2.0)
    subsample = pd.concat([subsample, subsample_iteration])
    df_data = df_data.drop(selected_polygon_index)
    # Stop when all polygons have been selected or the cumulative coverage is close to the target
    if df_data.empty or np.allclose(cumulative_coverages, target_coverages, rtol=convergence_threshold):
        break
    # Calculate the percentage of coverage achieved
    coverage_percentage = (cumulative_coverages.sum() / target_coverages.sum()) * 100
    # Update the tqdm progress bar
    progress_bar.set_description(f"Iteration {iteration+1}: Coverage Percentage: {coverage_percentage:.2f}%")
    progress_bar.update(1)
progress_bar.close()
subsample.reset_index(drop=True, inplace=True)
The problem
The code is slow (about 10 iterations/s) and doesn't handle the tolerance well, i.e. I can get cumulative_coverages way above 100% while the tolerance is still not met (my "guidance for selection" is not good enough). Plus, there must be a much better OPTIMISATION approach to get what I want.
Any help/idea would be appreciated.