I am currently working on a project where I have a large DataFrame (500,000 rows) in which each row is a polygon representing a geographical area. The columns of the DataFrame are landcover classes (34 classes), and the cell values are the area covered by each landcover class within that polygon, in square kilometers.
My objective is to subsample this DataFrame based on target requirements for landcover classes. Specifically, I want to select a subset of polygons that collectively meet certain target coverage requirements for each landcover class. The target requirements are specified as the desired area coverage for each landcover class.
A colleague hinted that this could be interpreted as an optimisation problem with an objective function. However, I have not found a solution along those lines yet and instead tried a different, slow, iterative approach (see below).
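Just to make that hint concrete, I believe the formulation would look roughly like the sketch below, shown on the same toy numbers as the example introduced further down (this uses PuLP purely as an illustration; the variable names are mine, and I have not managed to apply anything like this to the real 500,000-row data):

import pulp

# Toy data: one list of class areas per polygon, same numbers as the example below
areas = {
    'Polygon A': [10, 15, 20],
    'Polygon B': [5, 8, 12],
    'Polygon C': [7, 4, 9],
    'Polygon D': [3, 6, 14],
}
targets = [15, 20, 25]  # desired coverage per landcover class

prob = pulp.LpProblem("polygon_subsample", pulp.LpMinimize)

# One binary decision variable per polygon: 1 = keep it in the subsample
keep = {name: pulp.LpVariable(f"keep_{i}", cat="Binary")
        for i, name in enumerate(areas)}

# One auxiliary variable per class holding the absolute deviation from its target
dev = [pulp.LpVariable(f"dev_{j}", lowBound=0) for j in range(len(targets))]

# Objective: minimise the summed deviation over all classes
prob += pulp.lpSum(dev)

for j, target in enumerate(targets):
    coverage = pulp.lpSum(areas[name][j] * keep[name] for name in areas)
    # dev[j] >= |coverage - target|, expressed as two linear constraints
    prob += coverage - target <= dev[j]
    prob += target - coverage <= dev[j]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([name for name in areas if keep[name].value() > 0.5])  # on this toy data: ['Polygon A', 'Polygon C']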
To give you a better understanding, here is a minimal reproducible example of my DataFrame structure with only 4 polygons and 3 classes:
import pandas as pd
# Create a sample DataFrame
data = {
    'Polygon': ['Polygon A', 'Polygon B', 'Polygon C', 'Polygon D'],
    'Landcover 1': [10, 5, 7, 3],
    'Landcover 2': [15, 8, 4, 6],
    'Landcover 3': [20, 12, 9, 14]
}
df = pd.DataFrame(data)
For instance, let's say I have the following target requirements for landcover classes:
target_requirements = {
    'Landcover 1': 15,
    'Landcover 2': 20,
    'Landcover 3': 25
}
Based on these target requirements, I would like to subsample the DataFrame by selecting a subset of polygons that collectively meets or closely approximates the target area coverage for each landcover class. In this example, polygons A and C form a good subsample, as their landcover coverages summed together come close to the requirements I set.
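Just to make the arithmetic explicit, summing A and C gives (17, 19, 29) against the targets (15, 20, 25):

subset = df[df['Polygon'].isin(['Polygon A', 'Polygon C'])]
print(subset[['Landcover 1', 'Landcover 2', 'Landcover 3']].sum())
# Landcover 1    17
# Landcover 2    19
# Landcover 3    29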
My [extended] code so far
Here is what I have coded so far. You will see some extra steps implemented here:
- Weights: used to guide the selection of polygons based on the current deficits and surpluses per class (a small illustration follows after this list).
- Random sampling of the top 0.5%: based on the weights, I keep the top 0.5% of polygons and randomly pick one from this selection.
- Tolerance: I set a tolerance for the discrepancy between the cumulative areas of the current subsample and the target requirements.
- Progress bar: aesthetic.
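To illustrate the weighting on the toy DataFrame above, this is roughly what the first iteration computes (a standalone sketch, not part of the actual script; the intermediate variables reuse the names from the code below):

landcover_columns = ['Landcover 1', 'Landcover 2', 'Landcover 3']
target_coverages = pd.Series(target_requirements)
cumulative_coverages = pd.Series(0, index=landcover_columns)  # nothing selected yet

remaining_diff = target_coverages - cumulative_coverages
deficit = remaining_diff.clip(lower=0)        # classes still missing area
surplus = remaining_diff.clip(upper=0) * 0.1  # classes already overshot (dampened penalty)
normalized_weights = deficit / deficit.sum()

# Each polygon's weight reflects how much it contributes to the classes still in deficit
weights = df[landcover_columns].mul(normalized_weights) + surplus
weight_sum = weights.sum(axis=1)
print(weight_sum)  # Polygon A gets the highest weight on this first iteration

The full script: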
import numpy as np
import pandas as pd
from tqdm import tqdm
def select_polygons(row, cumulative_coverages, landcover_columns, target_coverages):
    selected_polygon = row[landcover_columns]
    # Wrap the selected polygon as a one-row DataFrame so it can be appended to the subsample
    subsample = selected_polygon.to_frame().T
    cumulative_coverages += selected_polygon.values
    return cumulative_coverages, subsample
df_data = ...            # Your DataFrame with polygons and landcover classes
landcover_columns = ...  # List of landcover columns in the DataFrame
target_coverages = ...   # Dictionary of target coverages for each landcover class
total_coverages = df_data[landcover_columns].sum()
target_coverages = pd.Series(target_coverages, landcover_columns)
df_data = df_data.sample(frac=1).dropna().reset_index(drop=True)
# Set parameters for convergence
max_iterations = 30000
convergence_threshold = 0.1
top_percentage = 0.005
# Initialize variables
subsample = pd.DataFrame(columns=landcover_columns)
cumulative_coverages = pd.Series(0, index=landcover_columns)
# Initialize tqdm progress bar
progress_bar = tqdm(total=max_iterations)
# Iterate until the cumulative coverage matches or is close to the target coverage
for iteration in range(max_iterations):
    remaining_diff = target_coverages - cumulative_coverages
    deficit = remaining_diff.clip(lower=0)
    surplus = remaining_diff.clip(upper=0) * 0.1
    deficit_sum = deficit.sum()
    normalized_weights = deficit / deficit_sum
    # Calculate the combined weights for deficit and surplus for the entire dataset
    weights = df_data[landcover_columns].mul(normalized_weights) + surplus
    # Calculate the weight sum for each polygon
    weight_sum = weights.sum(axis=1)
    # Select the top 0.5% of polygons based on weight sum
    top_percentile = int(len(df_data) * top_percentage)
    top_indices = weight_sum.nlargest(top_percentile).index
    selected_polygon_index = np.random.choice(top_indices)
    selected_polygon = df_data.loc[selected_polygon_index]
    cumulative_coverages, subsample_iteration = select_polygons(
        selected_polygon, cumulative_coverages, landcover_columns, target_coverages
    )
    # Add the selected polygon to the subsample (DataFrame.append was removed in pandas 2.0)
    subsample = pd.concat([subsample, subsample_iteration])
    df_data = df_data.drop(selected_polygon_index)
    # Stop when all polygons have been selected or the cumulative coverage is close to the target
    if df_data.empty or np.allclose(cumulative_coverages, target_coverages, rtol=convergence_threshold):
        break
    # Calculate the percentage of coverage achieved
    coverage_percentage = (cumulative_coverages.sum() / target_coverages.sum()) * 100
    # Update the tqdm progress bar
    progress_bar.set_description(f"Iteration {iteration+1}: Coverage Percentage: {coverage_percentage:.2f}%")
    progress_bar.update(1)
progress_bar.close()
subsample.reset_index(drop=True, inplace=True)
The problem
The code is slow (about 10 iterations/s) and doesn't handle the tolerance well, i.e. I can get cumulative_coverages way above 100% while the tolerance is still not met (my "guidance for selection" is not good enough). Plus, there must be a much better OPTIMISATION approach to get what I want.
Any help/idea would be appreciated.