I have a data frame with lots of NA values. My objective is to find the combination of rows and columns to keep that retains as much data as possible while minimizing the number of NA values. One solution I found is to use the ompr package and build a MIP model to solve the problem.
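For reference, here is a small made-up data frame with the same kind of structure (the real one is much bigger; the values are purely illustrative, since the model only cares about where the NAs are):

set.seed(42)
df <- as.data.frame(matrix(sample(c(NA, 1:5), 10 * 6, replace = TRUE), nrow = 10))
M  <- as.matrix(df)  # M is the matrix the model below works on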
Here's the model:
library(ompr)
library(dplyr)  # provides the %>% pipe used below

m <- +!is.na(M) # 0/1 matrix: 0 where M is NA, 1 otherwise
nr <- nrow(m)
nc <- ncol(m)
n_years <- 17
model <- MIPModel() %>%
# keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
add_variable(keep[i,j], i = 1:nr, j = 1:nc, type = "binary") %>%
# rm_row[i] is 1 if row i is selected for removal else 0
add_variable(rm_row[i], i = 1:nr, type = "binary") %>%
# rm_col[j] is 1 if column j is selected for removal else 0
add_variable(rm_col[j], j = 1:nc, type = "binary") %>%
# maximize good cells kept
set_objective(sum_expr(keep[i,j], i = 1:nr, j = 1:nc), "max") %>%
# cell can be kept only when row is not selected for removal
add_constraint(sum_expr(keep[i,j], j = 1:nc) <= nc * (1 - rm_row[i]), i = 1:nr) %>%
# cell can be kept only when column is not selected for removal
add_constraint(sum_expr(keep[i,j], i = 1:nr) <= nr * (1 - rm_col[j]), j = 1:nc) %>%
# only non-NA values can be kept
add_constraint(m[i,j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>%
# keep at most n columns i.e. remove at least (nc - n_years) columns
add_constraint(sum_expr(rm_col[j], j = 1:nc) >= nc - n_years)
model
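For completeness, this is roughly how I solve the model and drop the selected rows/columns afterwards. I use ompr.roi with the GLPK solver here; any ROI-supported solver should work the same way:

library(ompr.roi)
library(ROI.plugin.glpk)

result <- solve_model(model, with_ROI(solver = "glpk"))

# indices of rows/columns the solver chose to remove
rows_to_drop <- get_solution(result, rm_row[i]) %>% filter(value > 0.5) %>% pull(i)
cols_to_drop <- get_solution(result, rm_col[j]) %>% filter(value > 0.5) %>% pull(j)

# subset the original matrix, guarding against the case where nothing is dropped
M_reduced <- M[setdiff(1:nr, rows_to_drop), setdiff(1:nc, cols_to_drop), drop = FALSE]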
This model works just fine as long as the data is small, which is not the case for mine (3500 rows x 180 columns).
The model has a lot of constraints (the per-cell non-NA constraint alone contributes 3500 x 180 = 630,000 rows), which is why it takes too long to solve. Is there another way to formulate this model so it has fewer constraints and solves faster?