I have a dataframe with a lot of NA values. My objective is to find the best combination of rows and columns to keep, so that the amount of retained data is maximized and the NA values are minimized. One solution I found is to use the ompr package and build a MIP model to solve the problem.
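
To make this reproducible, here is a small made-up stand-in for my data (the values are random; my real M has 3500 rows and 180 columns):

set.seed(42)
M <- matrix(rnorm(6 * 5), nrow = 6)  # 6 x 5 toy matrix
M[sample(length(M), 8)] <- NA        # sprinkle in some NAs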

Here's the model:

library(dplyr)
library(ompr)

m <- +!is.na(M)  # 0/1 numeric matrix: 0 where M is NA, 1 otherwise
nr <- nrow(m)
nc <- ncol(m)
n_years <- 17

model <- MIPModel() %>% 
  
  # keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
  add_variable(keep[i,j], i = 1:nr, j = 1:nc, type = "binary") %>% 
  
  # rm_row[i] is 1 if row i is selected for removal else 0
  add_variable(rm_row[i], i = 1:nr, type = "binary") %>% 
  
  # rm_col[j] is 1 if column j is selected for removal else 0
  add_variable(rm_col[j], j = 1:nc, type = "binary") %>% 
  
  # maximize good cells kept
  set_objective(sum_expr(keep[i,j], i = 1:nr, j = 1:nc), "max") %>% 
  
  # no cell in row i can be kept when the row is selected for removal
  add_constraint(sum_expr(keep[i,j], j = 1:nc) <= nc * (1 - rm_row[i]), i = 1:nr) %>%
  
  # no cell in column j can be kept when the column is selected for removal
  add_constraint(sum_expr(keep[i,j], i = 1:nr) <= nr * (1 - rm_col[j]), j = 1:nc) %>%
  
  # only non-NA values can be kept
  add_constraint(m[i,j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>% 
  
  # keep at most n_years columns, i.e. remove at least (nc - n_years) columns
  add_constraint(sum_expr(rm_col[j], j = 1:nc) >= nc - n_years)

model
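
For completeness, this is how I solve the model and extract the kept rows and columns (a minimal sketch; I'm assuming GLPK via ompr.roi and ROI.plugin.glpk here, but any ROI-supported solver should work):

library(ompr.roi)
library(ROI.plugin.glpk)

result <- solve_model(model, with_ROI(solver = "glpk"))

# keep the rows and columns that were NOT selected for removal
kept_rows <- result %>% get_solution(rm_row[i]) %>% filter(value < 0.5) %>% pull(i)
kept_cols <- result %>% get_solution(rm_col[j]) %>% filter(value < 0.5) %>% pull(j)

# the resulting NA-free submatrix
M_sub <- M[kept_rows, kept_cols]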

This model works just fine as long as the data is not huge, which unfortunately is not the case for me: my matrix is 3500 rows x 180 columns.

This model has a lot of constraints (the non-NA constraint alone adds one row per cell, i.e. nr * nc = 3500 * 180 = 630,000 constraints), which is why it takes so long to solve. Is there another way to formulate this model so that it has fewer constraints and solves faster?

