0

I have a data frame that contains all the possible combinations between the elements of two vectors and for each combination I have a corresponding score. I was trying to find an efficient way to find the subset of unique pairs with unique elements (i.e., an element from one vector can be found only once across all pairs) that maximizes the sum of scores corresponding to each combination. As example data, consider this df:

df = data.frame(Var1 = c("A", "B", "C"), Var2 = c("A", "C", "D"))
df = expand.grid(df$Var1, df$Var2)
df$score = c(1, 0.5, 2, 1, 0.5, 0.5, 1, 2, 1)
> df
  Var1 Var2 score
1    A    A   1.0
2    B    A   0.5
3    C    A   2.0
4    A    C   1.0
5    B    C   0.5
6    C    C   0.5
7    A    D   1.0
8    B    D   2.0
9    C    D   1.0
> 

The expected result would be:

A  C  1
B  D  2
C  A  2

Note that you can have overlap between the elements of the two vectors, but again each element from each vector should appear only once. Also, the pair A A 1 is allowed and would've been possible, but that would make it impossible then to generate the pair C A 2 which would increase the overall sum of the score. As an attempt I have used this one liner with the functionality from dplyr

df <- df %>% group_by(Var1) %>% slice(which.max(score)) %>% as.data.frame()

which produces:

> df
  Var1 Var2 score
1    A    A     1
2    B    D     2
3    C    A     2

which is close enough.. but the A from the second vector is repeated. Do you have any suggestions? Thank you in advance!

Marius
  • 990
  • 1
  • 14
  • 34
  • You have a tie and it chose the first entry, that's why it chose the `A A` – Sotos Mar 19 '19 at 15:45
  • Yeah, that I figured. That's the `which.max` behavior, but I need to overcome that, and was trying to avoid writing some recursive function to do the job. Tried also to convert it into a matrix with columns being the unique elements from one vector and rows from the others, but then I got stuck retrieving the row-column pairs maximizing the sum – Marius Mar 19 '19 at 15:47

1 Answers1

0

Well, I eventually found the solution based on the Hungarian algorithm implemented in the solve_LSAP function of the clue R package. To have it working, transform your df in a matrix like so:

df = matrix(sapply(df$score, function(x) x), nrow=length(unique(df$Var1)), ncol=length(unique(df$Var2)), dimnames = list(unique(df$Var1), unique(df$Var2)))

and apply the function

df.res = solve_LSAP(df, maximum = T)
> df.res
Optimal assignment:
1 => 2, 2 => 3, 3 => 1

and then get back the actual nodes or names

df.res = cbind(rownames(df), colnames(df)[df.res])
> df.res
     [,1] [,2]
[1,] "A"  "C" 
[2,] "B"  "D" 
[3,] "C"  "A" 
> 

Tadaaaaam!

Marius
  • 990
  • 1
  • 14
  • 34