
I have a data frame, df, with two variables, as given below. Using the code below, I want to get the matrix "mat".

This code works perfectly when length(unique(df$V1)) is 3, but it takes a very long time (>10 hours) when the number of unique V1 values is in the thousands.

Dataframe

V1   V2
1   60
1   30
1   38
1   46
2   29
2   35
2   13
2   82
3   100
3   72
3   63
3   45

Code:

#Unique V1 values
vec <- unique(df$V1)
#Count the <= comparisons for each pair of V1 groups
val <- combn(vec, 2, function(x) 
  sum(outer(df$V2[df$V1 == x[1]], df$V2[df$V1 == x[2]], `<=`)))
val
#[1]  5 14 13

#Create an empty matrix
mat <- matrix(0,length(vec), length(vec))
#Fill the lower triangle of the matrix. 
mat[lower.tri(mat)] <- val
mat

Basically, for V1=1 we want to compare all of its V2 values with all of the V2 values for V1=2 and V1=3, and then repeat the same for V1=2 and V1=3. In other words, for a given value of V1 we want to check whether its V2 values are less than the V2 values of every other V1 group. For instance, comparing the V2 values for V1=1 against those for V1=2: if a V2 value for V1=1 is less than a V2 value for V1=2, the comparison returns 1, otherwise 0. For example:

For V1=1->
( 60 > 29 : returns 0,
60 > 35 : returns 0,
60 > 13 : returns 0,
60 < 82 : returns 1,
30 > 29 : returns 0,
30 < 35 : returns 1,
30 > 13 : returns 0,
30 < 82 : returns 1,
38 > 29 : returns 0,
38 > 35 : returns 0,
38 > 13 : returns 0,
38 < 82 : returns 1,
46 > 29 : returns 0,
46 > 35 : returns 0,
46 > 13 : returns 0,
46 < 82 : returns 1) = Sum is 5 (i.e. mat[1,2])
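
For reference, this count can be reproduced directly with the same outer()/`<=` logic as the code above. A minimal sketch, using the sample values from the data frame (the vector names are just for illustration):

#Sanity check for the V1=1 vs V1=2 comparison (illustrative vector names)
v1_vals <- c(60, 30, 38, 46)   #V2 values where V1 == 1
v2_vals <- c(29, 35, 13, 82)   #V2 values where V1 == 2
sum(outer(v1_vals, v2_vals, `<=`))
#[1] 5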
vp_050

3 Answers


This should be lightning fast for this problem and not use excessive memory.

library(data.table)
setDT(df)
#One sorted vector of V2 values per V1 group
numvec <- max(df[,V1])
dl <- lapply(1:numvec, function(i) df[V1 == i, sort(V2)])
#For each ordered pair of groups (x, y), findInterval() counts, for every value
#in group y, how many values in group x are <= it; summing gives the pairwise
#count without building the full outer() comparison matrix
dmat <- CJ(x=1:numvec, y=1:numvec)[, .(z = sum(findInterval(dl[[y]],dl[[x]]))), .(x,y)]
#Reshape the long result into a square matrix
mat <- as.matrix(dcast(dmat, x~y, value.var = 'z')[, -'x'])
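
As a quick sanity check (a sketch, not part of the original answer; it rebuilds the sample data from the question), the findInterval() counts agree with the outer()-based count for one pair of groups:

#Rebuild the sample data from the question (sketch only)
df <- data.table(V1 = rep(1:3, each = 4),
                 V2 = c(60, 30, 38, 46, 29, 35, 13, 82, 100, 72, 63, 45))
g1 <- df[V1 == 1, sort(V2)]
g2 <- df[V1 == 2, sort(V2)]
#findInterval(g2, g1) gives, for each group-2 value, how many group-1 values are <= it
sum(findInterval(g2, g1))   #5
sum(outer(g1, g2, `<=`))    #5, same count via the question's outer() approach

Both equal the count of 5 from the worked example in the question.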
pseudospin
  • This is great. I'm embarrassed I didn't think of it. – Ian Campbell Dec 19 '20 at 19:01
  • Can this code be improved any further to speed up the process? Or could it be run separately for the upper triangle and the lower triangle and then merged? – vp_050 Feb 10 '21 at 19:13
  • I suppose you could do the calculation for the upper triangle and then generate the lower triangle by subtracting from the product of the lengths of the vectors. But you're only gonna get a factor of 2 at best. There's also the issue of equality which confuses things. – pseudospin Feb 10 '21 at 21:20
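
To make the triangle idea from that last comment concrete, here is a rough sketch (not from the original answer; it assumes no ties between groups and reuses the question's values for V1 = 1 and V1 = 2):

a <- c(60, 30, 38, 46)   #V2 values for V1 == 1
b <- c(29, 35, 13, 82)   #V2 values for V1 == 2
forward <- sum(outer(a, b, `<=`))   #group-1 values <= group-2 values: 5
reverse <- sum(outer(b, a, `<=`))   #group-2 values <= group-1 values: 11
ties    <- sum(outer(a, b, `==`))   #equal pairs: 0 here
#Every pair satisfies at least one of the two inequalities, and both only when
#the values are equal, so forward + reverse = length(a)*length(b) + ties.
#With no ties, the opposite triangle is simply length(a)*length(b) - forward.
stopifnot(forward + reverse == length(a) * length(b) + ties)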

I'm having trouble figuring out exactly what you want, because I don't think your matrix should be symmetric.

Perhaps this option with data.table::CJ is what you're looking for:

library(data.table)
setDT(df)
#Build all ordered pairs of V1 groups with CJ(), then count the <= comparisons for each pair
result <- df[,CJ(A = V1, B = V1,unique=TRUE)][
  ,.(sum(sapply(df[V1==A,V2],function(x)x <= df[V1==B,V2]))),by = c("A","B")]
result
   A B V1
1: 1 1 10
2: 1 2  5
3: 1 3 14
4: 2 1 11
5: 2 2 10
6: 2 3 13
7: 3 1  2
8: 3 2  3
9: 3 3 10

#matrix() fills column-wise, so column j holds the counts where A == j
mat <- matrix(result$V1, ncol = length(unique(df$V1)), nrow = length(unique(df$V1)))
#Zero out the diagonal (within-group comparisons are not needed)
diag(mat) <- 0
mat
     [,1] [,2] [,3]
[1,]    0   11    2
[2,]    5    0    3
[3,]   14   13    0
set.seed(3)
df2 <- data.table(V1 = sample(1:100,1000,TRUE), V2 = sample(10:100,1000,TRUE))
system.time(df2[,CJ(A = V1, B = V1,unique=TRUE)][
                ,.(sum(sapply(df2[V1==A,V2],function(x)x <= df2[V1==B,V2]))),by = c("A","B")])
   user  system elapsed 
118.817   1.081 119.949 
Ian Campbell
  • Thank you. This is exactly what I am looking for. However, for me "V1" has 2552 unique values and I am receiving a memory error. System configuration: 64-bit operating system, 16 GB RAM. What I have tried: a) increased the memory limit with memory.limit(size = 10000000000000). – vp_050 Dec 19 '20 at 17:58
  • Yeah, there's no way that will fit into memory. That's `6512704` combinations. You could try to split `df[,CJ(A = V1, B = V1,unique=TRUE)]` into multiple data.tables and parallelize it, but it will still take a long time. You're looking at 21 hours on a single core of my laptop assuming 10 `V2`s per `V1`. – Ian Campbell Dec 19 '20 at 18:13
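
As a rough illustration of the splitting idea in that comment (a sketch only, not from the original answer: the chunk count, parallel::mclapply and the core count are assumptions, and forked parallelism is not available on Windows):

library(data.table)
library(parallel)
#df is assumed to already be a data.table (setDT(df) as in the answer above)
grid <- df[, CJ(A = V1, B = V1, unique = TRUE)]
#Split the grid of group pairs into contiguous chunks and process them in parallel
chunks <- split(grid, cut(seq_len(nrow(grid)), 4, labels = FALSE))
res <- mclapply(chunks, function(g) {
  g[, .(sum(sapply(df[V1 == A, V2], function(x) x <= df[V1 == B, V2]))),
    by = c("A", "B")]
}, mc.cores = 4)
result <- rbindlist(res)   #same rows as the single-threaded result above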

Here's an approach that avoids outer.

#Split V2 by V1 group, then count the <= pairs for each combination of two groups
sapply(combn(split(df$V2, df$V1), 2, simplify = FALSE), function(x){
    sum(sapply(x[[1]], function(a) sum(a <= x[[2]])))
})
# [1]  5 14 13

Or

#Full matrix version: compare every group against every other group
sapply(vec, function(x) sapply(vec, function(y){
    if (x == y) {
        0
    } else {
        d1 = df$V2[df$V1 == x]
        d2 = df$V2[df$V1 == y]
        sum(sapply(d1, function(a) sum(a <= d2)))
    }
}))
#     [,1] [,2] [,3]
#[1,]    0   11    2
#[2,]    5    0    3
#[3,]   14   13    0
d.b