how to subset the rows in a data frame with coordinates based on closest location in R

Question

I have a large data frame (150000 rows) with X and Y as coordinates like df1 as follows:

df1 <- data.frame(X = c(7.48, 7.82, 8.15, 8.47, 8.80, 9.20, 9.51, 9.83, 10.13, 10.59, 7.59, 8.06, 8.39, 8.87, 9.26, 9.64, 10.09, 10.48, 10.88, 11.45), 
              Y = c(49.16, 48.78, 48.40, 48.03, 47.65, 47.24, 46.87, 46.51, 46.15, 45.73, 48.70, 48.18, 47.72, 47.20, 46.71, 46.23, 45.72, 45.24, 44.77, 44.23), 
              ID = c("B_1", "B_1", "B_1", "B_1", "B_1", "B_1", "B_1", "B_1", "B_1", "B_1", "B_1_2", "B_1_2", "B_1_2", "B_1_2", "B_1_2", "B_1_2", "B_1_2", "B_1_2", "B_1_2", "B_1_2"), 
              TI = c(191.31, 191.35, 191.39, 191.44, 191.48, 191.52, 191.56, 191.60, 191.64, 191.69, 1349.93, 1349.97, 1350.01, 1350.05, 1350.09, 1350.14, 1350.18, 1350.22, 1350.26, 1350.30))

in ID column, i have some 100-200 unique ID's and in each unique ID, i have 200-300 data points (rows)

i have another data frame like df2 as follows:

df2 <- data.frame(X = c(7.62,  8.25,  8.95,  9.71,  10.23), 
              Y = c(49.06,  48.30,  47.55,  46.77,  46.25))

now, based on each row in df2 i.e. x1 and y1, I would like to find out nearest XY in df1 with respect to a unique ID shown as:

 df3 <- 

 X1    Y1   ID1     TI1     X2     Y2    ID2     TI2     X3     Y3    ID3     TI3    X4     Y4    ID4     TI4     X5      Y5       ID5      TI5
7.48 49.16  B_1    191.31  8.15  48.40   B_1    191.39  8.80  47.65   B_1    191.48  9.51  46.87  B_1    191.56   10.13   46.15   B_1     191.64
7.59 48.70  B_1_2  1349.93 8.06  48.18   B_1_2  1349.97 8.87  47.20   B_1_2  1350.05 9.26  46.71  B_1_2  1350.09  10.09   45.72   B_1_2   1350.18

i have tried with the following code:

dist12 <- function(row){
dists <- (row[["X"]] - df2$X)^2 + (row[["Y"]]- df2$Y)^2
return(cbind(df2[which.min(dists),], distance = min(dists)))
}
df3 <- cbind(df1, do.call(rbind, lapply(1:nrow(df1), function(x) dist12(df1[x,]))))

the code is finding the minimum distance between rows by calculating the distance between rows in df1 and df2 dataframes and combining both df1 and df2. from this code, it is assigning a single XY of df2 to multiple rows in df1. but a single row (XY) in df2 can be assigned only to one of the rows in a unique ID.

looking for the code to get the expected output (df3) as presented above

thanks in advance

score 3 · Accepted Answer · answered Jan 31 '20 at 10:01

3

Here is a taste with data.table for performance.

    library(data.table)
    df1=as.data.table(df1)
    do.call(cbind,
      apply(df2,1,function(i){
        df1[,d:=(df1$X-i[1])^2+(df1$Y-i[2])^2]
        df1[df1[,.I[d==min(d)],by=ID]$V1]
      })
    )

      X     Y    ID      TI      d    X     Y    ID      TI      d    X     Y    ID      TI      d
1: 7.48 49.16   B_1  191.31 0.0296 8.15 48.40   B_1  191.39 0.0200 8.80 47.65   B_1  191.48 0.0325
2: 7.59 48.70 B_1_2 1349.93 0.1305 8.06 48.18 B_1_2 1349.97 0.0505 8.87 47.20 B_1_2 1350.05 0.1289
      X     Y    ID      TI      d     X     Y    ID      TI      d
1: 9.51 46.87   B_1  191.56 0.0500 10.13 46.15   B_1  191.64 0.0200
2: 9.26 46.71 B_1_2 1350.09 0.2061 10.09 45.72 B_1_2 1350.18 0.3005

answered Jan 31 '20 at 10:01

user2974951

9,535
1
17
24

after this, is it possible to split the dataframe into list such that the first 5 columns in one dataframe of list and next 6 to 10 columns in another dataframe of a the list and so on ?! – Kumar Feb 01 '20 at 07:11
1

@Kumar I'm not entirely sure what you mean, do you mean without using the do.call(cbind) call? That will give you a list with data frames. – user2974951 Feb 01 '20 at 10:08
Yes i am looking that it gives us a list with dataframes – Kumar Feb 03 '20 at 04:14
how the 'd' values are calculated in the given solution...... because......when I calculated in excel, I got the values as '0.172047', '0.361248' and so on...... ??? – Kumar Dec 14 '20 at 15:13

score 0 · Answer 2 · answered Jan 31 '20 at 13:13

Here is another option using Florian Privé's fast rowwise pairwise Euclidean distance between 2 matrices here:

m2 <- as.matrix(df2)

#https://stackoverflow.com/questions/59679046/speed-challenge-any-faster-method-to-calculate-distance-matrix-between-rows-of/59687204#59687204
method_XXX <- function(A, B) {
    outer(rowSums(A^2), rowSums(B^2), '+') - tcrossprod(A, 2 * B)
}

setDT(df1)[, {
    #calculate sq Euclidean distance and find min in each row using `max.col(-`
    ans <- c(.(ID=rep(ID, nrow(m2))), 
        .SD[max.col(-method_XXX(m2, as.matrix(.SD[, -"TI"])))])

    #create desired output
    cnames <- paste0(rep(c("ID", names(.SD)), nrow(m2)), 
        rep(1L:nrow(m2), each=ncol(.SD)+1L))
    setNames(as.list(c(do.call(rbind, ans))), cnames)
}, ID]

output:

      ID   ID1   X1    Y1     TI1   ID2   X2    Y2     TI2   ID3   X3    Y3     TI3
1:   B_1   B_1 7.48 49.16  191.31   B_1 8.15  48.4  191.39   B_1  8.8 47.65  191.48
2: B_1_2 B_1_2 7.59  48.7 1349.93 B_1_2 8.06 48.18 1349.97 B_1_2 8.87  47.2 1350.05
     ID4   X4    Y4     TI4   ID5    X5    Y5     TI5
1:   B_1 9.51 46.87  191.56   B_1 10.13 46.15  191.64
2: B_1_2 9.26 46.71 1350.09 B_1_2 10.09 45.72 1350.18

how to subset the rows in a data frame with coordinates based on closest location in R

2 Answers2