-1

If I have a dataframe like this:

ID GroupID X  Y
1   a      772.7778 226.5
1   a      806.5645 35.3871
1   a      925.5714 300.9286
1   b      708.0909 165.5455
1   b      630.8235 167.4118
2   a      555.3333 151.875
2   a      732.8947 462.3158

Here is the result I want to have:

ID GroupID X        Y        Distance
1   a      772.7778 226.5    NA
1   a      806.5645 35.3871  dist between((772.7778,226.5),(806.5645,35.3871))
1   a      925.5714 300.9286 dist between((925.5714,300.9286),(806.5645,35.3871))
1   b      708.0909 165.5455 NA
1   b      630.8235 167.4118 dist between((708.0909,165.5455),(630.8235,167.4118))
2   a      555.3333 151.875  NA
2   a      732.8947 462.3158 dist between((732.8947,462.3158),(555.3333,151.875))

Basically is the distance within ID and GroupID. NA here means in each subgroup (e.g. ID=1; GroupID=a) the first distance is NA. Is there anyone can help me? Thanks!!!

coco
  • 871
  • 1
  • 7
  • 15

3 Answers3

2

This is a solution with dplyr and using dist to calculate the euclidean distance:

library(dplyr)

df <- read.table(text = "
  ID  GroupID X        Y
  1   a      772.7778 226.5
  1   a      806.5645 35.3871
  1   a      925.5714 300.9286
  1   b      708.0909 165.5455
  1   b      630.8235 167.4118
  2   a      555.3333 151.875
  2   a      732.8947 462.3158", header = T, stringsAsFactors = F)

df %>%
  group_by(ID, GroupID) %>%
  mutate(rows = row_number()) %>%
  left_join(df, by = c('ID', 'GroupID')) %>%
  rowwise() %>%
  mutate(Distance = ifelse(dist(rbind(c(X.x, Y.x), c(X.y, Y.y))) != 0,
                           dist(rbind(c(X.x, Y.x), c(X.y, Y.y))),
                           NA)) %>%
  filter(rows == 1) %>%
  select(ID, GroupID, X = X.y, Y= Y.y, Distance)

##      ID GroupID        X        Y  Distance
##   <int>   <chr>    <dbl>    <dbl>     <dbl>
## 1     1       a 772.7778 226.5000        NA
## 2     1       a 806.5645  35.3871 194.07648
## 3     1       a 925.5714 300.9286 169.95735
## 4     1       b 708.0909 165.5455        NA
## 5     1       b 630.8235 167.4118  77.28994
## 6     2       a 555.3333 151.8750        NA
## 7     2       a 732.8947 462.3158 357.63325
Oriol Mirosa
  • 2,756
  • 1
  • 13
  • 15
  • Thanks! I just updated my question description, the NA here is not for each even row, it is the value for each subgroup, like id=1 with groupid = a. Can you please help me to fix this part? – coco Aug 10 '17 at 18:39
  • I made changes based on the new info. Let me know if this is still not what you want. – Oriol Mirosa Aug 10 '17 at 19:12
  • There are some wrong data in the X column, but still thanks for your help! – coco Aug 10 '17 at 22:02
  • You're right! Grabbed the wrong column by mistake. Just changed it in the last line, the `select`. Instead of making `X.x` as the `X` column, it actually is `X.y`. I think now you get what you want. – Oriol Mirosa Aug 10 '17 at 22:10
2

Never used a dist before, but here is a for loop that might work for you:

> for(i in 1:nrow(df)) {
  if(i > 1 && df$GroupID[i] == df$GroupID[i-1]) {
   df$Distance[i] <- sqrt(((df$X[i] - df$X[i-1]) ^ 2) + ((df$Y[i] - df$Y[i-1]) ^ 2))
  } else {
     df$Distance[i] <- NA
    }
  }

> df
  ID GroupID        X        Y  Distance
1  1       a 772.7778 226.5000        NA
2  1       a 806.5645  35.3871 194.07648
3  1       a 925.5714 300.9286 290.98957
4  1       b 708.0909 165.5455        NA
5  1       b 630.8235 167.4118  77.28994
6  2       a 555.3333 151.8750        NA
7  2       a 732.8947 462.3158 357.63325
Sagar
  • 2,778
  • 1
  • 8
  • 16
  • Thanks for this solution. But if is the Euclidian distance, it should be something like sqrt((df$X[i] - df$X[i-1])^2 + (df$Y[i] - df$Y[i-1)^2). And for the NA value, I tried to put it to the first distance in each subgroup, not the even number rows. Can you plz help me to change this part? – coco Aug 10 '17 at 18:36
  • @ycw, coco - My bad. I should have researched Euclidean distance. I took simple numbers and applied towards sample data frame that was posted earlier in the question. I will edit my answer, once I' ve figured out. – Sagar Aug 10 '17 at 18:42
  • @ycw, coco - The updated `loop` returns proper distance. – Sagar Aug 10 '17 at 19:28
1

Why not try something like:

Splitting the data based on a combination of the IDs, applying a distance function, and then unsplit?

splitted <- split(dat[,c("X","Y")], paste(dat$ID,dat$GroupID))

distances <- lapply(splitted, function(x) {
 if(nrow(x) > 2){ # diag() is useless for <= 2x2 matrix
   c(NA,diag(as.matrix(dist(x))[,-1]))
 } else {
   c(NA,dist(x)[1])
 }
})

dat$distances <- unsplit(distances, paste(dat$ID,dat$GroupID))

dat
  ID GroupID        X        Y distances
1  1       a 772.7778 226.5000        NA
2  1       a 806.5645  35.3871 194.07648
3  1       a 925.5714 300.9286 290.98957
4  1       b 708.0909 165.5455        NA
5  1       b 630.8235 167.4118  77.28994
6  2       a 555.3333 151.8750        NA
7  2       a 732.8947 462.3158 357.63325

side note: if each group is over 10k rows, dist will get slow.

Evan Friedland
  • 3,062
  • 1
  • 11
  • 25