How to loop distance calculations for multiple instances using dplyr in R

Question

I have positional data, an example of which is shown below, where time is the time each position was recorded, ref is the reference of each point, x is the x coordinate for each point and y is the y coordinate for each point.

    > print(df)
   time ref     x     y
1     1   1 92.80 49.58
2     1   2 90.20 96.02
3     1   3 91.61 80.05
4     1   4 68.75 20.56
5     1   5  5.53 35.27
6     1   6 39.85 85.39
7     1   7 12.04 87.43
8     1   8 42.98 56.53
9     1   9 19.14 63.56
10    1  10 25.72  7.62
11    2   1 50.39  7.16
12    2   2 17.71  7.15
13    2   3 52.96 34.87
14    2   4 52.70 97.07
15    2   5 70.88 44.88
16    2   6 32.12 71.82
17    2   7 24.15 22.77
18    2   8 18.06 31.03
19    2   9 70.55 92.42
20    2  10 45.05 79.67

The steps I want to take are as follows (steps 1 to 4 are completed successfully):

1. replicate the x and y coordinates multiple times with small errors
2. calculate the distance between every point at each instant of time
3. calculate the sum of these 45 distances for each instant of time
4. repeat this process across all the different iterations I created in step 1
5. create a new dataframe with all this information in it

step 1.

set.seed(456) #set seed to get consistent results

n <- 3 # this is 3 for this example but would likely be 1000 or 10000 and refers to the number of simulations


for(i in seq(5,(2*n+3),2)){ #create simulations of the xy data set
  df[,i] = df[,3] + rnorm(length(df[,2]),0,1) #replicates the x column 
  df[,i+1] = df[,4] + rnorm(length(df[,3]),0,1) # replicates the y column
}

This code works, is easily adjustable, and gives me the following df. The first 4 columns are exactly the same as above. V5 and V6 are the x and y coordinates for simulation 1, which differ from the original x and y by a small error (you can see how similar the values are). V7 and V8 are x and y for simulation 2, and V9 and V10 are x and y for simulation 3.

 print(df)
   time ref     x     y        V5        V6        V7        V8        V9       V10
1     1   1 92.80 49.58 91.456479 49.105396 92.771058 47.325290 91.720518 49.698151
2     1   2 90.20 96.02 90.821776 94.302691 90.593037 95.037940 89.758626 96.889903
3     1   3 91.61 80.05 92.410875 78.623170 91.360386 79.849432 93.630635 79.958064
4     1   4 68.75 20.56 67.361108 20.768236 68.833450 21.455930 68.822856 20.628899
5     1   5  5.53 35.27  4.815643 35.234164  7.608875 35.226455  6.238817 33.587573
6     1   6 39.85 85.39 39.525939 86.524285 39.970852 87.037308 40.700509 86.506956
7     1   7 12.04 87.43 12.730643 86.967145 12.158149 88.993299 10.553803 86.078642
8     1   8 42.98 56.53 43.230548 56.201616 43.750054 55.098622 43.900530 55.992833
9     1   9 19.14 63.56 20.147352 65.044539 17.964598 63.015406 19.288329 63.189886
10    1  10 25.72  7.62 26.293235  6.530622 26.129039  6.848746 25.483132  7.974012
11    2   1 50.39  7.16 49.474189  6.631206 49.725049  6.990012 49.916764  6.350175
12    2   2 17.71  7.15 19.021097  6.556207 17.453475  7.109238 17.040794  6.970275
13    2   3 52.96 34.87 53.948726 32.871084 53.638782 33.149460 54.318527 33.722340
14    2   4 52.70 97.07 54.353929 97.366153 53.596845 98.514106 54.112918 97.166242
15    2   5 70.88 44.88 69.439195 45.050625 71.498356 44.859985 70.147226 45.694700
16    2   6 32.12 71.82 34.067356 73.635652 32.851454 72.090232 32.039448 72.802941
17    2   7 24.15 22.77 25.886936 22.109397 23.736825 22.657066 24.960197 23.620843
18    2   8 18.06 31.03 18.447483 30.889748 19.617813 30.175112 18.562588 32.237347
19    2   9 70.55 92.42 72.830034 91.996021 71.091699 91.386259 71.674023 90.986222
20    2  10 45.05 79.67 46.587883 79.631264 45.627150 79.892027 44.878720 78.569054

step 2

I have written code using dplyr that groups the data by time and then calculates the distance between each pair of reference points (this code is shown in step 3). There are 10 reference points, which results in 45 distances to be calculated (10 choose 2).
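As a cross-check on the count and on the sum, here is a minimal sketch using base R's `dist()`, which returns each unordered pair of rows exactly once. The three points in `pts` are made up for illustration (they are not from the data above) and were chosen so the distances are easy to verify by hand.

```r
# Number of unordered pairs among 10 reference points
n_pairs <- choose(10, 2)

# Toy coordinates (hypothetical): pairwise distances are 5, 4 and 3
pts <- cbind(x = c(0, 3, 0), y = c(0, 4, 4))

# dist() computes Euclidean distances between all pairs of rows,
# each pair once, so no division by 2 is needed
pair_dists <- dist(pts)
total <- sum(pair_dists)

n_pairs
total
```

For one instant of time in the real data, the same call on the x and y columns of that time's 10 rows should yield 45 distances.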

step 3

For each group of time, I want to calculate the sum of all 45 distances. Steps 2 and 3 are in the following code, which has been made into a function:

sumdist = function(data) {
  names(data)[3]<-paste("x") #renames 3rd column x to assist for loop
  names(data)[4]<-paste("y") #renames 4th column y to assist for loop
  data = data %>% 
    group_by(time) %>% 
    mutate(dist1 = sqrt((x[which(ref == 1)] - x)^2 + (y[which(ref == 1)] - y)^2)) %>% #distance between all points and point 1
    mutate(dist2 = sqrt((x[which(ref == 2)] - x)^2 + (y[which(ref == 2)] - y)^2)) %>% #distance between all points and point 2
    mutate(dist3 = sqrt((x[which(ref == 3)] - x)^2 + (y[which(ref == 3)] - y)^2)) %>% #distance between all points and point 3
    mutate(dist4 = sqrt((x[which(ref == 4)] - x)^2 + (y[which(ref == 4)] - y)^2)) %>% #distance between all points and point 4
    mutate(dist5 = sqrt((x[which(ref == 5)] - x)^2 + (y[which(ref == 5)] - y)^2)) %>% #distance between all points and point 5
    mutate(dist6 = sqrt((x[which(ref == 6)] - x)^2 + (y[which(ref == 6)] - y)^2)) %>% #distance between all points and point 6
    mutate(dist7 = sqrt((x[which(ref == 7)] - x)^2 + (y[which(ref == 7)] - y)^2)) %>% #distance between all points and point 7
    mutate(dist8 = sqrt((x[which(ref == 8)] - x)^2 + (y[which(ref == 8)] - y)^2)) %>% #distance between all points and point 8
    mutate(dist9 = sqrt((x[which(ref == 9)] - x)^2 + (y[which(ref == 9)] - y)^2)) %>% #distance between all points and point 9
    mutate(dist10 = sqrt((x[which(ref == 10)] - x)^2 + (y[which(ref == 10)] - y)^2)) %>% #distance between all points and point 10
    summarise(sumdistances = (sum(dist1,dist2,dist3,dist4,dist5,dist6,dist7,dist8,dist9,dist10))/2) #sum of all distances
  print(data$sumdistances)
}

Running this function on my df uses only the original x and y (columns 3 and 4), but it works. The result is a vector of length 2: the first value is for time 1 and the second is for time 2.

> sumdist(df) # this calculates it from the original x and y 
[1] 2706.592 2275.045

step 4

I now want to repeat this across the multiple iterations I created earlier. For my actual data set, n will be in the thousands so I need to automate this process

sumd = matrix(NA, nrow=2, ncol=n+1) # set collection matrix for nrow = number of time and #ncol = number simulations

for(i in 1:(n+1)) {
  datas = df[,c(1,2,((1+2*i)),(2+(2*i))),] # extracts the time, ref along with x and y for each simulations
  sumd[i] = sumdist(datas) # runs function on each simulated data set
}

Because my function prints the calculated data at the end, running the code demonstrates that it has calculated what I want it to:

> for(i in 1:(n+1)) {
+   datas = df[,c(1,2,((1+2*i)),(2+(2*i))),] # extracts the time, ref along with x and y for each simulations
+   sumd[i] = sumdist(datas) # runs function on each simulated data set
+ }
[1] 2706.592 2275.045
[1] 2695.796 2282.284
[1] 2713.277 2288.517
[1] 2719.587 2273.316

The bottom 4 rows are what I want to calculate, although not quite in this layout.

Ideally it should look more like this:

 time       V2       V3       V4       V5
1    1 2706.592 2695.796 2713.277 2719.587
2    2 2275.045 2282.284 2288.517 2273.316

Step 5

But half of my matrix still contains NA, and it is filled like this:

> print(sumd)
         [,1]     [,2] [,3] [,4]
[1,] 2706.592 2713.277   NA   NA
[2,] 2695.796 2719.587   NA   NA

and the warnings I receive are these:

Warning messages:
1: In sumd[i] <- sumdist(datas) :
  number of items to replace is not a multiple of replacement length
2: In sumd[i] <- sumdist(datas) :
  number of items to replace is not a multiple of replacement length
3: In sumd[i] <- sumdist(datas) :
  number of items to replace is not a multiple of replacement length
4: In sumd[i] <- sumdist(datas) :
  number of items to replace is not a multiple of replacement length

It seems straightforward what has gone wrong: the matrix I have created does not fit the output. I have tried altering the matrix in several ways so that it does fit, but I consistently receive the warning and ultimately can't seem to obtain a matrix or data frame with the information I want.

Edit - I now understand the error in my initial code that prevents it from working, which is naturally quite simple: sumd[i] should read sumd[,i].
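The difference between the two forms of indexing can be seen in a small sketch. `fake_sumdist()` is a hypothetical stand-in that returns one value per time point, as `sumdist()` does. Indexing a matrix with a single number addresses one cell in column-major order, so only the first element of each result is stored and the cells fill down the first column; indexing with `[, i]` assigns the whole vector to column `i`.

```r
# Hypothetical stand-in for sumdist(): one value per time point
fake_sumdist <- function(i) c(100 + i, 200 + i)

n_times <- 2
n_sims <- 3

# sumd[i]: single-cell indexing, keeps only the first value of each result
bad <- matrix(NA_real_, nrow = n_times, ncol = n_sims)
for (i in seq_len(n_sims)) {
  suppressWarnings(bad[i] <- fake_sumdist(i))
}

# sumd[, i]: column indexing, stores the full length-2 result
good <- matrix(NA_real_, nrow = n_times, ncol = n_sims)
for (i in seq_len(n_sims)) {
  good[, i] <- fake_sumdist(i)
}

bad
good
```

This mirrors the half-filled matrix printed above: the values that did get stored are the time-1 results of successive simulations, written cell by cell.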

The error message is quite clear: you are trying to put an object of some length into an object of a different length. Anyway, `for` loops are almost never the way to go in R. I'll try to suggest something else. Could you explain a little how you got to `xy1=2706.59`? Are you 100% sure of this result? From your algorithm, it is not clear which points you are trying to measure the distance between. — Dan Chaltiel, Jun 16 '20 at 12:04

Dan Chaltiel · Accepted Answer · 2020-06-18T12:43:49.680

OK, after your edit I realized that I was misunderstanding your problem.

I think the problem with your design is that you want to create the columns in advance. Obviously, they cannot have a proper name, which makes identifying x and y a bit difficult.

Here is my suggestion: add the Gaussian noise and calculate the sum on the fly.

First, let's recreate the dataframe (you could share this code or some dput output next time, it makes helping much easier).

library(tidyverse)
df = read.table(header=TRUE, text="
time ref     x     y
1     1   1 92.80 49.58
2     1   2 90.20 96.02
3     1   3 91.61 80.05
4     1   4 68.75 20.56
5     1   5  5.53 35.27
6     1   6 39.85 85.39
7     1   7 12.04 87.43
8     1   8 42.98 56.53
9     1   9 19.14 63.56
10    1  10 25.72  7.62
11    2   1 50.39  7.16
12    2   2 17.71  7.15
13    2   3 52.96 34.87
14    2   4 52.70 97.07
15    2   5 70.88 44.88
16    2   6 32.12 71.82
17    2   7 24.15 22.77
18    2   8 18.06 31.03
19    2   9 70.55 92.42
20    2  10 45.05 79.67")

Then, let's rewrite the distance calculation, as I found your code a bit redundant. Programming rule of thumb: DRY (don't repeat yourself). If you repeat a structure more than 3 times, you should probably write a function.

options(dplyr.summarise.inform=FALSE) #silence summarise()'s informational messages
distance = function(x1,x2,y1,y2) sqrt(((x2-x1)^2)+((y2-y1)^2))
distance2 = function(x,y,.pred) distance(x, x[.pred], y, y[.pred])    
distance_sum = function(x, y, ref){
    dists = map(1:10, ~distance2(x,y, which(ref == .x)))
    sum(unlist(dists))/2
}

Here, I could reproduce your results on x and y:

df %>% 
    group_by(time) %>% 
    summarise(sum=distance_sum(x, y, ref))
#> # A tibble: 2 x 2
#>    time   sum
#>   <int> <dbl>
#> 1     1 2707.
#> 2     2 2275.

Finally, we can replicate this a certain number of times, adding some random noise beforehand. Again, the resulting values are identical to yours.

set.seed(456)
n <- 3 #or 10000
xx = rerun(n, {
    df %>% 
        mutate(x=x+rnorm(length(x),0,1), 
               y=y+rnorm(length(y),0,1)) %>% 
        group_by(time) %>% 
        summarise(sum=distance_sum(x, y, ref)) %>% 
        as.data.frame() #needed for the precision in the example, you can drop this line
})
xx
#> [[1]]
#>   time      sum
#> 1    1 2695.796
#> 2    2 2282.284
#> 
#> [[2]]
#>   time      sum
#> 1    1 2713.277
#> 2    2 2288.517
#> 
#> [[3]]
#>   time      sum
#> 1    1 2719.587
#> 2    2 2273.316

You can then rbind the list and calculate some statistics on it:

xx %>% #this was run with n=25
    reduce(rbind) %>% 
    group_by(time) %>% 
    summarise(sum_m=mean(sum), sum_sd=sd(sum))
#> # A tibble: 2 x 3
#>    time sum_m sum_sd
#>   <int> <dbl>  <dbl>
#> 1     1 2711.   22.2
#> 2     2 2280.   16.8


Created on 2020-06-18 by the reprex package (v0.3.0)
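As far as I know, `rerun()` has been deprecated since purrr 1.0.0, so on a current installation the code above may emit a deprecation warning. Below is a hedged base R sketch of the same idea that also returns the wide layout asked for in the question (one row per time, one column per simulation). The helper `sim_once()` and the data frame `toy` are made up for illustration; it swaps `distance_sum()` for `sum(dist(...))`, which sums each pair of points once.

```r
# Toy data (hypothetical): 3 reference points at each of 2 times
toy <- data.frame(
  time = rep(1:2, each = 3),
  ref  = rep(1:3, times = 2),
  x    = c(0, 3, 0, 0, 1, 0),
  y    = c(0, 4, 4, 0, 0, 1)
)

# One simulation: add Gaussian noise, then sum pairwise distances per time
sim_once <- function(data, sd = 1) {
  noisy <- transform(
    data,
    x = x + rnorm(nrow(data), 0, sd),
    y = y + rnorm(nrow(data), 0, sd)
  )
  by_time <- split(noisy[c("x", "y")], noisy$time)
  sapply(by_time, function(p) sum(dist(as.matrix(p))))
}

set.seed(456)
n <- 3 # or 10000

# replicate() binds the per-time vectors into a (times x simulations) matrix
res <- replicate(n, sim_once(toy))

# With sd = 0 no noise is added, giving the sums for the original coordinates
exact <- sim_once(toy, sd = 0)

res
exact
```

Calling `as.data.frame(res)` afterwards would turn the matrix into a data frame with one column per simulation.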

I have rewritten the question entirely; hopefully it is clearer now. I felt the original was explained well, but from your answer I evidently was not clear. If it is still not clear, I will try again. From your answer I don't think you understood that I had multiple iterations of my x and y coordinates, which I hope is now explained in step 1. My sumdist function is right: hand calculations agreed to the 5th decimal place and aligned with normative data, so I have kept it in, but I am happy to change to more efficient code and to remove the for loops if you can find a solution. — Stavrum, Jun 18 '20 at 11:06
@Stavrum ok I totally misunderstood your problem, so I totally rewrote my answer. I hope it helps. — Dan Chaltiel, Jun 18 '20 at 12:44
Thanks for the help, this has worked, but are you able to explain the following line of code `dists = map(1:10, ~distance2(x,y, which(ref == .x)))` . My understanding is the `~` and the `.x` are doing the work of identifying each reference point against all other reference points at each instant, although correct me if I am wrong. What I don't understand is that when I play about with the code, it only works with `.x` and not `.y` or `.z` for example. — Stavrum, Jun 19 '20 at 17:23
@Stavrum This is how `purrr::map` (true heir of `sapply`) works: the first argument is a list/vector to iterate on, and the second one is a function. `purrr` can use `lambda-functions`, which are one-sided formulas (starting with `~`), where the current element of the iteration is written `.x`. I could have written `dists = map(1:10, function(.x) {distance2(x,y, which(ref == .x))})` with the same effect, but I find `lambda-functions` quite neat. The value of `.x` is successively 1, 2, ..., 10. — Dan Chaltiel, Jun 20 '20 at 07:43
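A small sketch of the point made in the comment above, using toy inputs rather than the distance functions. In a purrr formula lambda only a few names are bound: `.x` (or `.`) is the current element of the first input, and `.y` is the current element of the second input, which exists only in two-input functions such as `map2()`. A name like `.z` is never bound, which would explain why it fails.

```r
library(purrr)

# Formula lambda and explicit function give the same result
squares_formula  <- map(c(1, 2, 3), ~ .x^2)
squares_function <- map(c(1, 2, 3), function(.x) .x^2)
same <- identical(squares_formula, squares_function)

# .y is only available when a second input is supplied, e.g. with map2()
sums <- map2_dbl(c(1, 2, 3), c(4, 5, 6), ~ .x + .y)

same
sums
```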

score 0 · Answer 2 · answered Jun 18 '20 at 17:58

library(tidyverse)

df <- tibble(
  ref = rep(c(1, 2, 3), each = 5),
  x = rnorm(15, 10, 8),
  y = rnorm(15, 35, 20)
)

# Number of created points
n <- 3

# Putting x and y as point
df <- df %>%
  mutate(point = map2(x, y, c)) 

# Adding noise to point
new_points <- seq_len(n)
names(new_points) <- new_points %>% str_c("point_", .)
new_cols <- new_points %>%
  map(~list(rnorm(15), rnorm(15)) %>% transpose() %>% map(unlist)) %>%
  map(~map2(.x, df$point, ~.x+.y)) %>%
  as_tibble()

# Binding new points 
df <- df %>%
  bind_cols(new_cols)

# Functions for calculating Euclidean distance of point list
dList <- function(a, b)
  b %>% 
    map_dbl(~(a - .x)^2 %>% sum() %>% sqrt())
sumDistanceList <- function(l)
  seq_len(length(l) - 1) %>%
    map(~dList(l[[.x]], l[(.x+1):length(l)])) %>%
    unlist() %>%
    sum()

# Summarise
df %>%
  group_by(ref) %>%
  summarise(across(str_subset(names(.), "point_"), sumDistanceList))
