I have positional data, an example of which is shown below, where time is the time each position was recorded, ref is the reference of each point, x is the x coordinate of each point and y is the y coordinate of each point.
> print(df)
time ref x y
1 1 1 92.80 49.58
2 1 2 90.20 96.02
3 1 3 91.61 80.05
4 1 4 68.75 20.56
5 1 5 5.53 35.27
6 1 6 39.85 85.39
7 1 7 12.04 87.43
8 1 8 42.98 56.53
9 1 9 19.14 63.56
10 1 10 25.72 7.62
11 2 1 50.39 7.16
12 2 2 17.71 7.15
13 2 3 52.96 34.87
14 2 4 52.70 97.07
15 2 5 70.88 44.88
16 2 6 32.12 71.82
17 2 7 24.15 22.77
18 2 8 18.06 31.03
19 2 9 70.55 92.42
20 2 10 45.05 79.67
The steps I want to take are as follows (steps 1 to 4 are completed successfully):
- replicate the x and y coordinates multiple times with small errors
- calculate the distance between every point at each instant of time
- calculate the sum of these 45 distances for each instance of time
- repeat this process across all the different iterations I created in step 1
- create a new dataframe with all this information on it
step 1.
set.seed(456) # set seed to get consistent results
n <- 3 # number of simulations; 3 for this example, but likely 1000 or 10000 in practice
for (i in seq(5, 2*n + 3, 2)) { # create simulations of the xy data set
  df[, i] = df[, 3] + rnorm(nrow(df), 0, 1) # replicates the x column with added noise
  df[, i+1] = df[, 4] + rnorm(nrow(df), 0, 1) # replicates the y column with added noise
}
This code works, is easily adjustable and gives me the following df. The first 4 columns are exactly the same as above. V5 and V6 are the x and y coordinates for n=1, which have a small error relative to the original x and y (you can see how similar these values are). V7 and V8 are x and y for n=2, and V9 and V10 are x and y for n=3.
print(df)
time ref x y V5 V6 V7 V8 V9 V10
1 1 1 92.80 49.58 91.456479 49.105396 92.771058 47.325290 91.720518 49.698151
2 1 2 90.20 96.02 90.821776 94.302691 90.593037 95.037940 89.758626 96.889903
3 1 3 91.61 80.05 92.410875 78.623170 91.360386 79.849432 93.630635 79.958064
4 1 4 68.75 20.56 67.361108 20.768236 68.833450 21.455930 68.822856 20.628899
5 1 5 5.53 35.27 4.815643 35.234164 7.608875 35.226455 6.238817 33.587573
6 1 6 39.85 85.39 39.525939 86.524285 39.970852 87.037308 40.700509 86.506956
7 1 7 12.04 87.43 12.730643 86.967145 12.158149 88.993299 10.553803 86.078642
8 1 8 42.98 56.53 43.230548 56.201616 43.750054 55.098622 43.900530 55.992833
9 1 9 19.14 63.56 20.147352 65.044539 17.964598 63.015406 19.288329 63.189886
10 1 10 25.72 7.62 26.293235 6.530622 26.129039 6.848746 25.483132 7.974012
11 2 1 50.39 7.16 49.474189 6.631206 49.725049 6.990012 49.916764 6.350175
12 2 2 17.71 7.15 19.021097 6.556207 17.453475 7.109238 17.040794 6.970275
13 2 3 52.96 34.87 53.948726 32.871084 53.638782 33.149460 54.318527 33.722340
14 2 4 52.70 97.07 54.353929 97.366153 53.596845 98.514106 54.112918 97.166242
15 2 5 70.88 44.88 69.439195 45.050625 71.498356 44.859985 70.147226 45.694700
16 2 6 32.12 71.82 34.067356 73.635652 32.851454 72.090232 32.039448 72.802941
17 2 7 24.15 22.77 25.886936 22.109397 23.736825 22.657066 24.960197 23.620843
18 2 8 18.06 31.03 18.447483 30.889748 19.617813 30.175112 18.562588 32.237347
19 2 9 70.55 92.42 72.830034 91.996021 71.091699 91.386259 71.674023 90.986222
20 2 10 45.05 79.67 46.587883 79.631264 45.627150 79.892027 44.878720 78.569054
step 2
I have created code using dplyr which groups the data by time and then calculates the distance between each pair of reference points (this code is shown in step 3). There are 10 reference points, which result in 45 distances to be calculated (10 choose 2).
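As a quick check on the pair count, here is a minimal sketch using base R's choose() and dist() on the time 1 coordinates from the example data above. It only illustrates the "10 choose 2 = 45" arithmetic and is not the dplyr code described in this step.

```r
# x and y for time 1, copied from the example data above
x <- c(92.80, 90.20, 91.61, 68.75, 5.53, 39.85, 12.04, 42.98, 19.14, 25.72)
y <- c(49.58, 96.02, 80.05, 20.56, 35.27, 85.39, 87.43, 56.53, 63.56, 7.62)

n_pairs <- choose(length(x), 2)   # number of unordered pairs: 10 choose 2 = 45

# dist() returns the lower triangle of the Euclidean distance matrix,
# i.e. one entry per unordered pair of points
d <- dist(cbind(x, y))

length(d)   # 45, one distance per pair
sum(d)      # total of all pairwise distances for this time point
```

Because dist() stores each pair only once, its sum needs no division by 2.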
step 3
For each group of time, I want to calculate the sum of all 45 distances. Steps 2 and 3 are in the following code, which has been made into a function.
library(dplyr) # provides %>%, group_by, mutate and summarise
sumdist = function(data) {
  names(data)[3] <- "x" # renames 3rd column to x so the function works on any simulated x column
  names(data)[4] <- "y" # renames 4th column to y so the function works on any simulated y column
  data = data %>%
    group_by(time) %>%
    mutate(dist1 = sqrt((x[which(ref == 1)] - x)^2 + (y[which(ref == 1)] - y)^2)) %>% # distance between all points and point 1
    mutate(dist2 = sqrt((x[which(ref == 2)] - x)^2 + (y[which(ref == 2)] - y)^2)) %>% # distance between all points and point 2
    mutate(dist3 = sqrt((x[which(ref == 3)] - x)^2 + (y[which(ref == 3)] - y)^2)) %>% # distance between all points and point 3
    mutate(dist4 = sqrt((x[which(ref == 4)] - x)^2 + (y[which(ref == 4)] - y)^2)) %>% # distance between all points and point 4
    mutate(dist5 = sqrt((x[which(ref == 5)] - x)^2 + (y[which(ref == 5)] - y)^2)) %>% # distance between all points and point 5
    mutate(dist6 = sqrt((x[which(ref == 6)] - x)^2 + (y[which(ref == 6)] - y)^2)) %>% # distance between all points and point 6
    mutate(dist7 = sqrt((x[which(ref == 7)] - x)^2 + (y[which(ref == 7)] - y)^2)) %>% # distance between all points and point 7
    mutate(dist8 = sqrt((x[which(ref == 8)] - x)^2 + (y[which(ref == 8)] - y)^2)) %>% # distance between all points and point 8
    mutate(dist9 = sqrt((x[which(ref == 9)] - x)^2 + (y[which(ref == 9)] - y)^2)) %>% # distance between all points and point 9
    mutate(dist10 = sqrt((x[which(ref == 10)] - x)^2 + (y[which(ref == 10)] - y)^2)) %>% # distance between all points and point 10
    summarise(sumdistances = (sum(dist1,dist2,dist3,dist4,dist5,dist6,dist7,dist8,dist9,dist10))/2) # each pair is counted twice, so halve the total
  print(data$sumdistances) # print() also returns the vector, so the result can be assigned
}
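For comparison, steps 2 and 3 could probably be written more compactly with base R's split() and dist(), which would also avoid hard-coding 10 reference points. The sketch below is an assumption about an equivalent approach, not the function above; sumdist_compact, its xcol/ycol arguments and the toy data are hypothetical names and values chosen for illustration.

```r
# Hypothetical helper (not from the original code): sum of all pairwise
# distances per time point, for any number of reference points.
sumdist_compact <- function(data, xcol = 3, ycol = 4) {
  groups <- split(data[, c(xcol, ycol)], data$time)   # one data frame per time
  vapply(groups, function(g) sum(dist(g)), numeric(1))
}

# Small made-up example: 3 points at each of 2 times (3-4-5 and 6-8-10 triangles)
toy <- data.frame(
  time = c(1, 1, 1, 2, 2, 2),
  ref  = c(1, 2, 3, 1, 2, 3),
  x    = c(0, 3, 0, 0, 6, 0),
  y    = c(0, 0, 4, 0, 0, 8)
)
sumdist_compact(toy)   # named vector with one total per time point: 12 and 24
```

The result is a named numeric vector with one entry per time value, in sorted order of time.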
Running this function on my df uses only the first x and y (columns 3 and 4), but it works, resulting in a vector of length 2. The first value is for time 1 and the second is for time 2.
> sumdist(df) # this calculates it from the original x and y
[1] 2706.592 2275.045
step 4
I now want to repeat this across the multiple iterations I created earlier. For my actual data set, n will be in the thousands so I need to automate this process
sumd = matrix(NA, nrow=2, ncol=n+1) # collection matrix: nrow = number of times, ncol = number of simulations plus the original data
for(i in 1:(n+1)) {
datas = df[,c(1,2,((1+2*i)),(2+(2*i))),] # extracts the time, ref along with x and y for each simulations
sumd[i] = sumdist(datas) # runs function on each simulated data set
}
Because my function prints the calculated data at the end, running the code demonstrates that it has calculated what I want it to:
> for(i in 1:(n+1)) {
+ datas = df[,c(1,2,((1+2*i)),(2+(2*i))),] # extracts the time, ref along with x and y for each simulations
+ sumd[i] = sumdist(datas) # runs function on each simulated data set
+ }
[1] 2706.592 2275.045
[1] 2695.796 2282.284
[1] 2713.277 2288.517
[1] 2719.587 2273.316
The 4 printed rows above are what I want to calculate, although not quite in this layout.
Ideally it should look more like this:
time V2 V3 V4 V5
1 1 2706.592 2695.796 2713.277 2719.587
2 2 2275.045 2282.284 2288.517 2273.316
Step 5
But half of my matrix still contains NA, and it is filled like this:
> print(sumd)
[,1] [,2] [,3] [,4]
[1,] 2706.592 2713.277 NA NA
[2,] 2695.796 2719.587 NA NA
and the warnings I receive are these:
Warning messages:
1: In sumd[i] <- sumdist(datas) :
number of items to replace is not a multiple of replacement length
2: In sumd[i] <- sumdist(datas) :
number of items to replace is not a multiple of replacement length
3: In sumd[i] <- sumdist(datas) :
number of items to replace is not a multiple of replacement length
4: In sumd[i] <- sumdist(datas) :
number of items to replace is not a multiple of replacement length
What has gone wrong seems straightforward: the matrix I have created does not fit the output. I have tried altering the matrix in several ways so that it does fit, but I consistently receive the warning and ultimately can't seem to obtain a matrix or dataframe with the information I want.
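The fill pattern and the warning can be reproduced in isolation. The sketch below uses a made-up length-2 vector (vals) in place of the output of sumdist(), and shows how a single index on a matrix addresses one cell in column-major order, while an index after a comma addresses a whole column.

```r
m <- matrix(NA_real_, nrow = 2, ncol = 3)
vals <- c(10, 20)          # stands in for the length-2 result of sumdist()

# m[1] addresses ONE cell (linear, column-major index), so only vals[1] is
# stored and R warns that the lengths do not match.
m[1] <- vals               # warning; m[1, 1] becomes 10
m[2] <- vals               # warning; m[2, 1] becomes 10, not m[1, 2]

# m[, 3] addresses a whole column of length 2, so both values fit.
m[, 3] <- vals
m
```

This matches the matrix printed above: each loop iteration stored only the first value of its result, walking down column 1 and then column 2.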
Edit - I now understand the error in my initial code that prevented it from working, which is naturally quite simple: sumd[i] should read sumd[,i].
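With that change, the whole pipeline might look like the sketch below. To keep it self-contained it makes two assumptions: df is replaced by a small made-up data set (2 times, 4 reference points), and sumdist() is replaced by a base R stand-in called sumdist_standin, a hypothetical name. Only the sumd[, i] assignment and the final data.frame() call are the point here.

```r
set.seed(456)
n <- 3                                   # number of simulations

# Made-up stand-in for df: 2 times x 4 reference points
df <- data.frame(
  time = rep(1:2, each = 4),
  ref  = rep(1:4, times = 2),
  x    = c(1, 5, 9, 3, 2, 8, 4, 6),
  y    = c(2, 7, 1, 8, 5, 3, 9, 6)
)

# Step 1 as in the question: append n noisy copies of x and y
for (i in seq(5, 2 * n + 3, 2)) {
  df[, i]     <- df[, 3] + rnorm(nrow(df), 0, 1)
  df[, i + 1] <- df[, 4] + rnorm(nrow(df), 0, 1)
}

# Stand-in for sumdist(): one summed pairwise distance per time point
sumdist_standin <- function(data) {
  as.numeric(sapply(split(data[, 3:4], data$time), function(g) sum(dist(g))))
}

times <- sort(unique(df$time))
sumd <- matrix(NA_real_, nrow = length(times), ncol = n + 1)
for (i in 1:(n + 1)) {
  datas <- df[, c(1, 2, 1 + 2 * i, 2 + 2 * i)]   # time, ref, x and y for run i
  sumd[, i] <- sumdist_standin(datas)            # fill column i, not cell i
}

# One row per time, one column per data set (original first, then simulations)
result <- data.frame(time = times, sumd)
result
```

Sizing the matrix from length(times) instead of a hard-coded 2 should keep the loop working when the real data has more time points.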