I have a data frame with observations in the following format:
(my actual data has more columns than just this, but for the sake of clarity, these are what are important)
head(sampleDF, 20)
Timestamp TimeIntoSession CorrelationGuid Position_x Position_z
1 11/22/2017 11:12:30 AM 1234.331 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.166 4.947
2 11/22/2017 11:12:30 AM 1234.397 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.155 4.902
3 11/22/2017 11:12:30 AM 1234.464 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.144 4.858
4 11/22/2017 11:12:30 AM 1234.547 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.133 4.810
5 11/22/2017 11:12:30 AM 1234.614 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.125 4.777
6 11/22/2017 11:12:30 AM 1234.697 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.118 4.741
7 11/22/2017 11:12:30 AM 1234.764 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.114 4.714
8 11/22/2017 11:12:30 AM 1234.847 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.114 4.683
9 11/22/2017 11:12:30 AM 1234.914 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.119 4.661
10 11/22/2017 11:12:30 AM 1234.997 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.128 4.639
11 11/22/2017 11:12:30 AM 327.341 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.105 4.099
12 11/22/2017 11:12:30 AM 327.480 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.104 4.100
13 11/22/2017 11:12:30 AM 327.557 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.103 4.100
14 11/22/2017 11:12:30 AM 327.640 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.103 4.099
15 11/22/2017 11:12:30 AM 327.723 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.103 4.099
16 11/22/2017 11:12:30 AM 327.807 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.103 4.098
17 11/22/2017 11:12:30 AM 327.890 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.102 4.097
18 11/22/2017 11:12:30 AM 327.957 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.101 4.096
19 11/22/2017 11:12:30 AM 328.040 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.099 4.095
20 11/22/2017 11:12:30 AM 328.123 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.096 4.094
For each row in a particular CorrelationGuid, I want to find the difference between the Euclidean norm of the position defined by the X & Z values in the current row, and those of the previous row.
I can do this for the whole data frame like this:
norm_vec <- function(x,y) sqrt(x^2 + y^2)
sampleMag<- mutate(sampleDF, sqMag = norm_vec(Position_x, Position_z) - norm_vec(lag(Position_x, default = 0), lag(Position_z, default = 0)))
But that gives the difference for every row; I want to do it within each CorrelationGuid; that is, I don't want the first row of a new CorrelationGuid to look at the last row of the previous CorrelationGuid when making the calculation.
I could try it for just one CorrelationGuid like this:
sampleMag<- mutate(sampleDF, sqMag = ifelse(CorrelationGuid == "714e8a89-91a5-415b-b102-6ed5c0cf9f44",
(norm_vec(Position_x, Position_z) - norm_vec(lag(Position_x, default = 0), lag(Position_z, default = 0))), NA))
But that's not really what I want; I want to do this for every CorrelationGuid, and not have NAs for all but one.
I can easily generate a list of the unique CorrelationGuid values using unique() or distinct(), but what is the best way to run the logic above once for each unique CorrelationGuid?
I could find the first and last instance of each CorrelationGuid, and then loop through it, but a for loop is going to be very slow here, particularly if this is done on a large dataset.
apply seems appropriate, but I'm not sure how best to use it here.