
I have a data frame with observations in the following format:

(My actual data has more columns than these, but for clarity, these are the ones that matter.)

head(sampleDF, 20)
               Timestamp TimeIntoSession                      CorrelationGuid Position_x Position_z
1  11/22/2017 11:12:30 AM        1234.331 714e8a89-91a5-415b-b102-6ed5c0cf9f44      5.166      4.947
2  11/22/2017 11:12:30 AM        1234.397 714e8a89-91a5-415b-b102-6ed5c0cf9f44      5.155      4.902
3  11/22/2017 11:12:30 AM        1234.464 714e8a89-91a5-415b-b102-6ed5c0cf9f44      5.144      4.858
4  11/22/2017 11:12:30 AM        1234.547 714e8a89-91a5-415b-b102-6ed5c0cf9f44      5.133      4.810
5  11/22/2017 11:12:30 AM        1234.614 714e8a89-91a5-415b-b102-6ed5c0cf9f44      5.125      4.777
6  11/22/2017 11:12:30 AM        1234.697 714e8a89-91a5-415b-b102-6ed5c0cf9f44      5.118      4.741
7  11/22/2017 11:12:30 AM        1234.764 714e8a89-91a5-415b-b102-6ed5c0cf9f44      5.114      4.714
8  11/22/2017 11:12:30 AM        1234.847 714e8a89-91a5-415b-b102-6ed5c0cf9f44      5.114      4.683
9  11/22/2017 11:12:30 AM        1234.914 714e8a89-91a5-415b-b102-6ed5c0cf9f44      5.119      4.661
10 11/22/2017 11:12:30 AM        1234.997 714e8a89-91a5-415b-b102-6ed5c0cf9f44      5.128      4.639
11 11/22/2017 11:12:30 AM         327.341 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6      3.105      4.099
12 11/22/2017 11:12:30 AM         327.480 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6      3.104      4.100
13 11/22/2017 11:12:30 AM         327.557 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6      3.103      4.100
14 11/22/2017 11:12:30 AM         327.640 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6      3.103      4.099
15 11/22/2017 11:12:30 AM         327.723 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6      3.103      4.099
16 11/22/2017 11:12:30 AM         327.807 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6      3.103      4.098
17 11/22/2017 11:12:30 AM         327.890 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6      3.102      4.097
18 11/22/2017 11:12:30 AM         327.957 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6      3.101      4.096
19 11/22/2017 11:12:30 AM         328.040 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6      3.099      4.095
20 11/22/2017 11:12:30 AM         328.123 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6      3.096      4.094

For each row within a particular CorrelationGuid, I want to find the difference between the Euclidean norm of the position given by the current row's X and Z values and that of the previous row.

I can do this for the whole data frame like this:

library(dplyr)

norm_vec <- function(x, y) sqrt(x^2 + y^2)  # Euclidean norm of a 2-D position
sampleMag <- mutate(sampleDF, sqMag = norm_vec(Position_x, Position_z) -
                      norm_vec(lag(Position_x, default = 0), lag(Position_z, default = 0)))
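(As an aside, `default = 0` treats the row before the first one as the origin, so the first difference is just the distance from (0, 0). If I wanted NA there instead, I believe dropping `default` would do it, since `lag()` defaults to NA:)

sampleMag <- mutate(sampleDF, sqMag = norm_vec(Position_x, Position_z) -
                      norm_vec(lag(Position_x), lag(Position_z)))  # first row is NA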

But that computes the difference for every row across the whole data frame; I want it within each CorrelationGuid. That is, I don't want the first row of a new CorrelationGuid to look back at the last row of the previous CorrelationGuid when making the calculation.

I could try it for just one CorrelationGuid like this:

sampleMag <- mutate(sampleDF,
                    sqMag = ifelse(CorrelationGuid == "714e8a89-91a5-415b-b102-6ed5c0cf9f44",
                                   norm_vec(Position_x, Position_z) -
                                     norm_vec(lag(Position_x, default = 0), lag(Position_z, default = 0)), NA))

But that's not really what I want; I want to do this for every CorrelationGuid, and not have NAs for all but one.

I can easily generate a list of the unique CorrelationGuid values using unique() or distinct(), but what is the best way to run the logic above once for each unique CorrelationGuid?

I could find the first and last row of each CorrelationGuid and then loop over those ranges, but a for loop is going to be very slow here, particularly on a large dataset.

apply() seems appropriate, but I'm not sure how best to use it here.
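For what it's worth, base R's ave() seems close, since it applies a function within each group and writes the results back in the original row order. A rough sketch (keeping my default = 0 convention, so each group's first row measures the distance from the origin):

norms <- with(sampleDF, sqrt(Position_x^2 + Position_z^2))  # per-row Euclidean norm
# Within each CorrelationGuid, subtract the previous row's norm (0 before the first row)
sampleDF$sqMag <- ave(norms, sampleDF$CorrelationGuid,
                      FUN = function(n) n - c(0, n[-length(n)]))

But I suspect there's a more idiomatic way.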

TMIB
  • I wonder if you want to use `group_by(CorrelationGuid)`. – jazzurro Jan 18 '18 at 00:17
  • That certainly seems like the right first step; I've been playing around with that for a bit, but I'm confused as to how to then apply my calculations to the grouped variables. – TMIB Jan 18 '18 at 00:23
  • What happens if you write `group_by(sampleDF, CorrelationGuid) %>% mutate(sampleDF, sqMag = norm_vec(Position_x, Position_z) - norm_vec(lag(Position_x, default = 0), lag(Position_z, default = 0)))`? – jazzurro Jan 18 '18 at 00:26
  • `Error in eval(substitute(expr), envir, enclos) : impossible to replicate vector of size 5 ` I think you are on the right track here. It seems like a combination of group_by and do() or apply() might be the key. The syntax is eluding me though. – TMIB Jan 18 '18 at 00:30
  • Yeah you definitely need `group_by()`. It seems to me that this is a great opportunity for you to learn how to use the function. You then probably have to modify something in the mutate part. – jazzurro Jan 18 '18 at 00:33
  • `group_by(sampleDF, CorrelationGuid) %>% mutate(sqMag = norm_vec(Position_x, Position_z) - norm_vec(lag(Position_x, default = 0), lag(Position_z, default = 0)))` is working. Is this what you want? The previous code I sent you had `sampleDF` in `mutate()`. That is why the code did not work. – jazzurro Jan 18 '18 at 00:44
  • (deleted my comment, as I just posted the same thing you did.) Thanks! – TMIB Jan 18 '18 at 00:49
  • This was a good chance for you to learn `group_by()`. I hope you get used to it. – jazzurro Jan 18 '18 at 00:51

1 Answer


Just to put a bow on all the discussion in the comments, I believe you are looking for what @jazzurro provided:

sampleDF <- sampleDF %>% group_by(CorrelationGuid) %>% 
    mutate(sqMag = norm_vec(Position_x, Position_z) - norm_vec(lag(Position_x, default = 0), lag(Position_z, default = 0)))
sampleDF
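Note that the result is still grouped by CorrelationGuid, which affects any later summaries; if you don't want that, drop the grouping afterwards:

sampleDF <- sampleDF %>% ungroup()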

with the following dput() output to reproduce the data:

structure(list(CorrelationGuid = c("714e8a89-91a5-415b-b102-6ed5c0cf9f44", 
"714e8a89-91a5-415b-b102-6ed5c0cf9f44", "714e8a89-91a5-415b-b102-6ed5c0cf9f44", 
"714e8a89-91a5-415b-b102-6ed5c0cf9f44", "714e8a89-91a5-415b-b102-6ed5c0cf9f44", 
"714e8a89-91a5-415b-b102-6ed5c0cf9f44", "714e8a89-91a5-415b-b102-6ed5c0cf9f44", 
"714e8a89-91a5-415b-b102-6ed5c0cf9f44", "714e8a89-91a5-415b-b102-6ed5c0cf9f44", 
"714e8a89-91a5-415b-b102-6ed5c0cf9f44", "22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6", 
"22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6", "22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6", 
"22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6", "22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6", 
"22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6", "22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6", 
"22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6", "22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6", 
"22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6"), Position_x = c(5.166, 
5.155, 5.144, 5.133, 5.125, 5.118, 5.114, 5.114, 5.119, 5.128, 
3.105, 3.104, 3.103, 3.103, 3.103, 3.103, 3.102, 3.101, 3.099, 
3.096), Position_z = c(4.947, 4.902, 4.858, 4.81, 4.777, 4.741, 
4.714, 4.683, 4.661, 4.639, 4.099, 4.1, 4.1, 4.099, 4.099, 4.098, 
4.097, 4.096, 4.095, 4.094), sqMag = c(7.15264741197272, -0.0390246359183104, 
-0.0382499943559047, -0.0409013019070033, -0.0283774181905558, 
-0.0296332826672581, -0.0212618601391981, -0.0209732951217187, 
-0.0111423510261082, -0.00812087217042912, 5.1422588421821, 0.000193491092365328, 
-0.000603541254226236, -0.000797343143268847, 0, -0.000797272282198946, 
-0.00140089984128977, -0.00140089254591658, -0.0020043998541075, 
-0.00260744523674017)), .Names = c("CorrelationGuid", "Position_x", 
"Position_z", "sqMag"), class = c("grouped_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -20L), vars = "CorrelationGuid", labels = structure(list(
    CorrelationGuid = c("22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6", 
    "714e8a89-91a5-415b-b102-6ed5c0cf9f44")), class = "data.frame", row.names = c(NA, 
-2L), vars = "CorrelationGuid", drop = TRUE, .Names = "CorrelationGuid"), indices = list(
    10:19, 0:9), drop = TRUE, group_sizes = c(10L, 10L), biggest_group_size = 10L)
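As a quick sanity check, the first row of each group should have sqMag equal to that row's distance from the origin, since the lagged position defaults to (0, 0):

sampleDF %>% group_by(CorrelationGuid) %>% slice(1)
# sqMag equals sqrt(Position_x^2 + Position_z^2) on those rows (7.1526... and 5.1423...)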
Chuck P