3

I am looking for a way to, within id groups, count unique occurrences of value shifts in TF in the data datatbl.

I want to count both forward and backwards from when TF changes between 1 and 0 or o and 1. The counting is to be stored in a new variable PM##, so that the PM##s holds each unique shift in TF, in both plus and minus. The MWE below leads to an outcome with 7 PM, but my production data can have 15 or more shifts. If a TF values does not change between NA's I want to mark it 0.

This question is similar to a question I previously asked, but the last part about TF standing alone is new. Both Uwe and Psidom provided elegant answers to the initial question using data.table here and using tidyverse here. after conferencing with Uwe, I am posting this slightly modified version of my question.

If this question violates any SO policies please let me know and I'll be happy to reopen my initial question or append this an bounty-issue.

To illustrate my question with a minimal working example. I have data like this,

what I have,

# install.packages(c("tidyverse"), dependencies = TRUE)
library(tibble)
tbl <- tibble(id = c(rep(10L, 17L), rep(0L, 13L), rep(1L, 10L)), 
       TF = c(NA, NA, 0, NA, 0, NA, 1, 1, 1, 1, 1, NA, 1, 0, 1, 0, 1, NA, 0L, NA, 0L, 
       0L, 1L, 1L, 1L, 0L, 0L, NA, NA, 0L, NA, 0L, 0L, 0L, 1L, 1L, 1L, 0L, NA, 1L))
tbl %>% print(n=18)
#> # A tibble: 40 x 2
#>       id    TF
#>    <int> <dbl>
#>  1    10    NA
#>  2    10    NA
#>  3    10     0
#>  4    10    NA
#>  5    10     0
#>  6    10    NA
#>  7    10     1
#>  8    10     1
#>  9    10     1
#> 10    10     1
#> 11    10     1
#> 12    10    NA
#> 13    10     1
#> 14    10     0
#> 15    10     1
#> 16    10     0
#> 17    10     1
#> 18     0    NA
#> # ... with 22 more rows

what I am trying to obtain,

tblPM <- structure(list(id = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), TF = c(NA, NA, 0, NA, 0, NA, 1, 1, 1, 1, 1, 
NA, 1, 0, 1, 0, 1, NA, 0, NA, 0, 0, 1, 1, 1, 0, 0, 
NA, NA, 0, NA, 0, 0, 0, 1, 1, 1, 0, NA, 1), PM01 = c(NA, 
NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -3L, 
-2L, -1L, 1L, 2L, 3L, NA, NA, NA), PM02 = c(NA, NA, NA, NA, 0L, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -2L, 
-1L, 1L, 2L, 3L, NA, NA, NA, NA, NA, NA, NA, NA, NA, -3L, -2L, 
-1L, 1L, NA, NA), PM03 = c(NA, NA, NA, NA, NA, NA, 0L, 0L, 0L, 
0L, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -3L, -2L, 
-1L, 1L, 2L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
0L), PM04 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
-1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), PM05 = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -1L, 1L, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA), PM06 = c(NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, -1L, 1L, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA), PM07 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, -1L, 1L, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
)), .Names = c("id", "TF", "PM01", "PM02", "PM03", "PM04", "PM05", 
"PM06", "PM07"), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -40L
))


tblPM %>% print(n=18)  
#> # A tibble: 40 x 9
#>       id    TF  PM01  PM02  PM03  PM04  PM05  PM06  PM07
#>    <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
#>  1    10    NA    NA    NA    NA    NA    NA    NA    NA
#>  2    10    NA    NA    NA    NA    NA    NA    NA    NA
#>  3    10     0     0    NA    NA    NA    NA    NA    NA
#>  4    10    NA    NA    NA    NA    NA    NA    NA    NA
#>  5    10     0    NA     0    NA    NA    NA    NA    NA
#>  6    10    NA    NA    NA    NA    NA    NA    NA    NA
#>  7    10     1    NA    NA     0    NA    NA    NA    NA
#>  8    10     1    NA    NA     0    NA    NA    NA    NA
#>  9    10     1    NA    NA     0    NA    NA    NA    NA
#> 10    10     1    NA    NA     0    NA    NA    NA    NA
#> 11    10     1    NA    NA     0    NA    NA    NA    NA
#> 12    10    NA    NA    NA    NA    NA    NA    NA    NA
#> 13    10     1    NA    NA    NA    -1    NA    NA    NA
#> 14    10     0    NA    NA    NA     1    -1    NA    NA
#> 15    10     1    NA    NA    NA    NA     1    -1    NA
#> 16    10     0    NA    NA    NA    NA    NA     1    -1
#> 17    10     1    NA    NA    NA    NA    NA    NA     1
#> 18     0    NA    NA    NA    NA    NA    NA    NA    NA
#> # ... with 22 more rows 

identical([some solution], tblPM)
#> [1] TRUE

update w/ microbenchmark 2018-01-24 14:20:18Z,

Thanks to Fierr and Chris for taking the time to tease out the logic and submit an answer. Inspired my this setup I've computed a small microbenchmark comparison of thier functions. I put Fierrs answer into the functiontidyverse_Fierr()and Chris' answer intodt_Chris()` (if someone want the exact functions please let me know and I'll add them here.

After some minor tweaks they both come out identical when match with tblPM, i.e.

identical(tblPM, tidyverse_Fierr(tbl))
#> [1] TRUE
identical(tblPM, dt_Chris(tbl))
#> [1] TRUE

Now to the quick microbenchmark,

df_test <- bind_rows(rep(list(tbl), 111))
microbenchmark::microbenchmark(tidyverse_Fierr(df_test), dt_Chris(df_test), times = 3*1)
#> Unit: milliseconds
#>                      expr      min       mean   median        uq         max neval cld
#> tidyverse_Fierr(df_test) 19503.366  20171.268 20080.99 20505.219  20929.4489     3   b
#>        dt_Chris(df_test)   199.165    233.924   203.72   251.304    298.8887     3   a 

Interestingly the tidy_method comes out way faster in this kinda similar comparison.

Eric Fail
  • 8,191
  • 8
  • 72
  • 128
  • What is the expected result in case of `tbl <- tibble(id = c(rep(0L, 13L), rep(1L, 10L)), TF = c(NA, 0L, NA, 0L, 0L, 1L, 1L, 1L, 0L, 0L, NA, NA, 0L, NA, 0L, 0L, 0L, 1L, 1L, 1L, 0L, NA, 1L)) ` ? – Uwe Jan 17 '18 at 18:56
  • @Uwe, Good question! I've tried to illustrate how I imagine the logic would extend to in the case you put forward by updating my question. Please don't hesitate to chime in if something seems off. I could have overlooked something. – Eric Fail Jan 17 '18 at 20:23
  • @Uwe, I've updated my question a second time, to take account for the ambiguities you point out from the top and to add even some more complex situations in an `id` 10. – Eric Fail Jan 18 '18 at 15:44
  • I'm not sure if I understand your output example from `PM04` to `PM07`. It's always `-1, 1` in the same direction - maybe you can talk through rows 13 - 15 in more detail? – RolandASc Jan 22 '18 at 10:26
  • Thank you for your question! I will be happy to. From row `13` to `14` there is a shift in `TF` from `1` to `0`, then in row `14` `TF` _goes back_ to `1`, i.e. it's a one step shift in both directions. As row `13` is only _one step_ away from _the shift_, row `13` gets an `-1`. Row `14` gets an one, `1`, as this is _one away_ form the shift (this is stored in column `PM04` i `tblPM`). Next shift is now from row `14` to `15`, where row `14` is now `-1`, _away from the shift_, and row `15` is now `1` _away from the shift_. This pattern continues to row `17`. Does this answer your question? – Eric Fail Jan 22 '18 at 11:04
  • The logic might be easier to follow in row `21` to `27`, i.e. `tblPM[21:27,]` where I have some longer spells of `1`s and `0`s. This create a longer countdown and count_away_ from the shift. I appreciate your ask! – Eric Fail Jan 22 '18 at 11:06
  • Would you mind commenting `tblPM[15,6]` ? – MrSmithGoesToWashington Jan 22 '18 at 13:41
  • @MrSmithGoesToWashington, thank you for your question. `tblPM[15,6]` is `NA` as `TF`'s value _no longer_ is unique. That is, I want to count occurrences of unique value-shift in `TF`, and since `tblPM[13,2]` is equal to `1`, `tblPM[14,2]` is equal to `0` the shift from `0` to `1` between `tblPM[14,2]` and `tblPM[15,2]` is not unique. Hence `PM04` ends at `tblPM[14,6]` and `tblPM[15,2]` is equal to `NA`. Does that answer your question? (please don't hesitate to let me know if you see anything inconsistent in my reasoning. I could have confused things or made a typo) – Eric Fail Jan 22 '18 at 15:12
  • doesn't make sense to me. For me PM05 should be 1,-1 (change TF 0 to 1, and then 1 to 0, lines 14 to 15, and then 15 to 16) and PM07 should be 1,NA as TF changes from 0 to 1, and 1 to NA (lines 16 to 17, and 17 to 18) – denis Jan 22 '18 at 15:28
  • @denis, thank you for your comment. I should have made it clearer that it is not about he value of `TF`, but merely the shifts in value. Whenever there's a shift I want to count backwards, i.e. `-…, -3, -2,-1` _up to that shift_ and again count away from the shift, , i.e. `1, 2, 3, …`. The logic might be easier to follow in row `21` to `27`, i.e. `tblPM[21:27,]` where I have some longer spells of `1`s and `0`s. Does stat make sense? – Eric Fail Jan 22 '18 at 15:34
  • I'm still lost .. why a différence of two between `tblPM[22,4]` and `tblPM[23,4]` ? – MrSmithGoesToWashington Jan 22 '18 at 15:40
  • @MrSmithGoesToWashington, Good point. You are right. And there is no _difference of two_ between any of the unique value shifts in `TF`. I've however chosen to start the counting on both sides of the shifts at `1` and `-1` respectively. Zero is deliberately not in the counting as I’ve reserved it to indicate series of values in `TF` that does not change, e.g. the series of `1`s from `tblPM[7,2]` to `tblPM[11,2]` generate the string of `0`s from `tblPM[7,5]` to `tblPM[11,5]` in `PM03`. – Eric Fail Jan 22 '18 at 15:53

2 Answers2

2

Here is a script approach - given the amount of custom treatment for each case (TF = NA, uniqueN(TF) = 1, uniqueN(TF) = 2, I think this is likely clearer to implement vs. a dplyr chain. Should be fairly quick as it is all data.table based. Open to suggestions on how to improve!

This will expand automatically as the number of PM columns required increases - as I commented below, I would recommend getting rid of the 0 prefix in the column, as there may be a case where you get to 10^2..n columns which would bump to PM001.

library(data.table)
tbl3 <- data.table(id = c(rep(10L, 17L), rep(0L, 13L), rep(1L, 10L)), 
                   TF = c(NA, NA, 0L, NA, 0L, NA, 1L, 1L, 1L, 1L, 1L, NA, 1L, 0L, 1L, 0L, 1L, NA, 0L, NA, 0L, 
                          0L, 1L, 1L, 1L, 0L, 0L, NA, NA, 0L, NA, 0L, 0L, 0L, 1L, 1L, 1L, 0L, NA, 1L))


# create index to untimately join back to
tbl3[, row_idx := .I]

# all transformations on a replicated data.table
tbl3_tmp <- copy(tbl3)

# identify where the NA breaks occur - this splits each id into subgroups (id_group)
tbl3_tmp[, P_TF := shift(TF, 1, "lag", fill = NA), by = .(id)]
tbl3_tmp[, TF_break := is.na(TF) | is.na(P_TF)]
tbl3_tmp[, id_group := cumsum(TF_break), by = .(id)]

tbl3_tmp[, `:=`(TF_break = NULL, P_TF = NULL)] # above can be consolidated to one line which would make this line unneccesary - expanded for easier understanding
tbl3_tmp <- tbl3_tmp[!is.na(TF)] # NA rows can be safely ignored now - these will be all NA, and will be handled with the left join below

# find where subpatterns exist (runs of 0..1 or 1..0)
tbl3_tmp[, subpattern_break := TF != shift(TF, 1, "lag", fill = NA), by = .(id, id_group)]
tbl3_tmp[, subbreaks := sum(subpattern_break, na.rm = TRUE), by = .(id, id_group)] # if there are no breaks, we need to treat separately

# two cases: zero subbreaks and multiple subbreaks. 
tbl3_zeros <- tbl3_tmp[subbreaks == 0]
tbl3_nonzeros <- tbl3_tmp[subbreaks > 0]

# for 1+ subbreaks, we need to double the rows - this allows us to easily create the PM_field both "forwards" and "backwards"
tbl3_nonzeros[is.na(subpattern_break), subpattern_break := TRUE]
tbl3_nonzeros[, subbreak_index := cumsum(subpattern_break), by = .(id, id_group)]

tbl3_nonzeros <- rbindlist(list(tbl3_nonzeros,tbl3_nonzeros), idcol = "base") # double the row

tbl3_nonzeros[base == 1 & subbreak_index %% 2 == 1, subbreak_index := subbreak_index + 1L] # round to nearest even
tbl3_nonzeros[base == 2 & subbreak_index %% 2 == 0, subbreak_index := subbreak_index + 1L] # round to nearest odd

# this creates an index when the subbreak starts - allows us to sequence PM properly
tbl3_nonzeros[,subbreak_start := min(row_idx), by = .(id, id_group, subbreak_index)]

# exclude the ends if there is only one unique TF value - might be able to get this to one line
tbl3_nonzeros[, TF_count := uniqueN(TF), by = .(id, id_group, subbreak_index)]
tbl3_nonzeros <- tbl3_nonzeros[TF_count > 1]

# create a 1..N column, subtract the index where the break occurs ,then add 1 to all 0+ values.
tbl3_nonzeros[,PM_field := 1:.N, by = .(id, id_group, subbreak_index)]
tbl3_nonzeros[, PM_field := PM_field - PM_field[which(diff(TF)!=0)[1]+1], by = .(id, id_group, subbreak_index)]
tbl3_nonzeros[PM_field >= 0, PM_field := PM_field + 1L] # base 1 after the break

# create subbreaks for zero groups
tbl3_zeros[,subbreak_start := min(row_idx), by = .(id, id_group)]

# bring zero and non zero case together
tbl3_zeros <- tbl3_zeros[, .(id, id_group, subbreak_start,row_idx = row_idx, PM_field = 0L)]
tbl3_nonzeros <- tbl3_nonzeros[,.(id, id_group, subbreak_start, row_idx, PM_field)]
tbl3_tmp <- rbindlist(list(tbl3_zeros, tbl3_nonzeros))

# Create header
tbl3_tmp <- tbl3_tmp[order(subbreak_start, PM_field)] 
tbl3_tmp[, PM_header := paste0("PM0",cumsum(c(1,diff(subbreak_start)!=0)),sep = ""), by = .(id)] # I would remove 0 in PM0 here (kept for identical check)- inefficient to check if this will be 1, 2, 3 etc digits This could also be solved with; `paste0("PM", sprintf("%02d", cumsum(c(1, diff(subbreak_start) != 0))))`

# long to wide
tbl3_tmp <- dcast(tbl3_tmp, row_idx ~ PM_header, value.var = "PM_field", fun.aggregate = sum, fill = NA)

# merge back to initial dataframe
tblPM_frombase <- merge(tbl3, tbl3_tmp, by = "row_idx", all.x = TRUE)[, row_idx := NULL]

identical(tblPM, tblPM_frombase)
[1] TRUE
Eric Fail
  • 8,191
  • 8
  • 72
  • 128
Chris
  • 6,302
  • 1
  • 27
  • 54
  • Thank you for taking the time to write up a very transparent answer. It is very educational. Thanks! I've added a small update to my question with a `microbenchmark` comparison of the different answers. I think you will find it interesting. – Eric Fail Jan 24 '18 at 14:28
  • @EricFail as I read your benchmarks this method is the fastest, no? Median runttime of 203 vs. tidy (403) and tidyverse (20,080) – Chris Jan 24 '18 at 14:36
  • Yes, this method, i.e. `dt_Chris`, is the fastest. The benchmark is however only comparing two methods. This one and `tidyverse_Fierr`, with median runtimes of `233.9249` and `20171.2683` milliseconds respectively. In other words, this method only takes approximately 1.16% of the time the `tidyverse` methods uses. As I note above, I do find it interesting that the `tidy_method` comes out way faster in this [kinda similar comparison (see at the bottom of the answer)](https://stackoverflow.com/a/46871196/1305688). – Eric Fail Jan 24 '18 at 14:52
  • @EricFail you should add the `tidy_method` to your benchmark above. I think my method should still be faster, but it also may not be a fair comparison because the `tidy_method` does not pass the `identical` test – Chris Jan 24 '18 at 15:54
  • Exactly. I think we should keep this question focused on the methods the passes the `identical` test. That's kinda why I open the benchmark comparison by running them. I would be very interested to see the `tidy_method` rewritten to pass the test and then run it though the `microbenchmark`, but I don't think that gonna happen right now. – Eric Fail Jan 24 '18 at 16:14
1

Liked the challenge to uncover the logic of this one. The approach is based on tidyverse. Suggestions on tidying it even more are welcome!

library(data.table)
library(purrr)
library(dplyr)
library(tibble)

tbl <- tibble(id = c(rep(10L, 17L), rep(0L, 13L), rep(1L, 10L)), 
              TF = c(NA, NA, 0, NA, 0, NA, 1, 1, 1, 1, 1, NA, 1, 0, 1, 0, 1, NA, 0L, NA, 0L, 
                     0L, 1L, 1L, 1L, 0L, 0L, NA, NA, 0L, NA, 0L, 0L, 0L, 1L, 1L, 1L, 0L, NA, 1L))

tbl <- mutate(tbl, rn = 1:n())

lookup_table <- tbl %>%
  group_by(id) %>%
  mutate(rl         = rleid(TF)) %>%
  group_by(id, rl, TF) %>%
  summarise(n=n()) %>%
  group_by(id) %>%
  mutate(lag        = lag(TF, order_by=id),
         lead       = lead(TF, order_by=id),
         test       = ifelse(is.na(lag) & is.na(lead), 1, 0)) %>%
  select(id, rl, test)

tmp <- tbl %>%
  group_by(id) %>%
  mutate(rl         = rleid(TF),
         rl_nona    = ifelse(is.na(TF), NA, rleid(rl)),
         rl_nona    = match(rl_nona, unique(na.omit(rl_nona)))) %>%   # Re-indexing
  left_join(lookup_table, by = c("id" = "id", "rl" = "rl")) %>%
  mutate(TF_new     = ifelse(test == 1, NA, TF),
         rl_gap     = ifelse(is.na(TF_new), NA, rleid(TF_new)),
         rl_gap     = match(rl_gap, unique(na.omit(rl_gap))),         # Re-indexing
         up_pos     = ifelse(min(rl_gap, na.rm=TRUE)==rl_gap, NA, rl_gap),
         down_pos   = ifelse(max(rl_gap, na.rm=TRUE)==rl_gap, NA, rl_gap)) %>% 
  group_by(id, rl_gap) %>%
  mutate(up         = ifelse(is.na(up_pos), 0, seq_len(n())),
         down       = ifelse(is.na(down_pos), 0, -rev(seq_len(n())))) %>%
  group_by(id) %>%
  mutate(zero_pos   = ifelse(test == 1 & rl_nona > max(rl_gap, na.rm = TRUE), rl_nona - 1, rl_nona)) # Correct placement of zeroes

up   <- dcast(tmp, rn ~ rl_nona, value.var = 'up'  , fill = 0)
down <- dcast(tmp, rn ~ rl_nona, value.var = 'down', fill = 0)

res <- (down[, 2:max(tmp$rl_nona, na.rm=TRUE)] + up[, 3:(max(tmp$rl_nona, na.rm=TRUE)+1)]) %>%
  mutate_all(funs(replace(., which(.==0), NA))) %>%
  bind_cols(rn = tmp$rn, test = tmp$test, zero_pos = tmp$zero_pos) %>%
  right_join(tbl, by = "rn") %>%
  mutate(`PM01` = ifelse(test == 1 & zero_pos == 1, 0, `1`)) %>%
  mutate(`PM02` = ifelse(test == 1 & zero_pos == 2, 0, `2`)) %>%
  mutate(`PM03` = ifelse(test == 1 & zero_pos == 3, 0, `3`)) %>%
  mutate(`PM04` = ifelse(test == 1 & zero_pos == 4, 0, `4`)) %>%
  mutate(`PM05` = ifelse(test == 1 & zero_pos == 5, 0, `5`)) %>%
  mutate(`PM06` = ifelse(test == 1 & zero_pos == 6, 0, `6`)) %>%
  mutate(`PM07` = ifelse(test == 1 & zero_pos == 7, 0, `7`)) %>%
  select(id, TF, everything(), -rn, -test, -zero_pos, -c(1:7)) %>%
  mutate_if(is.numeric, as.integer) %>%
  as.tibble()

identical(tblPM, res)
Fierr
  • 185
  • 2
  • 10
  • Thank you for your answer. I appreciate you took the time to tease out the logic. Did you see the [_tidy method_ posted by Psidom here](https://stackoverflow.com/a/46871196/1305688) that does something similar? I imagine his `forward_count` could be recycled in your tidyverse approach. Psidom’s approach is, in addition, not restricted to `PM07`, which could become a bit of an issue with the production data. Regardless, I very much appreciate you took the time to write an answer. Let's work to optimize it; possibly using @Psidom’s work. – Eric Fail Jan 22 '18 at 18:03
  • I've added a small update to my question with a `microbenchmark` comparison of the different answers. Kinda surprising as to how different to the two answers comes out. In particular compared to the kinda similar [comparison Psidom provided at the bottom of his answer](https://stackoverflow.com/a/46871196/1305688). – Eric Fail Jan 24 '18 at 14:30