id  current stage  previous stages
1   06             05
1   06             03

2   04             03
2   04             02

Suppose an id has 5 stages (02, 03, etc.). An id should go through each of the stages. In this example, id 1 skips stages 04 and 02, but id 2 passes through all of them. So the previous stages should be current stage - 1, current stage - 2, and so on.

I have to identify the ids that skip stages. I need to do it in R or with a Hadoop query.

OneCricketeer
umakant

1 Answer

If I understood the question correctly, you can try the dplyr solution below.

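library(dplyr)

df %>%
  group_by(id, current_stage) %>%
  # collapse the stages each id actually passed through, highest first
  summarise(all_prev_stages = paste(sort(previous_stages, decreasing = TRUE), collapse = ",")) %>%
  # expected sequence from current_stage - 1 down to stage 2 (assumes one current_stage per id)
  mutate(posible_prev_stages = paste(seq(current_stage - 1, 2), collapse = ",")) %>%
  # keep only ids whose actual sequence differs from the expected one
  filter(all_prev_stages != posible_prev_stages) %>%
  select(id)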

This gives the list of ids which skip stages (i.e. id = 1 in your sample data):

     id
1     1

Sample data:

df <- structure(list(id = c(1L, 1L, 2L, 2L), current_stage = c(6L, 
6L, 4L, 4L), previous_stages = c(5L, 3L, 3L, 2L)), .Names = c("id", 
"current_stage", "previous_stages"), class = "data.frame", row.names = c(NA, 
-4L))
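
As a cross-check, here is a minimal base R sketch of the same idea that compares sets of stages instead of pasted strings, so it also reports which stages are missing. It is only an illustration: `first_stage` and the helper `missing_stages` are hypothetical names, and it assumes that stages are consecutive integers starting at 2 and that each id has a single current stage.

```r
# Sample data, same values as in the answer above
df <- data.frame(
  id = c(1L, 1L, 2L, 2L),
  current_stage = c(6L, 6L, 4L, 4L),
  previous_stages = c(5L, 3L, 3L, 2L)
)

first_stage <- 2L  # assumption: the lowest stage is 02, as in the question

# Hypothetical helper: stages an id should have visited but did not
missing_stages <- function(d) {
  current <- max(d$current_stage)
  # nothing can be skipped if the id is still at the first stage
  if (current <= first_stage) return(integer(0))
  expected <- seq(first_stage, current - 1L)
  setdiff(expected, d$previous_stages)
}

# One entry per id, then keep only ids with at least one missing stage
skipped <- lapply(split(df, df$id), missing_stages)
skipped <- skipped[lengths(skipped) > 0]
skipped
```

With the sample data, only id 1 remains in `skipped`, with stages 2 and 4 missing, which matches the example in the question. `names(skipped)` gives the ids as character strings.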
Prem
  • Thanks a lot for the prompt response. I will test it and let you know for sure. Hats off. – umakant Feb 06 '18 at 10:13
  • Hi Prem, everything works fine except `paste(seq(current_stage-1, 2)`. I also need to include the current stage in it, e.g. if the current stage is 04, the previous stages should be 04,03,02. Sorry for not stating it earlier. – umakant Feb 06 '18 at 10:47
  • It won't matter. If you want to include the current stage in `posible_prev_stages`, you'll also have to include it in `all_prev_stages` to get the desired result. Also, you seem to be interested only in the list of ids, not in these intermediate columns. Correct me if I am wrong. – Prem Feb 06 '18 at 10:57
  • Thanks for the response. Yes, I need to include it in `posible_prev_stages`, otherwise it would show the wrong result. What should be added here: `paste(seq(stage-1, 2)`? For example, if the current stage is 5, all possible previous stages should be (5,4,3,2). – umakant Feb 06 '18 at 11:24
  • Simply replace the code above with `all_prev_stages = paste(sort(c(unique(current_stage), previous_stages), decreasing = T), collapse = ",")` and `posible_prev_stages = paste(seq(current_stage, 2), collapse = ",")` – Prem Feb 06 '18 at 11:32
  • Adding only `paste(seq(current_stage, 2)` worked. Thanks a lot for this assistance, great work indeed. You understood what I didn't make clear here. Kudos. – umakant Feb 06 '18 at 11:56
  • How can I do it? – umakant Feb 06 '18 at 13:13