I'm trying to set up my dataset for event-history analysis, and for this I need to define a new column. My dataset has the following form:

ID   Var1
1    10
1    20  
1    30  
1    10
2    4
2    5
2    10
2    5
3    1
3    15
3    20
3    9
4    18
4    32
4    NA
4    12
5    2
5    NA
5    8
5    3

And I want to get to the following form:

ID   Var1   Var2
1    10      0
1    20      0
1    30      1
1    10      0
2    4       0
2    5       0
2    10      0
2    5       0
3    1       0
3    15      0
3    20      1
3    9       0
4    18      0
4    32      NA
4    NA      1
4    12      0
5    2       NA
5    NA      0
5    8       1
5    3       0

In words: I want the new variable to indicate whether the value of Var1 drops below 50% of the maximum value that Var1 reaches within its group. Whether the last value is NA or 0 is not really important, although NA would make more sense from a theoretical perspective. I've tried using something like

DF$Var2 <- df %>%
  group_by(ID) %>%
  ifelse(df == ave(df$Var1,df$ID, FUN = max), 0,1)

and then lagging the result by 1, but it returns an error about an unused argument 1 in ifelse.

Thanks for your solutions!

  • Is your expected output correct? Try something like `df %>% group_by(ID) %>% mutate(Var2 = as.integer(Var1 > 0.5*max(Var1)))` – Sotos Jul 29 '20 at 08:21

1 Answer

Here is a base R option via ave + cummax

within(df,Var2 <- ave(Var1,ID,FUN = function(x) c((x<max(x)/2 & cummax(x)==max(x))[-1],0)))

which gives

> within(df,Var2 <- ave(Var1,ID,FUN = function(x) c((x<max(x)/2 & cummax(x)==max(x))[-1],0)))
   ID Var1 Var2
1   1   10    0
2   1   20    0
3   1   30    1
4   1   10    0
5   2    4    0
6   2    5    0
7   2   10    0
8   2    5    0
9   3    1    0
10  3   15    0
11  3   20    1
12  3    9    0
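
To see how the one-liner arrives at this, it can help to evaluate its pieces for a single group. The sketch below uses the Var1 values for ID = 1 from the data; the intermediate names (`below_half`, `peak_reached`, `drop_here`, `flag_next`) are illustrative only and do not appear in the answer.

```r
# Var1 values for ID = 1, taken from the example data
x <- c(10L, 20L, 30L, 10L)

# TRUE where the value is below half of the group maximum
below_half <- x < max(x) / 2

# TRUE from the row where the running maximum equals the group maximum
peak_reached <- cummax(x) == max(x)

# A drop only counts once the maximum has been reached
drop_here <- below_half & peak_reached

# Shift up by one row so the flag sits on the row before the drop;
# the last row has no successor and is padded with 0
flag_next <- c(drop_here[-1], 0)

print(flag_next)
```

The `cummax` condition is what keeps the first row (10, also below 15) from being flagged: at that point the group maximum has not yet occurred.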

Data

> dput(df)
structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L), Var1 = c(10L, 20L, 30L, 10L, 4L, 5L, 10L, 5L, 1L, 15L,
20L, 9L)), class = "data.frame", row.names = c(NA, -12L))

Edit (for updated post)

f <- function(v) {
  u1 <- c(replace(v,!is.na(v),0),0)[-1]
  v[is.na(v)] <- v[which(is.na(v))-1]
  u2 <- c((v<max(v)/2 & cummax(v)==max(v))[-1],0)
  u1+u2
}

within(df,Var2 <- ave(Var1,ID,FUN = f))

such that

> within(df,Var2 <- ave(Var1,ID,FUN = f))
   ID Var1 Var2
1   1   10    0
2   1   20    0
3   1   30    1
4   1   10    0
5   2    4    0
6   2    5    0
7   2   10    0
8   2    5    0
9   3    1    0
10  3   15    0
11  3   20    1
12  3    9    0
13  4   18    0
14  4   32   NA
15  4   NA    1
16  4   12    0
17  5    2   NA
18  5   NA    0
19  5    8    1
20  5    3    0
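
As a rough walk-through of what `f` does, the sketch below runs its steps on the Var1 values for ID = 4. The name `v_filled` is introduced here only for illustration; inside `f` the vector `v` is overwritten in place.

```r
# Var1 values for ID = 4, taken from the updated example data
v <- c(18L, 32L, NA, 12L)

# u1: NA wherever the next row of Var1 is NA, 0 otherwise
u1 <- c(replace(v, !is.na(v), 0), 0)[-1]

# Fill each NA with the value from the row directly before it
# (this relies on the NA not being in the first row of the group
# and on there being no two NAs in a row)
v_filled <- v
v_filled[is.na(v)] <- v[which(is.na(v)) - 1]

# u2: the same drop indicator as in the first version, on the filled values
u2 <- c((v_filled < max(v_filled) / 2 & cummax(v_filled) == max(v_filled))[-1], 0)

# NA + 0 stays NA, so the NA from u1 carries through to the result
print(u1 + u2)
```

Note that the fill step only looks one row back. If a group contains two consecutive NAs, the second one remains NA after filling, `max()` then returns NA, and the comparisons for that group become NA as well. Whether the asker's full data contains such groups is not something the thread confirms.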

Data

df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L), Var1 = c(10L, 20L, 30L,
10L, 4L, 5L, 10L, 5L, 1L, 15L, 20L, 9L, 18L, 32L, NA, 12L, 2L,
NA, 8L, 3L)), class = "data.frame", row.names = c(NA, -20L))
ThomasIsCoding
  • Thanks a lot for your answer, it gives the expected results. However there is one issue: Some groups contain a few NAs in Var1 and the code returns the complete column Var2 as NA if there is one NA. I tried solving it with an ifelse statement but it keeps returning an error. Is there an easy, direct way to fix this? – philipp.kn_98 Jul 29 '20 at 20:59
  • @philipp.kn_98 You are welcome. Could you provide a small example that has NAs in Var1 as well as the expected result? Then I would have a try again – ThomasIsCoding Jul 29 '20 at 21:10
  • Just edited the question to provide the information. What happens so far is that the code returns NA for the entire column in group 4 & 5, while the output needed just kinda shifts the NAs one row up (logical as the question asked is: Does the next row score below XY threshold?). Thank you for your effort!! – philipp.kn_98 Jul 30 '20 at 07:49
  • @philipp.kn_98 Could you explain a bit why `Var2` for `ID=4` and `ID = 5` are like that? I have no clue about its logic. For example, when `ID = 4`, the row `Var1 = 32` gives `Var2 = NA`, but why `Var2 = 1` when the next row has `Var1 = NA`? – ThomasIsCoding Jul 30 '20 at 08:31
  • Sure: Var2 is supposed to indicate what happens in the next row in Var1. So when Var2 takes on 1, it means that in the next row the value of Var1 falls below 50% of its maximum. If there is an NA in Var1, Var2 can't indicate what happens in the next row in Var1, because what happens is not given in the data frame. However, if Var1 = NA, Var2 does not have to be NA if Var1 takes on a value in the following row. To wrap it up: as Var2 is always based on Var1's value in the next row, it should also consider NAs in the next row and not in the given row. – philipp.kn_98 Jul 30 '20 at 08:48
  • @philipp.kn_98 When you say a maximum within a group, do you mean a maximum after omitting `NA`? Also, as you said, "if Var1 = NA, Var2 does not have to be NA, if Var1 takes on a value in the following row.", then why `Var2 = 1` when `Var1 = NA` in `ID = 4` but `Var2 = 0` when `Var1 = NA` in `ID = 5`? Do you just skip those `NA`s in `Var1`? – ThomasIsCoding Jul 30 '20 at 09:39
  • @philipp.kn_98 I updated my answer, please check if that works for you – ThomasIsCoding Jul 30 '20 at 10:09
  • Unfortunately it still returns NA for the entire column for the groups that contain an NA in Var1 – philipp.kn_98 Jul 31 '20 at 10:46
  • @philipp.kn_98 Sorry that I have no clue what happened or how to fix the issue – ThomasIsCoding Jul 31 '20 at 21:46