1

I have a question regarding data preparation. I have the following data set (in long format; one row per measurement point, therefore several rows per person):

dd <- read.table(text=
"ID    time
1      -4
1      -3
1      -2
1      -1
1       0
1       1
2      -3
2      -1
2       2
2       3
2       4
3      -3
3      -2
3      -1
4      -1
4       1   
4       2
4       3
5       0
5       1
5       2
5       3
5       4", header=TRUE) 

Now I would like to create a new variable that has a 1 in the row, in which a sign change on the time variable happens for the first time for this person, and a 0 in all other rows. If a person has only negative values on time, the should not be any 1 on the new variable. For a person that has only positive values on time, the first row should have a 1 on the new variable and all other rows should be coded with 0. For my example above the new data frame should look like this:

dd <- read.table(text=
"ID    time   new.var
1      -4     0
1      -3     0
1      -2     0
1      -1     0
1       0     1
1       1     0
2      -3     0
2      -1     0
2       2     1
2       3     0
2       4     0
3      -3     0  
3      -2     0
3      -1     0
4      -1     0
4       1     1  
4       2     0
4       3     0
5       0     1
5       1     0
5       2     0
5       3     0
5       4     0", header=TRUE) 

Does anyone know how to do this? I thought about using dplyr and group_by, however I am pretty new to R and did not make it. Any help is much appreciated!

divibisan
  • 11,659
  • 11
  • 40
  • 58
Mary B.
  • 125
  • 7
  • Also related: https://stackoverflow.com/q/45154687/8366499, https://stackoverflow.com/q/48378653/8366499 – divibisan Jun 24 '20 at 16:13

3 Answers3

0

You can try this:

library(dplyr)
dd %>% left_join(dd %>% group_by(ID) %>% summarise(index=min(which(time>=0)))) %>%
  group_by(ID) %>% mutate(new.var=ifelse(row_number(ID)==index,1,0)) %>% select(-index)-> DF

# A tibble: 23 x 3
# Groups:   ID [5]
      ID  time new.var
   <int> <int>   <dbl>
 1     1    -4       0
 2     1    -3       0
 3     1    -2       0
 4     1    -1       0
 5     1     0       1
 6     1     1       0
 7     2    -3       0
 8     2    -1       0
 9     2     2       1
10     2     3       0
Duck
  • 39,058
  • 13
  • 42
  • 84
0

There are 2 different operations you want done to create new.var, so you need to do them in 2 steps. I'll break this into 2 separate mutate calls for simplicity, but you can put both of them into the same mutate

First, we group by ID and then find the rows where the sign changes. We need to use time >= 0 instead of sign as recommended in this answer: R identifying a row prior to a change in sign because you want a sign change to be counted only when going from -1 <-> 0, not from 0 <-> 1:

library(tidyverse)
dd2 <- dd %>%
    group_by(ID) %>%
    mutate(new.var = as.numeric((time >= 0) != (lag(time) >= 0)))

dd2
# A tibble: 23 x 3
# Groups:   ID [5]
      ID  time new.var
   <int> <int>   <dbl>
 1     1    -4      NA
 2     1    -3       0
 3     1    -2       0
 4     1    -1       0
 5     1     0       1
 6     1     1       0
 7     2    -3      NA
 8     2    -1       0
 9     2     2       1
10     2     3       0
# … with 13 more rows

Then we use case_when to modify the first row based on your desired rules. Due to the way lag works, the first row will always have NA (since there is no previous row to look at) which makes it a good way to pick out that first row to change it based on the time values in that group:

dd3 <- dd2 %>%
    mutate(new.var = case_when(
               !is.na(new.var) ~ new.var,
               all(time >= 0) ~ 1,
               TRUE ~ 0)
    )

print(dd3, n = 100) #n=100 because tibbles are truncated to 10 rows by print

# A tibble: 23 x 3
# Groups:   ID [5]
      ID  time new.var
   <int> <int>   <dbl>
 1     1    -4       0
 2     1    -3       0
 3     1    -2       0
 4     1    -1       0
 5     1     0       1
 6     1     1       0
 7     2    -3       0
 8     2    -1       0
 9     2     2       1
10     2     3       0
11     2     4       0
12     3    -3       0
13     3    -2       0
14     3    -1       0
15     4    -1       0
16     4     1       1
17     4     2       0
18     4     3       0
19     5     0       1
20     5     1       0
21     5     2       0
22     5     3       0
23     5     4       0
divibisan
  • 11,659
  • 11
  • 40
  • 58
0

The following ave instruction does what the question asks for.

dd$new.var <- with(dd, ave(time, ID, FUN = function(x){
  y <- integer(length(x))
  if(any(x >= 0)) y[which.max(x[1]*x <= 0)] <- 1L
  y
}))

dd
#   ID time new.var
#1   1   -4       0
#2   1   -3       0
#3   1   -2       0
#4   1   -1       0
#5   1    0       1
#6   1    1       0
#7   2   -3       0
#8   2   -1       0
#9   2    2       1
#10  2    3       0
#11  2    4       0
#12  3   -3       0
#13  3   -2       0
#14  3   -1       0
#15  4   -1       0
#16  4    1       1
#17  4    2       0
#18  4    3       0
#19  5    0       1
#20  5    1       0
#21  5    2       0
#22  5    3       0
#23  5    4       0

If the expected output is renamed dd2 then

identical(dd, dd2)
#[1] TRUE
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66