0

I'm using a longitudinal survey in long format, and I'm trying to create a dummy variable for if an individual has NOT got a college degree by the age of 25. My data looks something like this:

 ID   CYRB   VAR      VALUE
 1    1983   DEG98    1
 1    1983   DEG00    1 
 1    1983   DEG02    1
 1    1983   DEG04    0
 2    1979   DEG08    0
 2    1979   DEG00    0
 2    1979   DEG02    1
 2    1979   DEG04    1
 3    1978   DEG98    NA
 3    1978   DEG00    NA
 3    1978   DEG02    NA
 3    1978   DEG04    0

As I've tried to illustrate, there are quite a few missing data points for survey responses in the relevant years. But clearly if the respondent responds no in later years it can be inferred that they didn't have a degree when they were <25 either.

Trying to be as general as possible, how can I create a new variable that depends on all the variable values of just one individual, i.e. for ID = 1, 2, 3 etc.?

Sorry if I'm not clear!

Edit:

Sorry my fault, the data used to be in wide format and the variables denote whether the respondent has a college degree in 1998, 2000, 2002 etc. (with value denoting the response 1 == TRUE, 0 == FALSE), CYRB is indeed year of birth, the table edited for the expected output of my desired dummy variable would be:

 ID   CYRB   VAR      VALUE   DUMMY
 1    1983   DEG98    0       0
 1    1983   DEG00    0       0 
 1    1983   DEG02    0       0
 1    1983   DEG04    1       0
 2    1979   DEG08    0       0
 2    1979   DEG00    0       0
 2    1979   DEG02    1       0
 2    1979   DEG04    1       0
 3    1978   DEG98    NA      1
 3    1978   DEG00    NA      1
 3    1978   DEG02    NA      1
 3    1978   DEG04    0       1

i.e. if the respondent replies in any survey from the age of 25 onwards that he/she does not have a college degree the dummy takes the value of 1.

Hope this is a bit clearer.

Milhouse
  • 177
  • 3
  • 11
  • 3
    You are not very clear actually. What is your expected output? Is CYRB year of birth? What is `DEG98` Vs `DEG00` for ID = 1?, what does `VALUE` variable represent? – Sotos Aug 05 '16 at 13:28
  • What are the values of the new variable supposed to be for ID = 1, 2, 3, given the data in your example? – Weihuang Wong Aug 05 '16 at 13:52
  • Sorry I was really sloppy the first time around, edited for clarity! – Milhouse Aug 05 '16 at 14:37
  • ID 2 reports 0 for 2000, 1 for 2002 and 2004, and 0 again for 2008. By your logic, the `DUMMY` should be 1, no? Or is there a typo in the first row for ID 2, and it should be "DEG98" not "DEG08"? – Weihuang Wong Aug 05 '16 at 14:41

1 Answers1

1

Assuming you meant "DEG98" in the first row for ID 2:

First, recover the respondent's age:

d$survey_year <- as.numeric(sapply(d$VAR, substring, 4, 5))
d$survey_year <- ifelse(d$survey_year<20, 2000+d$survey_year, 1900+d$survey_year)
d$age <- d$survey_year - d$CYRB

Use the any() function to test your criteria:

degree <- data.frame(DUMMY=c(
    by(d, d$ID, function(x) any(x$VALUE==0 & x$age>25))))
degree$ID <- rownames(degree)

Combine the dummy values with the original dataframe:

out <- merge(d[,c("ID", "CYRB", "VAR", "VALUE")], degree, all.x=TRUE)

Output:

> out
   ID CYRB   VAR VALUE DUMMY
1   1 1983 DEG98     0 FALSE
2   1 1983 DEG00     0 FALSE
3   1 1983 DEG02     0 FALSE
4   1 1983 DEG04     1 FALSE
5   2 1979 DEG98     0 FALSE
6   2 1979 DEG00     0 FALSE
7   2 1979 DEG02     1 FALSE
8   2 1979 DEG04     1 FALSE
9   3 1978 DEG98    NA  TRUE
10  3 1978 DEG00    NA  TRUE
11  3 1978 DEG02    NA  TRUE
12  3 1978 DEG04     0  TRUE

EDIT: A more parsimonious solution using the dplyr package. First, write a getYear() function to convert DEGxx to the actual year:

getYear <- function(x) {
    x <- as.numeric(substring(x, 4, 5))
    ifelse(x<16, 2000+x, 1900+x)
}

Then transform the dataset:

library(dplyr)
d %>% group_by(ID) %>%
  mutate(survey_year=getYear(VAR),
    age=survey_year - CYRB,
    DUMMY=any(VALUE==0 & age>25))

Output:

Source: local data frame [12 x 7]
Groups: ID [3]

      ID  CYRB    VAR VALUE DUMMY survey_year   age
   (int) (int) (fctr) (int) (lgl)       (dbl) (dbl)
1      1  1983  DEG98     0 FALSE        1998    15
2      1  1983  DEG00     0 FALSE        2000    17
3      1  1983  DEG02     0 FALSE        2002    19
4      1  1983  DEG04     1 FALSE        2004    21
5      2  1979  DEG98     0 FALSE        1998    19
6      2  1979  DEG00     0 FALSE        2000    21
7      2  1979  DEG02     1 FALSE        2002    23
8      2  1979  DEG04     1 FALSE        2004    25
9      3  1978  DEG98    NA  TRUE        1998    20
10     3  1978  DEG00    NA  TRUE        2000    22
11     3  1978  DEG02    NA  TRUE        2002    24
12     3  1978  DEG04     0  TRUE        2004    26
Weihuang Wong
  • 12,868
  • 2
  • 27
  • 48