0

I am quite new to R. I have a dataset with 8081 observations for 113 variables. The data was collected in 4 waves (panels), with some individuals being interviewed multiple times. They were sometimes asked the same questions, but some questions were only asked during one wave. Most answers were on a scale (e.g. how much do you agree) or a binary. Each individual has a unique numeric ID so I know which rows to collapse. My dependent variable was only surveyed in wave 4.

The data looks something like this:

df <- data.frame (ID  = c(1, 1, 2, 2, 2, 2, 3, 4, 4), PANEL = c(1, 4, 1, 2, 3, 4, 2, 3, 4),
AGE = c(68, 68, 52, 52, 52, 52, 43, 33, 33), Q4 = c(2, 2, 1, 1, 1, 1, 2, 2, 1),
Q4_1 = c(2, 2, 1, 1, 1, 1, 2, 2, 1), Q4_1 = c(2, NA, NA, 3, NA, 3, 2, 3, NA),
Q5 = c(10, 10, 8, 9, 8, 7, 6, 6, 5))

df

  ID PANEL AGE Q4 Q4_1 Q4_1.1 Q5
1  1     1  68  2    2      2 10
2  1     4  68  2    2     NA 10
3  2     1  52  1    1     NA  8
4  2     2  52  1    1      3  9
5  2     3  52  1    1     NA  8
6  2     4  52  1    1      3  7
7  3     2  43  2    2      2  6
8  4     3  33  2    2      3  6
9  4     4  33  1    1     NA  5

etc...

Sometimes each individual's answers are the same across waves, but not always. I do not need to know how the answers vary in time and the waves were relatively close in time. I just need a profile of each individual surveyed.

Ideally, for each individual I'd want to keep the answers given in wave 4 (if they took part in it), but substituting the NA answers with what they answered in previous waves. I was wondering if there is any way to do this without going through every individual one by one, given the amount of data. I'll also have to remove data for individuals who did not take part in wave 4 at all.

If successful, the chunk of data above would end up looking something like this (+ I'll remove the panel column later):

   ID  PANEL  AGE  Q4  Q4_1 Q5
1   1     4    68   2    2  10 
2   2     4    52   1    3   7 
3   4     4    33   1    3   5

etc...

I've been looking into dplyr's summarise() function but it doesn't seem like I can be that specific with what I need to merge and not merge. It wouldn't be a problem if some of the answers were merged by getting an average of the individual's responses across waves, but that would not work for the binary answers if the individual changed their mind in between waves.

ouroboro
  • 45
  • 4

1 Answers1

1

You may be looking for tidyr::fill().. You can do the following on your provided example dataset:

df %>% group_by(ID) %>% fill(., starts_with("Q")) %>% filter(PANEL == 4)

Output:

  ID PANEL AGE Q4 Q4_1 Q4_1.1 Q5
1  1     4  68  2    2      2 10
2  2     4  52  1    1      3  7
3  4     4  33  1    1      3  5
langtang
  • 22,248
  • 1
  • 12
  • 27
  • Thanks for your reply, @langtang! Does this fill the NA answers with the value from the row directly above (just wondering in case someone only participated in wave 4, would their answers be mixed with someone else's)? Also not all of my variables actually start with Q that's just the first 4. It's still a good solution as I have groups of variables starting with the same letter/few letters, but is there any faster way? – ouroboro Mar 07 '23 at 14:33
  • adding `group_by()` will ensure that the fill (which, yes, is done from above (see `.direction` param in the `fill()` function) is done only within ID.. If you can identify all your target question columns in a vector, say `target_cols`, you can use `fill(., all_of(target_cols))` – langtang Mar 07 '23 at 14:37