I am quite new to R. I have a dataset with 8081 observations for 113 variables. The data was collected in 4 waves (panels), with some individuals being interviewed multiple times. They were sometimes asked the same questions, but some questions were only asked during one wave. Most answers were on a scale (e.g. how much do you agree) or a binary. Each individual has a unique numeric ID so I know which rows to collapse. My dependent variable was only surveyed in wave 4.
The data looks something like this:
df <- data.frame (ID = c(1, 1, 2, 2, 2, 2, 3, 4, 4), PANEL = c(1, 4, 1, 2, 3, 4, 2, 3, 4),
AGE = c(68, 68, 52, 52, 52, 52, 43, 33, 33), Q4 = c(2, 2, 1, 1, 1, 1, 2, 2, 1),
Q4_1 = c(2, 2, 1, 1, 1, 1, 2, 2, 1), Q4_1 = c(2, NA, NA, 3, NA, 3, 2, 3, NA),
Q5 = c(10, 10, 8, 9, 8, 7, 6, 6, 5))
df
ID PANEL AGE Q4 Q4_1 Q4_1.1 Q5
1 1 1 68 2 2 2 10
2 1 4 68 2 2 NA 10
3 2 1 52 1 1 NA 8
4 2 2 52 1 1 3 9
5 2 3 52 1 1 NA 8
6 2 4 52 1 1 3 7
7 3 2 43 2 2 2 6
8 4 3 33 2 2 3 6
9 4 4 33 1 1 NA 5
etc...
Sometimes each individual's answers are the same across waves, but not always. I do not need to know how the answers vary in time and the waves were relatively close in time. I just need a profile of each individual surveyed.
Ideally, for each individual I'd want to keep the answers given in wave 4 (if they took part in it), but substituting the NA answers with what they answered in previous waves. I was wondering if there is any way to do this without going through every individual one by one, given the amount of data. I'll also have to remove data for individuals who did not take part in wave 4 at all.
If successful, the chunk of data above would end up looking something like this (+ I'll remove the panel column later):
ID PANEL AGE Q4 Q4_1 Q5
1 1 4 68 2 2 10
2 2 4 52 1 3 7
3 4 4 33 1 3 5
etc...
I've been looking into dplyr's summarise() function but it doesn't seem like I can be that specific with what I need to merge and not merge. It wouldn't be a problem if some of the answers were merged by getting an average of the individual's responses across waves, but that would not work for the binary answers if the individual changed their mind in between waves.