0

I'm actually not even quite sure how to ask this question, so please bear with me.

I've noticed that there is an error in the dataset I'm working with, the ANES Cumulative File. For one of the years in the dataset (2004), the values for one variable (which I renamed "grewup") were left out by accident, so it just says "NA" for that year. The values are there for the other years, so the dataset essentially looks something like this:

id   year   grewup
1    2002   127
2    2002   310
3    2004   NA
4    2004   NA
5    2008   332
6    2008   614

I do have another dataset that consists of just 2004 and has the missing values for "grewup". What I would like to do is recode the NAs for 2004 using the values from the second dataset. How would I do that? Again, the values are in the cumulative dataset for the rest of the years; I just want to recode for 2004 and leave the rest of the values alone.

Thanks.

SOME CLARIFICATIONS AND ADDITIONS:

  • I would like to only bring this one variable from the second dataset to avoid making the first dataset even more massive and memory-draining than it already is (951 columns). There are actually dozens of other variables they I either already have

  • Also, while all values for 2004 are NAs, not every NA in the dataset is for 2004. There are some in there for other years that were legitimately missing values.

  • do they have the same number of observations for each year? post the data frames – rawr Apr 09 '14 at 22:43
  • No, neither dataset has the same number of observations for each year. In the first dataset, the cases per year range from about 200 to over 2,700. The second dataset is just for 2004. (If you were asking if they have the same number of observations for 2004, then yes they do.) – user3517230 Apr 09 '14 at 23:09
  • oh that makes it much simpler, see the answer below – rawr Apr 09 '14 at 23:10

1 Answers1

0

You ought to be able to merge those dataframes by id and year:

 merge(dat1,dat2,by=c("id", "year"),all.x=TRUE)  # and "outer join"
  id year grewup.x grewup.y
1  1 2002      127       NA
2  2 2002      310       NA
3  3 2004       NA      438
4  4 2004       NA      834
5  5 2008      332       NA
6  6 2008      614       NA
 datm <- merge(dat1,dat2,by=c("id", "year"),all.x=TRUE)

 # No "fill in the blanks
 datm[is.na(datm$grewup.x), "grewup.x"] <- datm[is.na(datm$grewup.x), "grewup.y"] 
 # Notice that the logical index is the same on both sides of the assignment

 datm[ ! names(datm) %in% 'grewup.y' ]  # drop the supplementary column

  id year grewup.x
1  1 2002      127
2  2 2002      310
3  3 2004      438
4  4 2004      834
5  5 2008      332
6  6 2008      614
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • why? if the data is structured that simply, just `dat1[dat1$year %in% 2004, ]$grewup <- dat2[dat2$year %in% 2004, ]$grewup` will do – rawr Apr 09 '14 at 23:09
  • Only if the ids are exactly the same and in the same order. I saw nothing to ensure that. – IRTFM Apr 09 '14 at 23:17
  • That makes sense I think, but I do have a question: Would this merge every variable from the two datasets together? The example of the data I used was a very, very slimmed down version. There are actually 951 variables in the first dataset (and 49,760 observations) while the second dataset has 1,197 variables. – user3517230 Apr 09 '14 at 23:31
  • Well, it would create a much wider dataframe (951+1197-2) columns. And it would require using lapply with a function designed to work on pairs of column names. Moral: You should always present a dataset of sufficient complexity. – IRTFM Apr 09 '14 at 23:36