I realise that there are already plenty of questions on merging datasets, but I have only just started using R and have difficulty understanding some of the answers, especially when I try to apply them to my own situation.

I have around 80 Stata datasets that I want to merge. They all have the variables var1 and var2 in common, but can differ in their other variables (and in the number of variables). I read that I need to create a list of my datasets first. When creating a list of foreign datasets, do I also need to read them into R using read.dta?

I'm trying to do this by:

temp = list.files(pattern="*.dta")
#Loop through all of the databases
for (i in 1:length(temp)) {
  list <- read.dta13(temp[i], nonint.factors = TRUE)
}

But I'm getting the feeling that I'm doing this wrong.

Once I have a list of datasets, do I then use merge_all(list, by=c("var1","var2"))?

Oscar
  • Use lapply to loop through input files: `myStataList <- lapply(list.files(pattern="*.dta"), read.dta13, nonint.factors = TRUE)` – zx8754 Jul 04 '16 at 11:12
  • @zx8754 Thank you! When I try to use merge_all, it takes a million years, got any ideas there? – Oscar Jul 04 '16 at 11:24
  • This would depend on the size of your data and your available memory. If they all have the same structure, why not just rbind? Or do you have to merge them on the key columns var1 and var2? Test with a subset: `lapply(list.files(pattern="*.dta")[1:4], ...` – zx8754 Jul 04 '16 at 11:25
  • When I tried to use merge_all on the entire dataset I got ``In `is.na<-.default`(`*tmp*`, value = zap) : Reached total allocation of 8072Mb: see help(memory.size)``, and when I tested it with a subset I got ``Error in `[.data.frame`(df, , match(names(dfs[[1]]), names(df))) : undefined columns selected.`` Yes I have to merge them on key columns var1 and var2. – Oscar Jul 04 '16 at 11:51
  • Simply use do.call(): `do.call(rbind, myStataList)` – Parfait Jul 06 '16 at 02:07

0 Answers