14

I would like to merge a large set of dataframes (about 30), which each have about 200 variables. These datasets are very much alike but not identical.

Please find two example dataframes below:

library(data.table)
library(haven)
df1 <- fread(
    "A   B   C  iso   year   
     0   B   1  NLD   2009   
     1   A   2  NLD   2009   
     0   Y   3  AUS   2011   
     1   Q   4  AUS   2011   
     0   NA  7  NLD   2008   
     1   0   1  NLD   2008   
     0   1   3  AUS   2012",
  header = TRUE
)
df2 <- fread(
    "A   B   D  E  iso   year   
     0   1   1  NA ECU   2009   
     1   0   2  0  ECU   2009   
     0   0   3  0  BRA   2011   
     1   0   4  0  BRA   2011   
     0   1   7  NA ECU   2008   
     1   0   1  0  ECU   2008   
     0   0   3  2  BRA   2012   
     1   0   4  NA BRA   2012",
  header = TRUE
)

To recreate the error:

class(df2$B) <- "anything"

When I do the following

df_merged <- rbindlist(list(df1, df2), fill=TRUE, use.names=TRUE)

The dataset gives the error:

Error in rbindlist(list(df1, df2), fill = TRUE, use.names = TRUE) : 
  Class attribute on column 2 of item 2 does not match with column 2 of item 1.

What can I do to either:

  1. Make rbindlist skip the column which does not match and add some suffix.
  2. Change the class of one of the columns to the other one.

Desired result for option 1:

df_merged <- fread(
    "A   B  B.x  C  D   E   iso   year   
     0   A   NA  1  NA  NA  NLD   2009   
     1   Y   NA  2  NA  NA  NLD   2009   
     0   Q   NA  3  NA  NA  AUS   2011   
     1   NA  NA  4  NA  NA  AUS   2011   
     0   0   NA  7  NA  NA  NLD   2008   
     1   1   NA  1  NA  NA  NLD   2008   
     0   1   NA  3  NA  NA  AUS   2012   
     0   NA  1   NA  1  NA  ECU   2009   
     1   NA  0   NA  2  0   ECU   2009   
     0   NA  0   NA  3  0   BRA   2011   
     1   NA  0   NA  4  0   BRA   2011   
     0   NA  1   NA  7  NA  ECU   2008   
     1   NA  0   NA  1  0   ECU   2008   
     0   NA  0   NA  3  2   BRA   2012   
     1   NA  0   NA  4  NA  BRA   2012",
   header = TRUE
)

Desired result for option 2:

df_merged <- fread(
    "A   B   C  D   E   iso   year   
     0   3   1  NA  NA  NLD   2009   
     1   4   2  NA  NA  NLD   2009   
     0   5   3  NA  NA  AUS   2011   
     1   5   4  NA  NA  AUS   2011   
     0   0   7  NA  NA  NLD   2008   
     1   1   1  NA  NA  NLD   2008   
     0   1   3  NA  NA  AUS   2012   
     0   1   NA  1  NA  ECU   2009   
     1   0   NA  2  0   ECU   2009   
     0   0   NA  3  0   BRA   2011   
     1   0   NA  4  0   BRA   2011   
     0   1   NA  7  NA  ECU   2008   
     1   0   NA  1  0   ECU   2008   
     0   0   NA  3  2   BRA   2012   
     1   0   NA  4  NA  BRA   2012",",
   header = TRUE
)
Tom
  • 2,173
  • 1
  • 17
  • 44
  • 1
    It doesn't give me any error and gives your desired result option 2. – Ronak Shah Apr 16 '19 at 11:06
  • Did you see this part: `class(df2$B) <- "haven_labelled"` ? – Tom Apr 16 '19 at 11:09
  • yes, still. no error. Don't know what might be the issue at either end. – Ronak Shah Apr 16 '19 at 11:26
  • That is really weird. I copied it from the post (to retest it) before I replied to you.. – Tom Apr 16 '19 at 11:28
  • Could it be because you don't have the library(haven) package installed? – Tom Apr 16 '19 at 11:31
  • I can't reproduce your issue either. What's the sense to load `haven` here at all? – jay.sf Apr 16 '19 at 11:57
  • I am using the `library(haven)` because my databases are from stata and I want to retain the labels. For some columns it creates a separate class called `haven_labelled`. I assumed that it might work differently when the package `haven` is not installed. – Tom Apr 16 '19 at 12:08
  • @RonakShah and jay.sf I have changed the class to a random not existing class, would you mind trying again with the edited example? I still got the same error, for both my dataset and the example and on different machines. – Tom Apr 18 '19 at 12:11
  • I just tried again. Still no error and I get your desired result option 2. Steps I did - 1) Copied `df1` and `df2`. 2) Changed the class `class(df2$B) <- "anything"` and 3) Ran `rbindlist(list(df1, df2), fill=TRUE, use.names=TRUE)` . My `packageVersion('data.table') #[1] ‘1.12.0’` and `packageVersion('haven') [1] ‘2.1.0’` – Ronak Shah Apr 18 '19 at 13:37
  • This is so frustrating, but thank you very much for trying. – Tom Apr 18 '19 at 13:39
  • 1
    I do get the same issue @Tom! And would be very pleased if someone comes with the solution 2. – EdM Jan 07 '20 at 19:16

4 Answers4

5

I came up with this inelegant solution that bypasses the problem. Basically, What I am doing is to assign the attributes of the columns of the first item of the list to the columns with the same names of all the other items of the list. Keep in mind that this solution is problematic and, depending on the project, it could be a very wrong practice as it has the potential to mess up your data. However, if what you need is to use rbindlist to combine your dataframes, this makes the trick


dfs <- list(df1, df2)
varnames <- names(dfs[[1]]) # variable names
vattr <- purrr::map_chr(varnames, ~class(dfs[[1]][[.x]])) # variable attributes

for (i in seq_along(dfs)) {
  # assign the same attributes of list 1 to the rest of the lists
  for (j in seq_along(varnames)) {
    if (varnames[[j]]  %in% names(dfs[[i]])) {
      class(dfs[[i]][[varnames[[j]]]]) <- vattr[[j]]
    } 
  }
}


df_merged <- data.table::rbindlist(dfs, fill=TRUE, use.names=TRUE)

Best,

  • Thank you very much for posting this! – Tom Jan 14 '20 at 16:16
  • I have been using your solution, but I am now getting the error: `Warning messages: 1: In class(All[[i]][[varnames[[j]]]]) <- vattr[[j]] : NAs introduced by coercion` (four of them). I cannot really figure out where or why they happen. But I was wondering, do they just make the class NA? Or do they make the actual variable NA? – Tom Mar 13 '20 at 17:20
  • Just noting that the attributes line won't work if you have a variable with multiple classes (common with dates). You can do something like `vattr <- purrr::map_chr(varnames, ~class(data_list[[1]][[.x]])[1])` to make this work. – James Martherus Feb 16 '23 at 18:28
4

An even more inelegant solution to @R.Andres Castaneda's answer is:

when using fread set colClasses = "character" to coerce everything to character, do the rbindlist, then choose your method of preference for coercing everything back to something sensible afterwards.

It's not elegant but I often find this the easiest method when reading in inconsistent data.

Mooks
  • 593
  • 4
  • 12
3

Try plyr::ldply(list, data.frame) as a work around. Worked for me, rbindlist() didn't like a date column.

Update:

library(plyr) 
my_list <- list("apple", "banana", "cherry")
my_dataframe <- plyr::ldply(my_list, data.frame)
names(my_dataframe) <- "Fruit"
rm1104
  • 193
  • 3
  • 8
0
df_merged <- plyr::rbind.fill(list(df1, df2))