2

I have a large number of data frames to merge. Each has several hundred columns. I want to identify all mismatching column names before doing this. So far I can generate a list of mismatches, but format is terrible and I can't quite figure out how to tell which data frame they come from.

#create data
df1 <- data.frame("col1" = 3:4, "Age" = c(22,16), "Name" = c("James","Jim"))
df2 <- data.frame("col1" = 3:4, "Age" = c(18,19), "Name" = c("Mike","Mia"))
df3 <- data.frame("mismatch_col_name_1" = 1:2, "Age" = c(21,15), "name" = c("John","Dora"))
df4 <- data.frame("mismatch_col_name_2" = 1:2, "Age" = c(21,15), "Name" = c("John","Dora"))
files <- list(df1, df2, df3, df4)

# find mismatched column names
mismatches <- NULL
for (i in 1:(length(files) - 1)) {
  mismatches <- c(mismatches, setdiff(colnames(files[[i]]), colnames(files[[i+1]])))
}
mismatches <- c(mismatches, setdiff(colnames(files[[length(files)]]), colnames(files[[1]])))
print(mismatches)

[1] "col1"                "Name"                "mismatch_col_name_1" "name"               
[5] "mismatch_col_name_2"

Desired output would be something like:

"df3" "mismatch_col_name_1" "name"

"df4" "mismatch_col_name_2" "Name"

Or even df names and column numbers. Interested in any solutions or better ways to do this.

marcel
  • 389
  • 1
  • 8
  • 21
  • Can we be pretty confident that the position of the columns to always be consistent across all the files? (e.g. anything that looks like **name** will always be the 3rd column??) Do you know what the all the correct column names should be for at least one file? (i.e. in this example we know before hand that **col1**, **Age**, and **Name**) – Jas Oct 19 '19 at 08:58
  • I think the key question here is the last one @Jas asked, namely, do you have a master list of the desired/correct column names, or do you want to look for differences, period? – ulfelder Oct 19 '19 at 09:40

1 Answers1

2

Here's an approach that gets you to a list (in the R sense) containing mismatches for each file. It is predicated on the assumption that you know the "true" set of names against which to compare each file.

lapply(files, function(x) {

    # vector of desired names
    master <- c('col1', 'Age', 'Name')

    # use 'match' to compare this df's names to the master. the order of the
    # cols won't matter; if the name in x appears in master, 'match' will return
    # an integer indicating the position of the col with that name in x.
    comparison <- match(names(x),  master)

    # if all col names in x appear in master, you get all integers, so: NULL
    if (!any(is.na(comparison))) {

        NULL

    # if names in x don't appear in master, you get an NA from 'match', so here you
    # create a vector of the names in x that aren't in master. You could also capture
    # their position here if that's helpful.
    } else {

        mismatches <- names(x)[which(is.na(comparison))]

    }

})

Result:

[[1]]
NULL

[[2]]
NULL

[[3]]
[1] "mismatch_col_name_1" "name"               

[[4]]
[1] "mismatch_col_name_2"

There are various ways you could then organize or summarize the contents of this list, but that's mostly a matter of formatting.

ulfelder
  • 5,305
  • 1
  • 22
  • 40
  • Many thanks, I think this approach will work. In addition, is there an approach that can handle the situation where a column is simply missing from the master list (i.e. one of the datasets has a smaller number of columns)? – marcel Oct 19 '19 at 14:58
  • This approach works fine in that situation, too. Try `match(c(1,2), rev(1:4))` to see what happens when the object you're matching has fewer elements than the reference object. – ulfelder Oct 20 '19 at 10:09