19

I write code that runs in high throughput production environments expected to handle a variety of input from multiple users. In this environment, I often need to join (with dplyr) two multi-column dataframes with join columns of mismatched types which produces this error. "Can't join on ... because of incompatible types"

The production code is expected to handle input csvs that are 150+ columns by 0-1000 rows, with 12-20 join columns. I use read.table for speed, and because the content of the 150 columns can vary, allow type to be inferred by data content.

It would be great to have a general-purpose function that matches join column types programmatically, making no assumptions about column name or location.

Sagar Zala
  • 4,854
  • 9
  • 34
  • 62
GGAnderson
  • 1,993
  • 1
  • 14
  • 25
  • What do you mean when you say *join*? There are actually a couple of possibilities https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right – Kevin Cazelles Mar 10 '18 at 23:54
  • Here, join means to use dplyr to join two large dataframes by columns that share a name. – GGAnderson Jan 25 '19 at 18:18

2 Answers2

19

This is a frequently viewed question, so many others must run into the error, so deserves a more complete answer.

The simple solution for correcting this join error is to simply mutate the class of the column(s) causing the problem. This can be done as follows:

  1. glimpse the column classes in the dataframes to be joined
  2. mutate the column class to match using as.numeric, as.logical or as.character. For example:

    df2 <- df2 %>%  
        mutate(column1 = as.numeric(column1))
    

A solution for production environments is in the matchColClasses function shown, which does the following:

  1. Identify columns that share the same name (sharedColNames)
  2. Use the master data frame (df1) to identify the shared columns classes
  3. Reassign column classes in df2 to match df1

    matchColClasses <- function(df1, df2) {
    
      sharedColNames <- names(df1)[names(df1) %in% names(df2)]
      sharedColTypes <- sapply(df1[,sharedColNames], class)
    
      for (n in sharedColNames) {
         class(df2[, n]) <- sharedColTypes[n]
      }
    
      return(df2)
     }
    

This function works well in our production environment, with heterogenous data types; character, numeric and logical.

GGAnderson
  • 1,993
  • 1
  • 14
  • 25
  • 1
    Using `matchColClasses`, I get `object cannot be coerced to type 'double'` . – Etienne Low-Décarie Mar 22 '19 at 13:55
  • @EtienneLow-Décarie can you share more information about the content of the column that is failing? – GGAnderson Mar 22 '19 at 21:09
  • 1
    the error message "object cannot be coerced to a double" most often means there are alpha entries in the column. – GGAnderson Aug 22 '19 at 22:12
  • 2
    This is now available by default in dplyr: https://github.com/tidyverse/dplyr/issues/2355 – cimentadaj Mar 06 '20 at 09:42
  • 1
    It saved me hours. Thank you. When working with labelled data you may add this in the for loop: attributes(df2[,n]) <- attributes(df1[,n]) – Thomas Buhl Sep 29 '20 at 11:39
  • 1
    The above only works for length(sharedColNames) > 1. Use: sharedColTypes <- sapply(df1[,sharedColNames, drop = FALSE], class) – mharinga Feb 16 '21 at 12:01
  • You may may have an issue when the number of shared column is 1. Therefore, using a subset(), a select() or drop = FALSE would fix that. [edit] Just saw a comment above that touches that already – yeahman269 Mar 21 '23 at 12:03
2

If the problem is that the classes of columns are not correctly detected after reading the data - you can specify the classes for each column at the data reading step.

So with read.csv() function you would do it like this:

df1 <- read.csv("data.csv", colClasses=c("numeric", "Date", "character"))
df2 <- read.csv("data.csv", colClasses=c("numeric", "Date", "character"))

After this df1 and df2 will have columns of the same types.

The classes here are just an example, you will have to specify the ones that apply to your data.

Also take a look at help(read.table)

Karolis Koncevičius
  • 9,417
  • 9
  • 56
  • 89
  • Yes, this is quite basic - assign types on read. However, in a production setting with more than 150 columns, where users can add their own this is not really practical. The inference works well, except when input data is just 1 rows and what should be a character column containing "T" is inferred to be logical. – GGAnderson Mar 11 '18 at 01:04
  • @GGAnderson hmm if logical columns is the only problem maybe it's best to simply convert all logical columns to character after reading data? Or does your use cases include having logical columns sometimes? – Karolis Koncevičius Mar 11 '18 at 01:20
  • yes, some of the columns are logical. some are character, some are integer, some are doubles. Running the join candidates through the function shown above does a fairly good job of it. I just don't like having the for loop in an R function. – GGAnderson Mar 11 '18 at 08:02