0

I have a data frame/tibble df1 with each of hundreds of columns/variables painstakingly set to the correct data type (double, char, date, time, logical).

I'm periodically provided a df2 that I need to append to df1. df2 has identical variable names, count, and order as df1, but column data types do not necessarily match those of df1 (due to the source for df2 variables sometimes missing, and therefore not recognized as e.g. date or time). df2 is provided "as is": its import is out of my control.

Is there a (preferably tidyverse) solution for converting every df2 column to the data type of its corresponding df1 column, so that I can go on to merge the data frames with bind_rows etc.? I'm trying to avoid hardcoding if possible.

mrroy
    Does this answer your question? [Create a col\_types string specification for read\_csv based on existing dataframe](https://stackoverflow.com/questions/55249599/create-a-col-types-string-specification-for-read-csv-based-on-existing-dataframe) – Mark Jul 16 '23 at 13:49
  • thx, but not quite since my df2 is not read from file but returned from querying some other source. but looks very convenient for when I _do_ need that functionality some day – mrroy Jul 16 '23 at 15:14

2 Answers

2

Sample data:

mt1 <- mtcars[1:3,]
mt2 <- mtcars[1:3,]
class(mt2$cyl) <- "character"
sapply(mt2, class)
#         mpg         cyl        disp          hp        drat          wt        qsec          vs          am 
#   "numeric" "character"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
#        gear        carb 
#   "numeric"   "numeric" 

Base R

The simplest:

mt2fixed <- Map(function(this, oth) `class<-`(this, class(oth)), mt2, mt1) |>
  as.data.frame()
sapply(mt2fixed, class)
#       mpg       cyl      disp        hp      drat        wt      qsec        vs        am      gear      carb 
# "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 

The use of `class<-` in a single expression is equivalent to a reassignment and returning the updated vector. For instance, these two are equivalent:

`class<-`(vec, newclass)
{ class(vec) <- newclass; vec; }

The biggest difference here is that the first allows a shorter (fewer characters) anon-function, no need for surrounding braces. (Same applies to the dplyr solution below.)
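
As a minimal sketch of that equivalence (the vectors `vec`, `a`, `b` and `bad` are made up for illustration), the two forms give the same result for an implicit class such as "numeric". The last lines illustrate a caveat that may matter for date columns: with an attribute-based class such as "Date", only the class attribute is set and the underlying values are not converted.

```r
vec <- c("6", "4")

# Functional form: returns the modified vector as the value of the call.
a <- `class<-`(vec, "numeric")

# Replacement form on a copy named `b`.
b <- vec
class(b) <- "numeric"

identical(a, b)   # TRUE: both are the double vector c(6, 4)

# Caveat: "Date" is an attribute-based class, so the character data
# are only relabelled, not parsed into dates.
bad <- `class<-`(c("2023-07-16", NA), "Date")
typeof(bad)       # "character"
```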

If done in place, it can be a little less verbose:

mt2[] <- Map(function(this, oth) `class<-`(this, class(oth)), mt2, mt1)

The use of mt2[] on the LHS of the assignment ensures that the overall class of "data.frame" is preserved (otherwise it'll be a list).
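
To illustrate the point about `mt2[]`, here is a small sketch using the same sample data; the intermediate name `fixed` is introduced only for this example.

```r
mt1 <- mtcars[1:3, ]
mt2 <- mtcars[1:3, ]
class(mt2$cyl) <- "character"

fixed <- Map(function(this, oth) `class<-`(this, class(oth)), mt2, mt1)

# Map() returns a bare list, so `mt2 <- fixed` would drop the data frame.
class(fixed)

# Assigning into mt2[] fills the existing columns instead, so the
# "data.frame" class, row names and column names are kept.
mt2[] <- fixed
class(mt2)
```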

dplyr

library(dplyr)
tibble(mt2)
# # A tibble: 3 × 11
#     mpg cyl    disp    hp  drat    wt  qsec    vs    am  gear  carb
#   <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1  21   6       160   110  3.9   2.62  16.5     0     1     4     4
# 2  21   6       160   110  3.9   2.88  17.0     0     1     4     4
# 3  22.8 4       108    93  3.85  2.32  18.6     1     1     4     1
tibble(mt2) %>%
  mutate(across(everything(), ~ `class<-`(.x, class(mt1[[cur_column()]]))))
# # A tibble: 3 × 11
#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
# 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
# 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • excellent: both work except for throwing an error in `date_validate`, either directly in the dplyr approach, or when converting to tibble after base R. this due to some date-variables-to-be in df2 being char with values of NA. solved by explicitly mutating these vars right before your code, using `mutate(df2,across(all_of(myDateVars),as.Date))`. guessing `as.Date` somehow turns NA into something that can now pass the `date_validate` check? still a little messy bc I need to locate `myDateVars`, but if that's the only data type that needs some special treatment that's fine :) – mrroy Jul 16 '23 at 15:04
  • Yes, `as.Date` sometimes needs `format=` (if not an unambiguous format such as `"%Y-%m-%d"`) or `origin=` (if numeric), those are odd exceptions. There might be ways to work around this, involving conditional logic of `if (inherits(x, "Date")) ... else if (inherits(x, "POSIXt")) ... else ...`. That's likely the easiest way to repeat it without needing your `myDateVars` steps. (Side note: this would have been apparent had you shared sample data in your question.) – r2evans Jul 16 '23 at 18:35
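
The conditional logic suggested in the comment above could be sketched roughly as follows. The helper name `coerce_like`, the toy data frames, and the choice of `tz = "UTC"` are assumptions made for illustration, not part of the original answer. The sketch also assumes dates arrive as character strings in an unambiguous format, and it does not handle time-of-day classes such as hms.

```r
# Hypothetical helper: convert `x` to the type of the template column.
coerce_like <- function(x, template) {
  if (inherits(template, "Date")) {
    as.Date(x)                        # all-NA character input becomes NA dates
  } else if (inherits(template, "POSIXt")) {
    as.POSIXct(x, tz = "UTC")         # time zone is an assumption
  } else if (is.factor(template)) {
    factor(x, levels = levels(template))
  } else {
    switch(typeof(template),
      double    = as.numeric(x),
      integer   = as.integer(x),
      character = as.character(x),
      logical   = as.logical(x),
      x                               # unknown type: leave unchanged
    )
  }
}

# Toy stand-ins for df1 (reference types) and df2 (types to repair).
df1 <- data.frame(n = c(1.5, 2),
                  d = as.Date(c("2023-07-16", "2023-07-17")),
                  s = c("a", "b"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(n = c("3", "4"),
                  d = c(NA_character_, NA_character_),
                  s = c("c", "d"),
                  stringsAsFactors = FALSE)

df2[] <- Map(coerce_like, df2, df1)
sapply(df2, class)
```

Because each branch calls a real conversion function rather than only relabelling the class, a character column holding nothing but NA ends up as a genuine Date column, which should avoid the `myDateVars` step described earlier.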
0

In the tidyverse you could use readr's type_convert:

That is, create a vector containing the column names and data types from the reference dataset (df1), or build it manually. Then pass it to type_convert to convert the columns of the newly imported dataset:

type_convert(df2, cols(!!!map_chr(df1, class)))
Onyambu
  • 67,392
  • 3
  • 24
  • 53