4

In readr, the read_csv command handles duplicate column names by renaming the second duplicate and leaves the first unaltered. See the following example, taken from https://github.com/tidyverse/readxl/issues/53.

readr::read_csv("x,x,y\n1,2,3\n")
#> Warning: Duplicated column names deduplicated: 'x' => 'x_1' [2]
#> # A tibble: 1 × 3
#>       x   x_1     y
#>   <int> <int> <int>
#> 1     1     2     3

How can I get readxl::read_excel handle duplicate columns the same way?

Rob Creel
  • 323
  • 1
  • 8

1 Answers1

4

You can use the .name_repair argument and pass make.unique() as a function:

library(readxl)

read_excel(path = "temp.xlsx", .name_repair = ~make.unique(.x, sep = "_"))

# A tibble: 1 x 3
      x   x_1     y
  <dbl> <dbl> <dbl>
1     1     2     3
Ritchie Sacramento
  • 29,890
  • 4
  • 48
  • 56
  • This does what I asked, but it has a bad side effect. My actual Excel files have blank columns (I have no control over this). So it leaves the first blank column named as the empty string, which causes problems with dplyr down the road. I can use `select(-"") `. to fix, but I'm just curious, is it possible to pre-empt this problem on the read-in? – Rob Creel Nov 17 '20 at 03:15
  • 1
    You can rename the empty columns to something more identifiable on the read-in (so they become for example ("blank", "blank_1") etc - then you can easily remove them with using `select(-starts_with("blank"))`. is that what you had in mind? – Ritchie Sacramento Nov 17 '20 at 04:04
  • Sort of. I think you're still describing a rename after the read in, which seems like it would totally work, but I'm still academically curious about whether it could be done as part of the read in function call itself. – Rob Creel Nov 17 '20 at 16:29
  • 1
    Not exactly - the rename is part of the read-in using name repair. `read_excel(path = "temp.xlsx", .name_repair = ~make.unique(sub("^$", "blank", .x), sep = "_")) %>% select(-starts_with("blank"))` – Ritchie Sacramento Nov 17 '20 at 20:00