Dealing with ambiguous column names in data.table

Question

I'm reading data from an Excel file into a data.table in R. The file is formatted like so:

   COL_1_STUFF COL_2_STUFF COL_3_STUFF
ID EST MOE PCT EST MOE PCT EST MOE PCT

That is, for each variable (the COLs), there is an estimate, a margin of error, and a percentage given.

The trouble is being created by read.xlsx2, which I'm using to import the file like so:

data <- as.data.table(read.xlsx2(
  "file.xlsx", sheetIndex = 1L, colIndex = c(1L, 4L, 7L), startRow = 2L))

The problem is that read.xlsx2 assigns the same column name to a bunch of stuff--the import looks something like:

ID EST EST EST

even if I set header = FALSE, I'm apt to get something like

X1 X2 X2 X2

To circumvent this, I've done the following subsequent to import:

data[ , c("col1_est", "EST") := .(EST, NULL)]
data[ , c("col2_est", "EST") := .(EST, NULL)]
data[ , c("col3_est", "EST") := .(EST, NULL)]

This strikes me as an odd way to deal with the problem; can anyone suggest an alternate approach to this?

May be you could try to convert your Excel file to csv and use `fread` of data.table to read the data into R. — KFB, Jan 10 '15 at 00:04
It sounds like you're dealing with multi-row headings. What do you want the output to be like? — A5C1D2H2I1M1N2O1R2T1, Jan 10 '15 at 03:55
@KFB that crossed my mind, but I'd like to find a solution that doesn't involve that--it's not the case for me, but I could imagine someone having a large number of files formatted like mine looking for a programmatic solution — MichaelChirico, Jan 10 '15 at 22:27
@AnandaMahto yes, that's the vocab word I was looking for! as long as I can successfully import to a data frame / data table object without ambiguously named columns, I can deal with the rest of cleaning/manipulation within R. — MichaelChirico, Jan 10 '15 at 22:29

MichaelChirico · Accepted Answer · 2021-05-06T22:43:52.417

3

This can be done with the check.names argument to fread, data.table, and setDT:

DT <- read.xlsx2(
  "file.xlsx", sheetIndex = 1L, colIndex = c(1L, 4L, 7L), startRow = 2L)
)
setDT(DT, check.names = TRUE)

Automatically adds .1, .2, etc. to duplicated column names.

edited May 06 '21 at 22:43

answered Jul 18 '15 at 16:50

MichaelChirico

1 Answers1