-1

I'm reading data from an Excel file into a data.table in R. The file is formatted like so:

   COL_1_STUFF COL_2_STUFF COL_3_STUFF
ID EST MOE PCT EST MOE PCT EST MOE PCT

That is, for each variable (the COLs), there is an estimate, a margin of error, and a percentage given.

The trouble is being created by read.xlsx2, which I'm using to import the file like so:

data <- as.data.table(read.xlsx2(
  "file.xlsx", sheetIndex = 1L, colIndex = c(1L, 4L, 7L), startRow = 2L))

The problem is that read.xlsx2 assigns the same column name to a bunch of stuff--the import looks something like:

ID EST EST EST

even if I set header = FALSE, I'm apt to get something like

X1 X2 X2 X2

To circumvent this, I've done the following subsequent to import:

data[ , c("col1_est", "EST") := .(EST, NULL)]
data[ , c("col2_est", "EST") := .(EST, NULL)]
data[ , c("col3_est", "EST") := .(EST, NULL)]

This strikes me as an odd way to deal with the problem; can anyone suggest an alternate approach to this?

smci
  • 32,567
  • 20
  • 113
  • 146
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
  • May be you could try to convert your Excel file to csv and use `fread` of data.table to read the data into R. – KFB Jan 10 '15 at 00:04
  • 2
    It sounds like you're dealing with multi-row headings. What do you want the output to be like? – A5C1D2H2I1M1N2O1R2T1 Jan 10 '15 at 03:55
  • @KFB that crossed my mind, but I'd like to find a solution that doesn't involve that--it's not the case for me, but I could imagine someone having a large number of files formatted like mine looking for a programmatic solution – MichaelChirico Jan 10 '15 at 22:27
  • 1
    @AnandaMahto yes, that's the vocab word I was looking for! as long as I can successfully import to a data frame / data table object without ambiguously named columns, I can deal with the rest of cleaning/manipulation within R. – MichaelChirico Jan 10 '15 at 22:29

1 Answers1

3

This can be done with the check.names argument to fread, data.table, and setDT:

DT <- read.xlsx2(
  "file.xlsx", sheetIndex = 1L, colIndex = c(1L, 4L, 7L), startRow = 2L)
)
setDT(DT, check.names = TRUE)

Automatically adds .1, .2, etc. to duplicated column names.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198