Import in R with column headers across 3 rows. Replace missing with latest non-missing column

Question

I need help importing data where my column header is split across 3 rows, with some header names implied. Here is what my xlsx file looks like

1                         USA                             China
2                         Dollars         Volume          Dollars           Volume
3  Category   Brand       CY2016  CY2017  CY2016  CY2017  CY2016   CY_2017  CY2016   CY2017
4  Chocolate  Snickers    100     120     15      18      100      80       20       22
5  Chocolate  Twix        70      80      8       10      75       50       55       20

I would like to import the data into R, except I would like to retain the headers in rows 1 & 2. An added challenge is that some headers are implied. If a header is blank, I would like it to use the cell in the column to the left. An example of what I'd like it to import as.

1  Category   Brand       USA_Dollars_CY2016  USA_Dollars_CY2017  USA_Volume_CY2016  USA_Volume_CY2017  China_Dollars_CY2016   China_Dollars_CY_2017  China_Volume_CY2016   China_Volume_CY2017
2  Chocolate  Snickers    100                 120                 15                 18                 100                    80                     20                    22
3  Chocolate  Twix        70                  80                  8                  10                 75                     50                     55                    20

My current method is to import, skipping rows 1 & 2 and then just rename the columns based on known position. However, I was hoping code existed to that would prevent me from this step. Thank you!!

mrbrich · Accepted Answer · 2018-01-07T23:32:07.973

2

I will assume that you have saved the xlsx data in .csv format, so it can be read in like this:

header <- read.csv("data.csv", header=F, colClasses="character", nrow=3)
dat <- read.csv("data.csv", header=F, skip=3)

The tricky part is the header. This function should do it:

construct_colnames <- function(header) {
    f <- function(x) {
        x <- as.character(x)
        c("", x[!is.na(x) & x != ""])[cumsum(!is.na(x) & x != "") + 1]
    }
    res <- apply(header, 1, f)
    res <- apply(res, 1, paste0, collapse="_")
    sub("^_*", "", res)
}
colnames(dat) <- construct_colnames(header)
dat

Result:

   Category    Brand USA_Dollars_CY2016 USA_Dollars_CY2017 USA_Volume_CY2016 USA_Volume_CY2017 China_Dollars_CY2016
1 Chocolate Snickers                100                120                15                18                  100
2 Chocolate     Twix                 70                 80                 8                10                   75
  China_Dollars_CY_2017 China_Volume_CY2016 China_Volume_CY2017
1                    80                  20                  22
2                    50                  55                  20

edited Jan 07 '18 at 23:32

answered Jan 06 '18 at 05:16

mrbrich

853
1
8
9

header <- read_excel("data.xlsx", sheet = "Choc", col_names=F, col_types="text", range=cell_rows(1:3)) dat <- read_excel("data.xlsx", sheet = "Choc", col_names=F, skip=3) – Derek Lilley Jan 06 '18 at 14:06
Hi mrbrich, my current data is still in an xlsx, as I have multiple sheets to read in. While your code does work if my file is saved as a csv, ideally I could get it to work while reading in a xlsx file. I quickly wrote the above and while it pretty much mirrors your original dat & header dataframes, your construct_colnames function only works from a read.csv imports? Any chance you could me again? Thanks a ton!! – Derek Lilley Jan 06 '18 at 14:10
It should work just the same, but it might depend on how read_excel converts the empty cells. I modified the function so it should work if they are NA. – mrbrich Jan 07 '18 at 23:36
Could you tell me if it works now and if so, maybe accept the answer? – mrbrich Jan 15 '18 at 07:58

Import in R with column headers across 3 rows. Replace missing with latest non-missing column

1 Answers1