0

I have a data.table where the data I want is structured in a diagonal fashion.

library(data.table)
month <- c(201406, 201406, 201406, 201406, 201406, 201406, 201406, 201406, 
201406, 201406, 201406, 201406)
code <- c("498A01", "498A01", "498A01", "498A01", "498A01", "498A01", "498A01", "498A01", 
"498A01", "498A01", "498A01", "498A01")
col.a <- c("service", "base charge", "", "", "", "", "", "", "", "", "", "")
col.b <- c("", "", "description", "per unit", "", "", "", "", "", "", "", "")
col.c <- c("", "", "", "", "rate", 6859, "", "", "", "", "", "")
col.d <- c("", "", "", "", "", "", "quantity", 1, "", "", "", "")
col.e <- c("", "", "", "", "", "", "", "", "total charge", 6859, "", "")
col.f <- c("", "", "", "", "", "", "", "", "", "", "", "")   
dt <- data.table(month, code, col.a, col.b, col.c, col.d, col.e, col.f)

However, I need to organize the data in a more coherent fashion to simplify dt I am fairly new to data.table and I was wondering if there was a straightforward way to do so.

For col.a I know the following works for one column:

dt <- dt[col.a != "", 1:8, by = .(code, month)

But when I try for multiple columns it returns a data table with 0 obs. I suppose I could do that for all of the columns and then do some kind of merge but that seems inefficient and cumbersome. Is there a better way?

My desired output is:

   month   code      col.a      col.b     col.c   col.d       col.e    col.f
1: 201406 498A01     service description   rate quantity total charge       
2: 201406 498A01 base charge    per unit   6859        1         6859 

So for each unique combination of code and month I want to remove the empty cells and collapse the data to look like it does above. I need to keep the col.f1 because it may not always be blank.

Any suggestions would be greatly appreciated.

jvalenti
  • 604
  • 1
  • 9
  • 31
  • 1
    I'm not sure I completely understand your data structure, using diagonals like this. Also, your data tables don't do anything with `month` or `code`, so I'm not sure what the entries are supposed to look like. – Anonymous coward Aug 02 '18 at 15:41
  • @Anonymouscoward that's how the data is in the data frame, this is just an example. It was all in one column before, which is even worse. it's a messy problem for sure, hence the post. I went ahead and added `month` and `code` to the data tables. – jvalenti Aug 02 '18 at 15:43
  • Ok, I think I see now, just correct me if I'm wrong. Is it that you have data where 12 rows are essentially one entry? And the next entry is the same, but for month `201407`? – Anonymous coward Aug 02 '18 at 15:49
  • @Anonymouscoward the next entry would be the same as long as the value of `code` were the same. `month` would be different, depending on how the data was sorted, `201407` or `201405`. I would imagine there is something that could be done using the `by = list(code, month)` but I am not sure how to implement that – jvalenti Aug 02 '18 at 15:55

1 Answers1

3

Are you looking for something like

dt[, lapply(.SD, function(x) x[x!=""][1:2]), by=.(month, code)]

output:

    month   code       col.a       col.b col.c    col.d        col.e col.f
1: 201406 498A01     service description  rate quantity total charge  <NA>
2: 201406 498A01 base charge    per unit  6859        1         6859  <NA>

Or in base R:

do.call(rbind, by(dt, paste(dt$month, dt$code), 
    function(y) do.call(cbind, lapply(y, function(x) x[x!=""][1:2]))))

output:

     month    code     col.a         col.b         col.c  col.d      col.e          col.f
[1,] "201406" "498A01" "service"     "description" "rate" "quantity" "total charge" NA   
[2,] "201406" "498A01" "base charge" "per unit"    "6859" "1"        "6859"         NA   
chinsoon12
  • 25,005
  • 4
  • 25
  • 35