This question is similar to this, but it's got a C# answer, and I need a R answer.
I have some 50 files of about 650 rows with a format and data very similar to this toy data:
dput(y)
structure(list(level1 = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L), level2 = c(NA, 41L, 41L, 41L, 41L, 41L, 41L, 41L,
42L, 42L, 42L, 42L), level3 = c(NA, NA, 4120L, 4120L, 4120L,
4120L, 4120L, 4120L, NA, 4210L, 4210L, 4210L), level4 = c(NA,
NA, NA, 412030L, 412030L, 412050L, 412050L, 412050L, NA, NA,
421005L, 421005L), pid = c(NA, NA, NA, NA, 123456L, NA, 789012L,
345678L, NA, NA, NA, 901234L), description = c("income", "op.income",
"manuf.industries", "manuf 1", "client 1", "manuf 2", "client 2",
"client 3", "non-op.income", "financial", "interest", "bank 1"
), value = c(NA, NA, NA, NA, 15000L, NA, 272860L, 1150000L, NA,
NA, NA, 378L)), .Names = c("level1", "level2", "level3", "level4",
"pid", "description", "value"), class = c("data.table", "data.frame"
), row.names = c(NA, -12L), .internal.selfref = <pointer: 0x00000000001a0788>)
Each of the rows that have a value on value
are a "leaf" o a tree, with branches identified in columns level
1 to 4. I want to summarize the leafs by brach and put the corresponding values in the value
column.
My expected output looks like this:
dput(res)
structure(list(level1 = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L), level2 = c(NA, 41L, 41L, 41L, 41L, 41L, 41L, 41L,
42L, 42L, 42L, 42L), level3 = c(NA, NA, 4120L, 4120L, 4120L,
4120L, 4120L, 4120L, NA, 4210L, 4210L, 4210L), level4 = c(NA,
NA, NA, 412030L, 412030L, 412050L, 412050L, 412050L, NA, NA,
421005L, 421005L), pid = c(NA, NA, NA, NA, 123456L, NA, 789012L,
345678L, NA, NA, NA, 901234L), description = c("income", "op.income",
"manuf.industries", "manuf 1", "client 1", "manuf 2", "client 2",
"client 3", "non-op.income", "financial", "interest", "bank 1"
), value = c(1438238L, 1437860L, 1437860L, 15000L, 15000L, 1422860L,
272860L, 1150000L, 378L, 378L, 378L, 378L)), .Names = c("level1",
"level2", "level3", "level4", "pid", "description", "value"), class = c("data.table",
"data.frame"), row.names = c(NA, -12L), .internal.selfref = <pointer: 0x00000000001a0788>)
I know this can be done with a for-loop, but I wanted to know if there is any faster, simpler alternative (I prefer data.table
or base-solutions, but any other package works ok too). What I've tried so far:
z4<-y[!is.na(pid),sum(value),by=level4]
setkey(y,"level4");setkey(z4,"level4")
y[z4,][is.na(pid)]
This shows me the desired values in V1
, so I wanted to see if I could assign them to value
:
y[z4,][is.na(pid),value:=i.V1]
Error in eval(expr, envir, enclos) : object 'i.V1' not found
I think this could be caused because the call i.V1
is in the chained [
and not in the initial y[z4
call. But if I only subset on z4
, how can I know which of the several matching level4
rows I should assign (that's why I'm thinking of using is.na(pid)
, because y[z4,value:=i.V1]
produces the wrong result, as it updates all values that match level4
).
As you can see, I'm badly stuck at this problem, and with "my method" I still would have 3 more levels to go.
Is there any easier way to do this?