
I have the following data

    path     value
1 b,b,a,c     3
2     c,b     2
3       a    10
4 b,c,a,b     0
5     e,f     0
6     a,f     1

The data frame df can be created with:

df <- data.frame(path = c("b,b,a,c", "c,b", "a", "b,c,a,b", "e,f", "a,f"), value = c(3, 2, 10, 0, 0, 1))

For each factor, I wish to count the rows in which the factor does not appear in path and the value is not zero. So my desired output is:

 # desired output
    path value
1:    b     2
2:    a     1
3:    c     2
4:    e     4
5:    f     3

For instance, for a the count is 1: row 2 is the only row in which a does not appear and the value is not zero. (I hope this is clear; please let me know if more examples are needed.)
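The count for a single factor can be checked by hand. The following is a minimal base-R sketch, assuming the df defined above; the helper names parts and has_a are introduced here only for illustration.

```r
# Data from the question
df <- data.frame(path = c("b,b,a,c", "c,b", "a", "b,c,a,b", "e,f", "a,f"),
                 value = c(3, 2, 10, 0, 0, 1),
                 stringsAsFactors = FALSE)

# Split each path into its factors (one character vector per row)
parts <- strsplit(df$path, ",", fixed = TRUE)

# TRUE for rows whose path contains the factor "a"
has_a <- vapply(parts, function(p) "a" %in% p, logical(1))

# Rows that lack "a" and have a non-zero value
rows_a <- which(!has_a & df$value != 0)   # row 2 only
count_a <- length(rows_a)                 # 1
```

Rows 2 and 5 lack a, but row 5 has value 0, so only row 2 is counted.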

I tried the following code, but the output for b is wrong. Does anyone know why?

total <- sum(df$value != 0)

library(splitstackshape)

# number of non-zero rows minus the number of non-zero entries for each factor

output <- cSplit(df, "path", ",", "long")[, .(value = total - sum(value != 0)), .(path)]

output

This code results in the following output, which is not correct for b:

   path value
1:    b     1
2:    a     1
3:    c     2
4:    e     4
5:    f     3
MFR

1 Answer


Read the factors into facs, then use grepl to find the rows that lack each factor and count those with a non-zero value:

facs <- unique(scan(textConnection(as.character(df$path)), what = "", sep = ","))
data.frame(path = facs, 
           value = colSums( !sapply(facs, grepl, as.character(df$path)) & df$value != 0 ))

giving:

  path value
b    b     2
a    a     1
c    c     2
e    e     4
f    f     3
G. Grothendieck
  • Thanks @G. Grothendieck, it is correct. However, it does not seem to be the most efficient way and it is too slow. Do you have any idea what is wrong with my code? – MFR Nov 12 '16 at 02:44
  • I mean yours is correct, but I posted this question hoping that someone would tell me what is wrong with my code rather than provide a solution. BTW, I appreciate your nice idea for solving this question and accept it. – MFR Nov 13 '16 at 03:54
  • The code in the question does not keep track of the original row that each row in the long data table came from so it can't be correct. – G. Grothendieck Nov 13 '16 at 04:52
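The point about losing track of the original row can be illustrated in base R. This is a sketch under assumptions, not the code from the question or the answer: it rebuilds the long format with strsplit instead of cSplit, and the names long, long_u, and result are hypothetical. Because b occurs twice in row 1, that row contributes two non-zero entries for b in the long format, so the subtraction removes it twice.

```r
# Data from the question
df <- data.frame(path = c("b,b,a,c", "c,b", "a", "b,c,a,b", "e,f", "a,f"),
                 value = c(3, 2, 10, 0, 0, 1),
                 stringsAsFactors = FALSE)
total <- sum(df$value != 0)   # number of rows with a non-zero value

parts <- strsplit(df$path, ",", fixed = TRUE)

# Long format with duplicates kept: one entry per factor occurrence
long <- data.frame(path = unlist(parts),
                   value = rep(df$value, lengths(parts)),
                   stringsAsFactors = FALSE)
nonzero_occurrences <- tapply(long$value != 0, long$path, sum)
wrong_b <- total - unname(nonzero_occurrences["b"])   # row 1 is subtracted twice

# Possible fix: keep each factor at most once per original row before counting
uniq <- lapply(parts, unique)
long_u <- data.frame(path = unlist(uniq),
                     value = rep(df$value, lengths(uniq)),
                     stringsAsFactors = FALSE)
nonzero_rows <- tapply(long_u$value != 0, long_u$path, sum)
result <- total - nonzero_rows

# Counts in the order used in the question
result[c("b", "a", "c", "e", "f")]
```

With duplicates kept, b comes out as 1, matching the incorrect output in the question; after de-duplicating within each row it comes out as 2, and the other factors are unchanged. In the cSplit pipeline the analogous step would presumably be to add a row identifier before splitting and drop duplicate (row, path) pairs, though that variant is not tested here.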