I'd like to extract specific strings from a column of a data frame and count how often each one occurs.

After trying a few approaches I've settled on the one below, but my results are slightly off.

Example:

Data frame (df):

data
abc hello
hello
aaa
zxy
xyz

Strings to search for (list):

list
abc
bcd
efg
aaa

My code:

lapply(list$list, function(x){
    t <- data.frame(words = stri_extract(df$data, coll=x))
    t<- setDT(t)[, .( Count = .N), by = words]
    t<-t[complete.cases(t$words)]
    result<-rbind(result,t)
    write.csv(result, "new.csv", row.names = F)
})

In this example I would expect a CSV file with the following results:

words Count
abc     1
aaa     1

However with my code I got:

words Count
aaa     1

I know stri_extract should identify abc within abc hello so perhaps the error happens when I use rbind?

Peter Mortensen
Oli

1 Answer

You need to move the write.csv call out of the loop; otherwise each iteration overwrites the previously saved file, and you end up with only the file written in the final iteration. Once you do that, you also have to rbind the results outside lapply, because assigning to result with <- inside the function only changes a local copy, not the result variable outside it.

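library(stringi)     # stri_extract
library(data.table)  # setDT

# Build one small count table per search string, then bind them together.
result <- do.call(rbind, lapply(list$list, function(x) {
    t <- data.frame(words = stri_extract(df$data, coll = x))
    t <- setDT(t)[, .(Count = .N), by = words]
    t[complete.cases(t$words)]  # drop the NA group (rows without a match)
}))

write.csv(result, "new.csv", row.names = FALSE)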
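To see why the rbind inside the function never accumulates anything, here is a minimal base-R sketch. The two search strings and the Count of 1 are made-up placeholders, not output from the real data; the point is only that <- inside the function binds a new local result.

```r
# Outer accumulator, empty to start with.
result <- data.frame(words = character(), Count = integer())

out <- lapply(c("abc", "aaa"), function(x) {
    # This assignment creates a local `result`; the outer one is untouched.
    result <- rbind(result, data.frame(words = x, Count = 1L))
    nrow(result)  # 1 on every iteration, because each call starts from the empty outer table
})

nrow(result)  # the outer table still has no rows
```

This is why the approach above returns each piece from the function and combines them afterwards with do.call(rbind, ...).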
Psidom
  • Thanks, very helpful – Oli Jun 01 '16 at 11:26
  • Couldn't you have write.csv within the loop with append=T? That would probably just slow down the process anyway, I only need to write once, just asking – Oli Jun 01 '16 at 15:40
  • That is a viable solution, too. You can go ahead and have a try. Not sure about the performance. – Psidom Jun 01 '16 at 23:09
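A note on the append idea: as far as the R documentation describes it, write.csv ignores an append argument (with a warning), so appending inside a loop would need write.table instead. A minimal sketch, assuming made-up chunks and a temporary file in place of new.csv:

```r
out_file <- tempfile(fileext = ".csv")  # hypothetical output path

# Placeholder pieces standing in for the per-string count tables.
chunks <- list(
    data.frame(words = "abc", Count = 1L),
    data.frame(words = "aaa", Count = 1L)
)

for (i in seq_along(chunks)) {
    first <- (i == 1)
    write.table(chunks[[i]], out_file, sep = ",", row.names = FALSE,
                col.names = first,  # write the header only once
                append = !first)    # start a fresh file, then append
}

res <- read.csv(out_file)
```

Writing once after combining the pieces is simpler; this variant only matters if the pieces cannot all be held in memory at the same time.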