
I have a data frame like the following:

str(data)
'data.frame':   255 obs. of  3 variables:
$ Group      : Factor w/ 255 levels "AlzGroup1","AlzGroup10",..: 1 112 179 190 201 212 223 234 245 2 ...
$ Gender     : int  1 1 0 0 0 0 0 1 0 0 ...
$ Description: Factor w/ 255 levels "A boy's on the uh falling off the stool picking up cookies . The girl's reaching up for it . The girl the lady "| __truncated__,..: 63 69 38 134 111 242 196 85 84 233 ...

In the Description column I have 255 speeches, and I want to add a column to my data frame containing the number of verbs in each speech. I know how to get the number of verbs, but the following code gives me the total number of verbs across the whole Description column:

library(NLP)
library(tm)
library(openNLP)

NumOfVerbs <- sapply(strsplit(as.character(tagPOS(data$Description)), "[[:punct:]]*/VB.?"),
                     function(x) {res <- sub("(^.*\\s)(\\w+$)", "\\2", x); res[!grepl("\\s", res)]})

Does anyone know how I can get the number of verbs in each speech?

Thanks for any help!

Elahe

ch.elahe
  • If you can count the verbs then you can also use dplyr::group_by to group by speech and summarise(n()) to count. I think you might get better quality help if you post a reproducible example rather than the structure of your data. Just use `dput(data)` and paste the output here. – biomiha Oct 29 '17 at 19:14

1 Answer


Assuming you are using a function similar to this one (found here: could not find function tagPOS):

tagPOS <- function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  # Treat the whole input as one sentence, then tokenise it into words
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  # Attach a POS tag to every word token
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}
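
For a quick sanity check on a single sentence (assuming NLP and openNLP are loaded as in the question; the exact tags depend on the Maxent model you have installed):

tagPOS("The girl is reaching up for the cookies .")$POStags
# something like: "DT" "NN" "VBZ" "VBG" "RP" "IN" "DT" "NNS" "."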

Create a function that counts the number of POS tags containing 'VB' (this covers all the verb tags: VB, VBD, VBG, VBN, VBP and VBZ):

count_verbs <- function(x) {
  pos_tags <- tagPOS(x)$POStags
  sum(grepl("VB", pos_tags))
}
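
Because VBZ, VBG, VBD and so on all contain "VB", the same example sentence should come back with two verbs (again, the exact result depends on the model):

count_verbs("The girl is reaching up for the cookies .")
# [1] 2   (is/VBZ and reaching/VBG)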

And use dplyr to group by Group and summarise using count_verbs():

library(dplyr)
data %>% 
  group_by(Group) %>%
  summarise(num_verbs = count_verbs(Description))
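
If you prefer to stay in base R, or if grouping is awkward because Description is a factor, a minimal sketch that applies count_verbs() to each speech and stores the result as a new column would be:

data$NumOfVerbs <- sapply(as.character(data$Description), count_verbs,
                          USE.NAMES = FALSE)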
clemens
  • I cannot run the very last part of your answer, summarise(num_verbs = count_verbs(data$Description)); I get this error: Error in summarise_(.data, .dots = compat_as_lazy_dots(...)) : argument ".data" is missing, with no default – ch.elahe Oct 30 '17 at 10:50
  • That's because you pass `data$Description` as an argument to `count_verbs()`. If you look at the chain above, you will see that I only pass `Description`. – clemens Oct 30 '17 at 14:38
  • You're right, but using what you did in the chain brings a new error: Error in summarise_impl(.data, dots) : Evaluation error: java.lang.OutOfMemoryError: Java heap space. Do you have any idea how we can use this solution NumOfVerbs=sapply(strsplit(as.character(tagPOS(data$Description)),"[[:punct:]]*/VB.?"),function(x) {res = sub("(^.*\\s)(\\w+$)", "\\2", x); res[!grepl("\\s",res)]} ) to count verbs in each speech? – ch.elahe Oct 30 '17 at 15:41
  • You can restart R and use `options(java.parameters = "-Xmx8g")` (example for 8 GB; you can increase or decrease it depending on your machine) to increase the memory that Java can use. You can also try to explicitly call the garbage collector `gc()` in `tagPOS()` to free up memory. – clemens Oct 30 '17 at 15:48
  • Thanks a lot! this was exactly what I was looking for! – ch.elahe Oct 30 '17 at 16:50
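
As a footnote to the memory discussion in the comments above, a minimal sketch of raising the Java heap limit; the option has to be set before any rJava-backed package such as openNLP is attached, and 8g is only an example value:

options(java.parameters = "-Xmx8g")  # must run before rJava/openNLP are loaded
library(NLP)
library(openNLP)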