I would like to use R to search a text for patterns that mix part-of-speech (POS) tags and literal strings. (I have seen this functionality in a Python library here: http://www.clips.ua.ac.be/pages/pattern-search).

For instance, a search pattern could be: 'NOUNPHRASE be|is|was ADJECTIVE than NOUNPHRASE', and should return all strings containing structures like: "a cat is faster than a dog".

I know that packages like openNLP and qdap offer convenient POS tagging. Has anyone used their output for this kind of pattern matching?

lawyeR
nassimhddd

1 Answer

As a starter, using koRpus and TreeTagger:

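library(koRpus)  # treetag(), taggedText(), kRp.filter.wclass()
library(tm)      # stopwords()
mytxt <- c("This is my house.", "A house is better than no house.", "A cat is faster than a dog.")
# Regex over the POS descriptions (not the words) of each sentence
pattern <- "Noun, singular or mass.*?Adjective, comparative.*?Noun, singular or mass"

# Tag the text with TreeTagger; adjust the treetagger path to your installation
tagged.results <- treetag(file = mytxt, treetagger="C:/TreeTagger/bin/tag-english.bat", lang="en", format="obj", stopwords=stopwords("en"))
# Drop stopwords, then number the sentences: a new id starts after each sentence-ending punctuation mark
tagged.results <- kRp.filter.wclass(tagged.results, "stopword")
taggedText(tagged.results)$id <- factor(head(cumsum(c(0, taggedText(tagged.results)$desc == "Sentence ending punctuation")) + 1, -1))

# Collapse the descriptions of each sentence into one string and test it against the pattern
setNames(mytxt, grepl(pattern, aggregate(desc ~ id, taggedText(tagged.results), FUN = paste, collapse = " ")$desc))
#               FALSE                               TRUE                               TRUE 
# "This is my house." "A house is better than no house."      "A cat is faster than a dog."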
lukeA
  • Thanks, that's quite nice. Is there a way to mix POS tags and actual words? For instance, if I want only cats as the first noun: "cat*?Adjective, comparative.*?Noun, singular or mass"? – nassimhddd Mar 30 '15 at 09:23
  • Well, I guess that's possible, too. One strategy could be to concatenate words and word classes before applying the regular expression. – lukeA Mar 30 '15 at 09:29
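
The concatenation strategy from the last comment could look roughly like the sketch below. It is only an illustration under a few assumptions: the `tagged` data frame is hand-written and merely stands in for tagger output (it is not produced by TreeTagger), short Penn-style tags such as `NN` and `JJR` are used instead of the long descriptions so that a token and its tag can be fused with a `/` separator, and the column names `id`, `token`, `tag` and `pair` are hypothetical.

```r
# Hand-written stand-in for tagger output (an assumption, not real
# TreeTagger results): one row per token, with a sentence id and a tag.
tagged <- data.frame(
  id    = rep(1:2, times = c(7, 7)),
  token = c("A", "house", "is", "better", "than", "no", "house",
            "A", "cat", "is", "faster", "than", "a", "dog"),
  tag   = c("DT", "NN", "VBZ", "JJR", "IN", "DT", "NN",
            "DT", "NN", "VBZ", "JJR", "IN", "DT", "NN"),
  stringsAsFactors = FALSE
)

# Fuse each token with its tag, then collapse every sentence into one
# string such as "A/DT cat/NN is/VBZ ...".
tagged$pair <- paste(tagged$token, tagged$tag, sep = "/")
sentences <- tapply(tagged$pair, tagged$id, paste, collapse = " ")

# Mixed pattern: the literal word "cat" tagged as a noun, later any
# comparative adjective, later any singular noun.
pattern <- "\\bcat/NN\\b.*?\\S+/JJR\\b.*?\\S+/NN\\b"
hits <- grepl(pattern, sentences, perl = TRUE)
```

Here `hits` holds one logical value per sentence id. Because words and tags live in the same string, a literal word, a tag, or a word/tag pair can each be constrained in a single regular expression; the price is that the separator (`/` here) must not occur inside tokens.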