I am looking for a tool to find part-of-speech (POS) patterns in a corpus of documents. I am using the Stanford NLP tools to POS-tag my documents. Now I would like to query the tagged documents for specific POS patterns, for example:

NP is JJ (e.g. "the movie is nice")

or JJ NP (e.g. "excellent foie gras")

Is there a tool that can do this simply and efficiently, or do I need to write my own?

azpublic

2 Answers

From Stanford CoreNLP, you can also use TokensRegex to match a pattern in a list of tokens: http://nlp.stanford.edu/software/tokensregex.shtml

For example, your two patterns would be something like:

[{tag:NN}] [{word:is}] [{tag:JJ}]

[{tag:JJ}] [{tag:NN}]

(Side note: NP is not a POS tag. What you probably want is [{tag:/N.*/}] in place of [{tag:NN}] and [{lemma:be}] in place of [{word:is}], to catch a broader range of cases.)
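
For readers who want to see what such a pattern does before setting up TokensRegex, here is a minimal sketch in plain Python. It is not TokensRegex and does not call CoreNLP; it assumes the corpus has already been exported as (word, tag, lemma) triples, and the helper names `find_matches`, `tag`, and `lemma` are hypothetical ones made up for this illustration.

```python
import re

def tag(regex):
    """Predicate: the token's POS tag fully matches the given regex."""
    compiled = re.compile(regex)
    return lambda tok: compiled.fullmatch(tok[1]) is not None

def lemma(value):
    """Predicate: the token's lemma equals the given string."""
    return lambda tok: tok[2] == value

def find_matches(tokens, pattern):
    """Return the words of every window of consecutive tokens that
    satisfies the predicates in `pattern`, one predicate per token."""
    n = len(pattern)
    hits = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(test(tok) for test, tok in zip(pattern, window)):
            hits.append([word for word, _, _ in window])
    return hits

# Assumed input format: (word, Penn Treebank tag, lemma), typed in by hand here.
sentence = [
    ("the", "DT", "the"),
    ("movie", "NN", "movie"),
    ("is", "VBZ", "be"),
    ("nice", "JJ", "nice"),
]

# Rough analogues of [{tag:/N.*/}] [{lemma:be}] [{tag:JJ}] and [{tag:JJ}] [{tag:/N.*/}]
noun_be_adj = [tag(r"N.*"), lemma("be"), tag(r"JJ")]
adj_noun = [tag(r"JJ"), tag(r"N.*")]

print(find_matches(sentence, noun_be_adj))  # [['movie', 'is', 'nice']]
print(find_matches(sentence, adj_noun))     # []
```

This only handles fixed-length patterns; quantifiers, captures, and the rest of the TokensRegex syntax are what the real tool adds.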

Gabor Angeli
  • Excellent. I was looking at TokensRegex at the same time you answered. I thought NP was a noun phrase tag, but indeed it does not exist :) Thanks for the clarification, and thanks for the lemma trick! Testing this right away. – azpublic Apr 08 '15 at 09:27
  • Shouldn't 'tag' be 'pos' in your answer? I was looking at this paper, http://nlp.stanford.edu/pubs/tokensregex-tr-2014.pdf, where they mention pos:"NNP" --> token POS is abc – azpublic Apr 08 '15 at 09:33
  • Also, as a side question: how would I capture noun phrases the same way I am capturing nouns (NN)? For example, in "the tomato salad was wonderful", how can I capture "tomato salad" + "was" + "wonderful" and not just "salad" + "was" + "wonderful"? Thanks a lot! – azpublic Apr 08 '15 at 10:13
  • There's a chance "pos" is aliased to "tag" -- a historic property of CoreNLP is that "tag" is taken to mean POS tag (I guess at one point it was the only tag). I don't think there's a simple way of capturing noun phrases that doesn't require a constituency parse, but a decent heuristic is to look for a pattern like [{tag:/[NJDC].*/}]* [{tag:/N.*/}]. This will take a (potentially empty) sequence of nouns, adjectives, determiners, or numbers; followed by a noun at the end. – Gabor Angeli Apr 14 '15 at 21:54
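
The noun-phrase heuristic from the last comment can be tried out on a hand-tagged sentence with a short Python sketch. This is an illustration under assumptions, not TokensRegex itself: it joins the tags into one string and runs an ordinary regular expression over it, and the function name `noun_phrase_candidates` and the (word, tag) input format are hypothetical.

```python
import re

# Zero or more tags starting with N, J, D or C, followed by one tag starting
# with N. Every tag is followed by a single space; the lookbehind keeps a
# match from starting in the middle of a tag (e.g. at the "N" in "VBN").
NP_PATTERN = re.compile(r"(?<!\S)(?:[NJDC]\S* )*N\S* ")

def noun_phrase_candidates(tagged):
    """Return candidate noun phrases from a list of (word, tag) pairs."""
    tags = "".join(tag + " " for _, tag in tagged)
    phrases = []
    for m in NP_PATTERN.finditer(tags):
        first = tags[:m.start()].count(" ")   # index of the first matched token
        length = m.group().count(" ")         # number of matched tokens
        words = [word for word, _ in tagged[first:first + length]]
        phrases.append(" ".join(words))
    return phrases

# Hand-tagged example sentence (Penn Treebank tags assumed).
tagged = [
    ("the", "DT"),
    ("tomato", "NN"),
    ("salad", "NN"),
    ("was", "VBD"),
    ("wonderful", "JJ"),
]

print(noun_phrase_candidates(tagged))  # ['the tomato salad']
```

Note that the determiner is included in the match, as the heuristic allows; strip leading DT tokens afterwards if only "tomato salad" is wanted.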

One tool to consider is the Corpus Workbench: http://cwb.sourceforge.net/

aab
  • Thanks, this tool looks great, but do you know if I can use it with the Stanford POS annotations (I think it's Penn Treebank)? I have already annotated the corpus and ideally I would like to query this annotated corpus directly, without generating a new set of annotations. Do you know if this tool will let me do this? – azpublic Apr 08 '15 at 08:37
  • No, I think you'd have to convert the annotation to a different format, so the Stanford tool in Gabor's answer sounds better for your purposes. – aab Apr 13 '15 at 09:49