What tools can I use to find part-of-speech patterns?

Question

I am looking for tools to find part-of-speech (POS) patterns in a corpus of documents. I am using the Stanford NLP tools to POS-tag my documents. Now I would like to query these tagged documents for specific POS patterns, for example:

NP is JJ (e.g. the movie is nice)

or JJ NP (e.g. excellent foie gras)

Is there a tool that can do this simply and efficiently, or do I need to write my own?

Upvoted! Very useful question! – Maziyar Jun 20 '17 at 10:05

score 2 · Accepted Answer · answered Apr 08 '15 at 09:05

Stanford CoreNLP also provides TokensRegex, which matches a pattern against a list of tokens: http://nlp.stanford.edu/software/tokensregex.shtml

For example, your two patterns would be something like:

[{tag:NN}] [{word:is}] [{tag:JJ}]

[{tag:JJ}] [{tag:NN}]

(Side note: NP is not a POS tag. What you most likely want is [{tag:/N.*/}] and [{lemma:be}], which catch a broader range of cases.)
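To get a feel for what these two patterns select without running CoreNLP, here is a rough plain-Python sketch. It is not TokensRegex; it only emulates the same idea over (word, lemma, tag) triples, using the broader `N.*` and lemma `be` variants from the side note. The token list and the helper name `find_pattern` are hypothetical and exist only for illustration.

```python
import re

# Hypothetical pre-tagged tokens: (word, lemma, Penn Treebank tag).
TOKENS = [
    ("the", "the", "DT"), ("movie", "movie", "NN"), ("is", "be", "VBZ"),
    ("nice", "nice", "JJ"), ("and", "and", "CC"),
    ("excellent", "excellent", "JJ"), ("foie", "foie", "NN"),
    ("gras", "gras", "NN"),
]

FIELDS = {"word": 0, "lemma": 1, "tag": 2}

def find_pattern(tokens, pattern):
    """Return the word sequences where each token satisfies its pattern step.

    `pattern` is a list of dicts such as {"tag": r"N.*"} or {"lemma": "be"}.
    Every value is treated as a regex that must match the whole field.
    """
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(
            re.fullmatch(rx, tok[FIELDS[key]])
            for tok, step in zip(window, pattern)
            for key, rx in step.items()
        ):
            hits.append([tok[0] for tok in window])
    return hits

# Rough equivalents of [{tag:/N.*/}] [{lemma:be}] [{tag:JJ}] and [{tag:JJ}] [{tag:/N.*/}].
noun_be_adj = find_pattern(TOKENS, [{"tag": r"N.*"}, {"lemma": "be"}, {"tag": "JJ"}])
adj_noun = find_pattern(TOKENS, [{"tag": "JJ"}, {"tag": r"N.*"}])
print(noun_be_adj)
print(adj_noun)
```

On this toy input the first pattern picks out "movie is nice", while the second stops at "excellent foie", because a single `N.*` step matches only one noun token.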

answered Apr 08 '15 at 09:05

Gabor Angeli


Excellent. I was looking at TokensRegex at the same time you answered. I thought NP was NounPhrase, but indeed it does not exist :) Thanks for the clarification, and thanks for the lemma trick! Testing this right away. – azpublic Apr 08 '15 at 09:27
Shouldn't 'tag' be 'pos' in your answer? I was looking at this paper http://nlp.stanford.edu/pubs/tokensregex-tr-2014.pdf where they mention pos:"NNP" --> token POS is abc – azpublic Apr 08 '15 at 09:33
Also, as a side question, how would I capture noun phrases the same way I am capturing nouns (NN)? For example, in "the tomato salad was wonderful", how can I capture "tomato salad" + "was" + "wonderful" and not just "salad" + "was" + "wonderful"? Thanks a lot! – azpublic Apr 08 '15 at 10:13

There's a chance "pos" is aliased to "tag" -- a historic property of CoreNLP is that "tag" is taken to mean POS tag (I guess at one point it was the only tag). I don't think there's a simple way of capturing noun phrases that doesn't require a constituency parse, but a decent heuristic is to look for a pattern like [{tag:/[NJDC].*/}]* [{tag:/N.*/}]. This will take a (potentially empty) sequence of nouns, adjectives, determiners, or numbers; followed by a noun at the end. – Gabor Angeli Apr 14 '15 at 21:54
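As a rough illustration of that heuristic outside CoreNLP, the sketch below applies the same greedy rule (a run of tags matching `[NJDC].*`, ending on a tag matching `N.*`) to a hand-tagged sentence. It is plain Python rather than TokensRegex, and the sentence data and the function name `noun_phrase_spans` are hypothetical.

```python
import re

# Hypothetical tagged sentence as (word, Penn Treebank tag) pairs.
SENTENCE = [("the", "DT"), ("tomato", "NN"), ("salad", "NN"),
            ("was", "VBD"), ("wonderful", "JJ")]

MODIFIER = re.compile(r"[NJDC].*")  # tags allowed inside the phrase
NOUN = re.compile(r"N.*")           # tag required on the last token

def noun_phrase_spans(tagged):
    """Return (start, end) token spans for [NJDC].* runs that end on a noun."""
    spans = []
    i = 0
    while i < len(tagged):
        j = i
        last_noun = None
        # Greedily extend over tokens whose tag starts with N, J, D or C.
        while j < len(tagged) and MODIFIER.fullmatch(tagged[j][1]):
            if NOUN.fullmatch(tagged[j][1]):
                last_noun = j
            j += 1
        if last_noun is not None:
            # Backtrack to the last noun, as a greedy regex would.
            spans.append((i, last_noun + 1))
            i = last_noun + 1
        else:
            # No noun in this run: skip past it (always advance at least one token).
            i = max(j, i + 1)
    return spans

phrases = [" ".join(word for word, _ in SENTENCE[s:e])
           for s, e in noun_phrase_spans(SENTENCE)]
print(phrases)
```

On the example sentence this yields the single phrase "the tomato salad"; the determiner is included because `D` is in the character class. With Penn Treebank tags, `C.*` matches `CC` (coordinating conjunctions) as well as `CD` (cardinal numbers), so phrases joined by "and" may be merged.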

score 1 · Answer 2 · answered Apr 07 '15 at 18:16

One tool to consider is the Corpus Workbench: http://cwb.sourceforge.net/

answered Apr 07 '15 at 18:16

aab


Thanks, this tool looks great, but do you know if I can use it with the Stanford POS annotations (I think it's Penn Treebank)? I have already annotated the corpus, and ideally I would like to query this annotated corpus directly without generating a new set of annotations. Do you know if this tool will let me do this? – azpublic Apr 08 '15 at 08:37
No, I think you'd have to convert the annotation to a different format, so the Stanford tool in Gabor's answer sounds better for your purposes. – aab Apr 13 '15 at 09:49
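For a rough idea of what such a conversion might involve, the sketch below assumes the tagged corpus is stored as space-separated `word_TAG` tokens, one sentence per line (an assumption about how the Stanford tagger output was saved), and rewrites it as one token per line with a tab between word and tag, wrapping each sentence in `<s>` tags. That is the general "vertical" layout the Corpus Workbench encoder reads; the function name `to_vertical` is hypothetical, and the exact attribute setup should be checked against the CWB documentation.

```python
def to_vertical(tagged_lines):
    """Convert 'word_TAG word_TAG ...' lines to one 'word<TAB>TAG' per line."""
    out = []
    for line in tagged_lines:
        tokens = line.split()
        if not tokens:
            continue  # skip empty lines
        out.append("<s>")
        for token in tokens:
            # Split on the last underscore so words containing '_' survive.
            word, sep, tag = token.rpartition("_")
            if not sep:
                continue  # assumption: tokens without a tag are dropped
            out.append(f"{word}\t{tag}")
        out.append("</s>")
    return "\n".join(out) + "\n"

vertical = to_vertical(["the_DT movie_NN is_VBZ nice_JJ"])
print(vertical, end="")
```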
