I'm trying to make a regex that will match the _TAG (_DT, _NN, etc) part, in the order they appear, of all of the following variations of a sentence:
Her_PP$|DT noun_NN|NNS a_PP$|DT noun_NN|NNS
Her_PP$|DT adj_JJ noun_NN|NNS a_PP$|DT noun_NN|NNS
Her_PP$|DT noun_NN|NNS a_PP$|DT adj_JJ noun_NN|NNS
Her_PP$|DT adj_JJ noun_NN|NNS a_PP$|DT adj_JJ noun_NN|NNS
This is the way the original text appears, and I am simply trying to highlight matches. The vertical bar | does mean "or" in context, so please include it in the regex like a normal "or."
As you can see, the basic skeleton of each of these is Her_PP$|DT noun_NN|NNS a_PP$|DT noun_NN|NNS, with some _JJ pieces scattered throughout. I want one regex to catch all of these, but I cannot seem to make one with optional strings that work.
_PP$|DT is not always followed by _JJ, so I wanted to set _JJ as optional, but it's finding it every time and never finding _PP$|DT _NN or _PP$|DT _JJ _NN.
Here is my regex:
(\w+_(?:PP\$|DT)(?:\w+_JJ)(\w+__(?:NN|NNS)))
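For comparison, here is a minimal sketch of how the adjective group could be made optional. It rests on assumptions not stated above: that Python's `re` module is the engine, that tokens are separated by whitespace, and that each word in the real text carries a single tag (so "PP$|DT" in the examples means "either PP$ or DT"). Relative to the pattern above, it adds a `?` after the adjective group, matches the whitespace between tokens, and uses a single underscore before the noun tag. The names `PHRASE` and `tag_sequences` are made up for illustration.

```python
import re

# Assumption: tokens are whitespace-separated and each word has exactly one
# tag, e.g. "Her_PP$ big_JJ dog_NN". Literal text such as "Her_PP$|DT" would
# NOT match this pattern.
PHRASE = re.compile(
    r"\w+_(PP\$|DT)\s+"    # possessive pronoun or determiner, then whitespace
    r"(?:\w+_(JJ)\s+)?"    # optional adjective, then whitespace
    r"\w+_(NNS|NN)\b"      # plural or singular noun
)

def tag_sequences(text):
    """Return the tags of each matched phrase in order, omitting an absent JJ."""
    return [[tag for tag in m.groups() if tag] for m in PHRASE.finditer(text)]

sample = "Her_PP$ big_JJ dog_NN a_DT cat_NN"
print(tag_sequences(sample))
# prints [['PP$', 'JJ', 'NN'], ['DT', 'NN']]
```

The trailing `?` applies to the whole non-capturing group, so the adjective and the whitespace after it are skipped together when no _JJ token is present.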
For those who care to know, _PP$ etc. are part-of-speech tags that are appended to the ends of words (for example, NN means "noun", so you might see "dog_NN").
I apologize that I'm an absolute beginner at this, so please be patient! :)