
I'm trying to make a regex that will match the _TAG part (_DT, _NN, etc.), in the order the tags appear, of all of the following variations of a sentence:

Her_PP$|DT noun_NN|NNS a_PP$|DT noun_NN|NNS
Her_PP$|DT adj_JJ noun_NN|NNS a_PP$|DT noun_NN|NNS
Her_PP$|DT noun_NN|NNS a_PP$|DT adj_JJ noun_NN|NNS
Her_PP$|DT adj_JJ noun_NN|NNS a_PP$|DT adj_JJ noun_NN|NNS

This is the way the original text appears, and I am simply trying to highlight matches. The vertical bar | does mean "or" in context, so please include it in the regex like a normal "or."

As you can see, the basic skeleton of each of these is Her_PP$|DT noun_NN|NNS a_PP$|DT noun_NN|NNS, with some _JJ pieces scattered throughout. I want one regex to catch all of these, but I cannot seem to make one with optional strings that work.

_PP$|DT is not always followed by _JJ, so I wanted to set _JJ as optional, but it's finding it every time and never finding _PP$|DT _NN or _PP$|DT _JJ _NN. Here is my string:

(\w+_(?:PP\$|DT)(?:\w+_JJ)(\w+__(?:NN|NNS)))   

For those who care to know, _PP$ and the rest are part-of-speech tags appended to the ends of words (for example, NN means "noun", so you might see "dog_NN").

I apologize that I'm an absolute beginner at this, so please be patient! :)

Toto
  • A non-capturing group is not *optional*. You have to add a `?` for that. – Biffen Apr 27 '15 at 14:54
  • Add that where? I've tried adding a ? after, etc and it still doesn't ignore it. – Carrie Ott Apr 27 '15 at 14:55
  • Please format your post. – Casimir et Hippolyte Apr 27 '15 at 14:56
  • Thank you so much whoever formatted my post to look nicer! – Carrie Ott Apr 27 '15 at 15:06
  • Are you looking for the literal string `_PP$|DT`? Because the `|` has meaning in a regex and you would need to escape it. – dawg Apr 27 '15 at 15:29
  • No, I'm looking for either PP$ or DT. I've been working away even since asking this question, and I've got `(\w+_(PP\$|DT) (.*?\w+_JJ|.*?) (\w+_(NN|NNS))` now, which seems to do the optional thing that I wanted. Now it's just not letting me put things after the NNS set to continue with more strings (I want to find another set of PP$|DT afterward). If I do that, it makes my _JJ string not optional again. – Carrie Ott Apr 27 '15 at 15:31
  • I am still confused: Does the target text actually have `Her_PP$|DT` and you want the capture group to have that OR does the target text have one or the other of `Her_PP$` OR `Her_DT`? – dawg Apr 27 '15 at 16:14
  • So sorry to be confusing! The target group might have Her_PP$ or An_DT, but no matter what comes before the underscore, what follows it in that spot in the sentence will always be _PP$ or _DT. There is no way to know what will be before the _ though. – Carrie Ott Apr 27 '15 at 16:22
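
Biffen's point about optional groups can be sketched in a few lines. This is only an illustration in Python's `re` module, not AntConc's engine, and the sample phrases and the names `required_jj` and `optional_jj` are made up for the example. It also assumes the words are separated by a single space and the tag follows a single underscore.

```python
import re

# Hypothetical sample phrases in the word_TAG format described in the question.
with_adj = "Her_PP$ big_JJ dog_NN"
without_adj = "Her_PP$ dog_NN"

# (?:...) only switches off capturing; the adjective is still required here.
required_jj = re.compile(r"\w+_(?:PP\$|DT) (?:\w+_JJ )\w+_(?:NNS|NN)")

# The trailing ? on the group is what makes the adjective optional.
optional_jj = re.compile(r"\w+_(?:PP\$|DT) (?:\w+_JJ )?\w+_(?:NNS|NN)")

results = {
    "required/with": bool(required_jj.fullmatch(with_adj)),
    "required/without": bool(required_jj.fullmatch(without_adj)),
    "optional/with": bool(optional_jj.fullmatch(with_adj)),
    "optional/without": bool(optional_jj.fullmatch(without_adj)),
}
print(results)
```

Only the "required/without" entry should come out false: without the `?`, a phrase with no adjective cannot match.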

3 Answers


`(?:xyz)` means "match `xyz` but don't capture it"; it does not make the group optional.

If you want to make something optional, you have to add `?` after that group.

In your case, the regex would be:

((\w+_(PP\$|DT)(\s\w+_JJ)*?(\s\w+_(NN|NNS))\s?){2})
karthik manchala
  • Thanks! But when I use this, it only finds PP$|DT followed by JJ OR PP$|DT followed by NN|NNS, and never PP$|DT followed by JJ followed by NN|NNS. Looking back, I see that I never specified that I wanted to do that in my question. Can you help me with that? (I'll go back and add it to the question). So basically I want it to find all phrases where the JJ is there AND all phrases where it's not at the same time. Sorry for the confusion. – Carrie Ott Apr 27 '15 at 15:15
  • Wait, I think I've got something for finding a full phrase (one more set of PP$|DT and NN|NNS with optional JJ after your regex ends, which would be the most ideal version of this regex). How's `/(\w+_(PP\$|DT)\s*(\w+_JJ)?\s*(\w+_(NN|NNS))\s*(\w+_(PP\$|DT)\s*))\s*(\w+_JJ)?\s*(\w+_(NN|NNS))?/g` looking? It seems to work on my end when I try it in programs like New Fiddle, but when I input it into the actual program I'm working with (Antconc), it says there are no hits when I know that there are. What do you think could be making this happen? – Carrie Ott Apr 27 '15 at 19:52
  • Thanks! Is there an alternative to the \s*? I think those are what my program doesn't like and that's why it's saying there are no hits. Any ideas? – Carrie Ott Apr 29 '15 at 11:01
  • Oh duh. :) Thanks! One more thing and then I promise I'll leave you alone! Is there a way to tell this regex to change the optional JJ to "an unknown number of optional JJ but as few as possible"? I used to do something like that with something like this in an old (and now obsolete) regex: `[\w-]+_(?:[\w-]+\W+)*?\W+` – Carrie Ott Apr 29 '15 at 14:17
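
For the last two comments, here is a rough sketch that runs the answer's regex over the four variations from the question, plus one phrase with two adjectives. It uses Python's `re` module, so it says nothing certain about how AntConc behaves. The sample sentences, the helper `first_match`, and the variant `space_re` (literal spaces instead of `\s`, and `NNS?` instead of `NN|NNS`) are assumptions made for illustration.

```python
import re

# The regex from the answer above, unchanged.
answer_re = re.compile(r"((\w+_(PP\$|DT)(\s\w+_JJ)*?(\s\w+_(NN|NNS))\s?){2})")

# Hypothetical variant: literal spaces instead of \s, non-capturing groups,
# and NNS? so that a plural tag is matched in full.
space_re = re.compile(r"(?:\w+_(?:PP\$|DT)(?: \w+_JJ)*? \w+_NNS? ?){2}")

# Made-up sentences in the her_PP$ adj_JJ noun_NN format from the comments.
samples = [
    "Her_PP$ dog_NN a_DT bone_NN",
    "Her_PP$ big_JJ dog_NN a_DT bone_NN",
    "Her_PP$ dog_NN a_DT small_JJ bone_NN",
    "Her_PP$ big_JJ dog_NN a_DT small_JJ bone_NN",
    "Her_PP$ big_JJ old_JJ dog_NN a_DT bones_NNS",
]


def first_match(pattern, text):
    """Return the text of the first match, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


answer_hits = [first_match(answer_re, s) for s in samples]
space_hits = [first_match(space_re, s) for s in samples]

for sample, a, s in zip(samples, answer_hits, space_hits):
    print(sample, "|", a, "|", s)
```

The `*?` after the `_JJ` group already means "zero or more adjectives, as few as possible", so the two-adjective phrase is found by both patterns. One difference shows up on the last sample: because `NN` is tried before `NNS`, the answer's regex stops at `bones_NN` and leaves the final `S` out of the highlight, while `NNS?` (or writing the alternation as `NNS|NN`) takes the whole tag.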

You can use lookaheads to test the various conditions:

^(?=.*_PP\$|DT)(?=(?:.*_JJ)?)

^                               start of string
    ^^^^                        First condition
                  ^^^^^         Optional second condition

Then capture everything up to _NN:

^(?=_PP\$\|DT)(?=(?:.*_JJ)?)(.*_NN)

Demo
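
Two details of the lookahead approach are worth checking before relying on it. The sketch below uses Python's `re` module with made-up sample strings; the names `optional_lookahead`, `ungrouped`, and `grouped` are illustrative and not part of the answer. It shows that a lookahead whose entire content is optional can never fail, and that an ungrouped `|` inside the lookahead splits it into `.*_PP\$` or a bare `DT`.

```python
import re

# A lookahead whose whole content is optional can match the empty string,
# so it succeeds on any input, with or without a _JJ tag.
optional_lookahead = re.compile(r"^(?=(?:.*_JJ)?)")

jj_samples = ["her_PP$ adj_JJ noun_NN", "her_PP$ noun_NN", ""]
always = [bool(optional_lookahead.search(s)) for s in jj_samples]

# Ungrouped: the alternatives are ".*_PP\$" and "DT" (at the very start).
ungrouped = re.compile(r"^(?=.*_PP\$|DT)")
# Grouped: the alternatives are "_PP$" and "_DT" anywhere on the line.
grouped = re.compile(r"^(?=.*_(?:PP\$|DT))")

dt_phrase = "a_DT noun_NN"
ungrouped_hit = bool(ungrouped.search(dt_phrase))
grouped_hit = bool(grouped.search(dt_phrase))

print(always, ungrouped_hit, grouped_hit)
```

In this sketch the optional lookahead accepts all three strings, including the empty one, so it does not filter anything. The ungrouped version misses the `a_DT` phrase, while the grouped version finds it.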

dawg
  • So say I was trying to create one string to catch all of the following: _PP$|DT _NN|NNS _PP$|DT _NN|NNS; or _PP$|DT _JJ _NN|NNS _PP$|DT _NN|NNS; or _PP$|DT _NN|NNS _PP$|DT _JJ _NN|NNS; or _PP$|DT _JJ _NN|NNS _PP$|DT _JJ _NN|NNS; Do you see what I'm trying to do here? I'm trying to capture all of those possibilities with one regex. – Carrie Ott Apr 27 '15 at 15:38
  • Are those on separate lines or delimited by `;` as you have them? – dawg Apr 27 '15 at 15:40
  • Yes, there are four different ones that I tried to separate for you using a semicolon! Sorry, I wanted to put them each on their own line, but comments wouldn't let me. And for actual text, do you mean her_PP$ adj_JJ noun_NN a_DT noun_NN? I can write out an example of each situation if that's what you need. – Carrie Ott Apr 27 '15 at 15:41
  • Yes, update your question and do not use regex metacharacters in the examples unless they are actually in the target text. – dawg Apr 27 '15 at 15:44
  • her_PP$ adj_JJ noun_NN a_DT noun_NN is an exact example of what my text looks like in the document I'm searching. That's what you want me to add to my question? I'm sorry, I feel stupid for having to clarify even really basic things. – Carrie Ott Apr 27 '15 at 15:46
  • Don't feel stupid! It is just difficult to understand what is actual target characters vs what you are trying to convey. Some good examples will clear it up. – dawg Apr 27 '15 at 15:48
  • Thanks! :) I'll try to make that clearer in my question. Give me a sec and I'll change it. Then I'll be working on this again at around 3:30EST today, so if you leave an answer, I'll get it then! – Carrie Ott Apr 27 '15 at 15:49

Your regex isn't so bad; just escape the pipe `|`, because it's a special character in a regex:

(\w+_(?:PP\$\|DT)(?:\w+_JJ)(\w+__(?:NN\|NNS)))
//   here __^              and here __^
Toto
  • `This is the way the original text appears, and I am simply trying to highlight matches. The vertical bar | does mean "or" in context, so please include it in the regex like a normal "or."`.. According to OP – karthik manchala Apr 27 '15 at 19:10
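
The difference the comment above points at can be sketched as follows, again in Python's `re` module and with sample strings taken from the thread. The names `literal_pipe` and `alternation` are made up for the example. An escaped `\|` matches an actual `|` character in the text, while a bare `|` means "or".

```python
import re

literal_pipe = re.compile(r"_(?:PP\$\|DT)")  # matches the literal text "_PP$|DT"
alternation = re.compile(r"_(?:PP\$|DT)")    # matches "_PP$" or "_DT"

# Text as the asker describes the real corpus in the comments.
corpus_style = "her_PP$ adj_JJ noun_NN a_DT noun_NN"
# Text as written in the question's shorthand notation.
notation_style = "Her_PP$|DT noun_NN|NNS"

literal_on_corpus = literal_pipe.findall(corpus_style)
alternation_on_corpus = alternation.findall(corpus_style)
literal_on_notation = literal_pipe.findall(notation_style)

print(literal_on_corpus, alternation_on_corpus, literal_on_notation)
```

On the corpus-style text only the unescaped alternation finds the two tags. The escaped version finds a match only when the `|` is literally present, as in the question's shorthand.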