As always, I'm the worst regex maker in the world. But this time I really tried.

So my goal is to make one regex that handles search-related input. Search queries might look like this:

  • stack overflow
  • "stackoverflow"
  • title="stack overflow"
  • type:image title=stack overflow
  • stackoverflow type:image
  • status:closed type:image title:stack overflow

But it should be able to detect each part separately, and it should detect quotation marks for an exact match. Only the title has to have the search text after it; the other conditions can appear in any order.

And now I'm stuck. I managed to write the regex below, but it only works for status:closed type:image title:stack overflow. The dots between the () groups are what make it work. If I replace them with |, only the first part matches. Getting this to work with all possible query formats seems impossible to me.

/(?:(?:status[:](closed|open)).(?:type[:](image|video)).(?:(?:title|author|actor|movie)[:](.+)))/i

Here is the tool where I tried all of this: http://regexr.com/39an1 (my scribbles are in there too).

This is for a search-engine type of thing, so I hope the match result is easy to use inside PHP. Also, I think others could benefit from this if it had a solution.

If someone could point me in the right direction, at least on the dots vs. | between the main () groups, that would help. It feels like | means OR, but I want something like an and/or.

Unihedron
Kalle H. Väravas
  • Please define "stuck" – Bohemian Aug 13 '14 at 23:24
  • @Bohemian I edited it a little. Does it match the site's format better? I'm very tired at the moment; I could rewrite it better tomorrow, but I need to finish the script before that :/ The title is bad too, but I didn't know how else to put it. Sorry, I'm just very tired and on a deadline. – Kalle H. Väravas Aug 13 '14 at 23:40
  • You're using axis notation, so why not just use a [Lucene engine](http://lucene.apache.org)? – bishop Aug 13 '14 at 23:47
  • @bishop Thanks for the resource. But this project is just a small PHP search function, and everything else is done. I just need the regex part, and that's where I'm as useful as a wet potato. – Kalle H. Väravas Aug 13 '14 at 23:49
  • Well "getting this to work for *all* possible formats" (emphasis mine) may be so complicated as to be a nightmare, or maybe even impossible. Regex isn't a parser, which is what you want here "for all possible formats". You might be able to do it in several passes, where you cut out known stable items (like "type:image"), then default whatever is left to the title. But doing it in *one* pass for *all* possible... I don't know. – bishop Aug 13 '14 at 23:53
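The several-pass idea from the last comment could look roughly like the sketch below. It is written in Python for illustration only (the same steps map onto PHP's preg_match and preg_replace). The function name parse_query, the FILTER_KEYS tuple, and the "exact" flag are made up for this example and are not from the question.

```python
import re

# Hypothetical set of filter keys; adjust to the fields your search supports.
FILTER_KEYS = ("status", "type")

def parse_query(query):
    """Split a search query into known key:value filters and a free-text title."""
    result = {"exact": False}
    rest = query
    # Pass 1: cut out each known filter, e.g. "type:image" or "status=closed".
    for key in FILTER_KEYS:
        match = re.search(r"\b" + key + r"[:=](\S+)", rest, re.IGNORECASE)
        if match:
            result[key] = match.group(1)
            rest = rest[:match.start()] + " " + rest[match.end():]
    # Pass 2: drop an optional "title:" / "title=" prefix from what is left.
    rest = re.sub(r"\btitle[:=]\s*", "", rest, flags=re.IGNORECASE).strip()
    # Pass 3: surrounding double quotes request an exact match.
    quoted = re.fullmatch(r'"([^"]+)"', rest)
    if quoted:
        result["exact"] = True
        rest = quoted.group(1)
    # Collapse the whitespace left behind by the removed filters.
    result["title"] = " ".join(rest.split())
    return result

print(parse_query("status:closed type:image title:stack overflow"))
print(parse_query('title="stack overflow"'))
```

Because the filters are removed first, their order in the query does not matter. One known limitation of this sketch: a filter-like token inside a quoted title (for example title="type:image") would still be cut out in pass 1.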

1 Answer

See the regex:

/^(?=.*status[:=](\S+)|)(?=.*type[:=](\S+)|)(?:.*?title[:=])?(?|"([^"]+)"|((?:(?!\s?(?:type|status)).)+))[^"]*$/

You can extract the information with capturing groups.

Here is a regex demo!
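As a rough illustration of reading the capturing groups, here is a Python adaptation of the regex. Python's re module has no branch reset group, so this sketch gives the quoted and unquoted title alternatives separate groups (3 and 4) rather than one shared group. In PHP, the original pattern should work unchanged with preg_match, since PCRE supports (?|...). The name extract and the returned dictionary keys are made up for this example.

```python
import re

# Groups: 1 = status, 2 = type, 3 = quoted title, 4 = unquoted title.
# The branch reset (?|...) of the original is replaced by a plain (?:...).
QUERY_RE = re.compile(
    r'^(?=.*status[:=](\S+)|)'
    r'(?=.*type[:=](\S+)|)'
    r'(?:.*?title[:=])?'
    r'(?:"([^"]+)"|((?:(?!\s?(?:type|status)).)+))'
    r'[^"]*$'
)

def extract(query):
    """Return the captured parts of a query, or None if it does not match."""
    m = QUERY_RE.match(query)
    if m is None:
        return None
    status, type_, quoted, unquoted = m.groups()
    return {
        "status": status,   # None when no status filter is present
        "type": type_,      # None when no type filter is present
        "title": quoted if quoted is not None else unquoted,
        "exact": quoted is not None,
    }

print(extract("status:closed type:image title:stack overflow"))
print(extract('title="stack overflow"'))
```

Groups that did not take part in the match come back as None in Python. With PHP's preg_match they are typically empty strings or missing from the result array, so check for that before using them.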

Expression explanation:

  • ^ Asserts position at start of string.
  • (?= Positive lookahead - Asserts that the following can be matched from the current position, without consuming any characters:
    • .* Any run of characters, then:
    • status[:=] Character sequence "status", followed by ":" or "=".
    • (\S+) Capturing Group - Next non-whitespace sequence.

To allow optional whitespace after the separator, so that both status: false and status:false are accepted, change this group and the matching type group below to (\s?\S+). The captured value will then include the leading space, if there is one.

    • | OR
    • Nothing. The empty alternative lets the lookahead succeed when "status" is absent; the group then simply captures nothing.
  • )
  • (?=.*type[:=](\S+)|) The same construct as the group above, this time for "type".
  • (?:.*?title[:=])? Optional match: Try to match "title" followed by ":" or "=" anywhere within this string. If it's present, move the pointer to just after it; otherwise skip this group and leave the pointer at the start of the string.
  • (?| Branch reset - Use the same capturing group IDs for the following alternations:
    • "([^"]+)" If our pointer location matches a quote, attempt to match everything within it up to the next quote. Capturing Group: This captures everything within them and finishes the branch reset group.
    • | OR
    • ( Opens a Capturing Group.
      • (?: A group.
        • (?! Negative lookahead - Asserts that the following is not:
          • \s?(?:type|status)) An optional whitespace followed by "type" or "status".
          • . Then, match a character.
        • )+ Repeat until there's no more.
    • )) Closes the capturing group and the branch reset group.

(Theoretically, the following elements are redundant.)

  • [^"]* Eats the rest of the line. It doesn't really matter at this point.
  • $ Asserts position at end of string.

The \n in the demo is only there because the demo input spans multiple lines. You won't need it in actual use.
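The two less common building blocks from the explanation, the lookahead with an empty alternative and the character-by-character match guarded by a negative lookahead, can be tried on their own. This is a small Python sketch for experimenting. The variable names and sample strings are made up, and the patterns are fragments of the full regex, not a replacement for it.

```python
import re

# Optional lookahead: the empty alternative after "|" lets the assertion
# succeed even when "status" is absent; the group is then simply unset.
optional_status = re.compile(r'^(?=.*status[:=](\S+)|)')

with_status = optional_status.match("title:foo status:open").group(1)
without_status = optional_status.match("title:foo").group(1)
print(with_status)     # open
print(without_status)  # None

# Guarded dot: consume one character at a time, but stop as soon as the
# text ahead is optional whitespace followed by "type" or "status".
guarded = re.compile(r'(?:(?!\s?(?:type|status)).)+')

title_part = guarded.match("stack overflow type:image").group(0)
print(title_part)      # stack overflow
```

The lookaheads never consume anything, which is why status and type can appear in any order, before or after the title text.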

Unihedron