
I understand that using regex to parse HTML is frowned upon, but this is the solution I want to try first.

I am trying to match

what a great sentence this is

as well as any characters or spacing that come between those words

in the following string:

<p>
  what is going on with you?
</p>
<p>
  what a great
</p>
<p>
  sentence this is
</p>
<p>
  How is your family?
</p>

The regex I am using is:

what.*a.*great.*sentence.*this.*?is

I know the .*? before 'is' is stopping my regex from matching up to 'How is' in the final p tag, but I cannot figure out what to put near the beginning to stop the match from starting at 'what is going on' in the first p tag.

I am viewing the output from https://regex101.com/r/kZWYR7/1 to verify that it is not working as intended.

Please help, I feel there is a crucial lesson I am missing with regex that is stopping me from figuring this out.

Expected match would be:

what a great
    </p>
    <p>
      sentence this is

EDIT: Clarifying my problem and how it is different from the duplicate

BlahMclean
  • Try [`what\W*a\W*great\W*sentence\W*this\W*is`](https://regex101.com/r/X0zG6c/1). What kind of chars do you expect between the words? `\W` is any non-word char (whitespace included). If you only want to allow whitespace, use `\s` instead of `\W`. – Wiktor Stribiżew Aug 07 '18 at 19:41
  • What is the reason you're using `.*` between your words? – anubhava Aug 07 '18 at 19:42
  • You can match `what a great sentence this is` using the regex `what a great sentence this is` – that other guy Aug 07 '18 at 19:42
  • In addition to the above comments, I'd like to recommend https://regexr.com/ to build your regex patterns, as it also has a cheat sheet and will help to show the specific logic of each part of your expression – Thorin Jacobs Aug 07 '18 at 19:44
  • @anubhava I am just trying to find those consecutive words in a sentence, no matter what whitespace or other characters are between the words. I am using DOTALL with this so the `.` also catches whitespace of any kind. It's just that too much is being matched on the left side. – BlahMclean Aug 07 '18 at 19:50
  • @WiktorStribiżew Would you like to post that as an answer, so I can upvote it please? (although you might consider changing `*` to `+`). – Dawood ibn Kareem Aug 07 '18 at 19:51
  • @that other guy this is just an example sentence. the real list of words to be matched is dynamic. I am basically making a regex to search a file for specific words and get the string containing all of them, but nothing more – BlahMclean Aug 07 '18 at 19:52
  • Possible duplicate of [Stack overflow when trying to use regex in java](https://stackoverflow.com/q/51715164/5221149). *Note:* Answer also covers the question posted here. – Andreas Aug 07 '18 at 19:53
  • @WiktorStribiżew that solution worked great! – BlahMclean Aug 07 '18 at 19:55
  • So Between `What` and `a` there can be other words as well? – anubhava Aug 07 '18 at 19:55
  • @anubhava yes there can be – BlahMclean Aug 07 '18 at 19:57
  • In that case try this: `what[^.?]*?a[^.?]*?great[^.?]*?sentence[^.?]*?this[^.?]*?is` – anubhava Aug 07 '18 at 19:58
  • @BlahMclean you seem to want to go back to using a reluctant quantifier again? You can use it for the first and last, but then there could be a performance hit, of course. This question doesn't explain why it wasn't sufficiently answered in the duplicate. – Patrick Parker Aug 07 '18 at 20:03
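
For anyone who wants to try the negated-class pattern from anubhava's comment, here is a minimal sketch in Python (the question does not name a language, so Python is an assumption, as are the variable names `html`, `negated` and `negated_match`). It runs the pattern against the sample text from the question. Because `[^.?]` refuses to cross a `.` or `?`, an attempt that starts at the first `what` cannot get past the `?` in `with you?`, so the engine moves on to the second `what`.

```python
import re

# Sample text copied from the question.
html = """<p>
  what is going on with you?
</p>
<p>
  what a great
</p>
<p>
  sentence this is
</p>
<p>
  How is your family?
</p>"""

# Pattern from the comment: each gap may contain any characters
# (newlines and tags included) except '.' and '?'.
negated = re.compile(
    r"what[^.?]*?a[^.?]*?great[^.?]*?sentence[^.?]*?this[^.?]*?is"
)

negated_match = negated.search(html)
print(negated_match.group())
```

Whether `.` and `?` are the right characters to exclude depends on the real input; they are only a stand-in for "end of sentence".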

1 Answer


To match literal text in a regex, just use the text you are looking for as the pattern. Matching `what a great sentence this is` should work; there is no need for the `.*`. The `.*` after `what` allows anything at all to appear before the next word, so the match can start at the first `what` in the string and run on until the final `is`.
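
To see why the match starts too early, here is a small sketch in Python (an assumed language; the variable names are made up for illustration). It runs the pattern from the question with DOTALL, as the asker describes. A regex engine reports the leftmost position where a match is possible, and making a quantifier lazy only changes how far the match extends, not where it begins.

```python
import re

# Sample text copied from the question.
html = """<p>
  what is going on with you?
</p>
<p>
  what a great
</p>
<p>
  sentence this is
</p>
<p>
  How is your family?
</p>"""

# The asker's pattern; DOTALL lets '.' match newlines too.
original = re.compile(r"what.*a.*great.*sentence.*this.*?is", re.DOTALL)
greedy_match = original.search(html)

# The match begins at the very first "what" in the text,
# because ".*" is free to swallow everything up to "a great".
print(greedy_match.start() == html.index("what"))
print(greedy_match.group())
```

The fix is therefore to restrict what the gaps may contain, rather than to add more `?` modifiers.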

EDIT: I just read through your comments and saw that there is a possibility for whitespace between the words. In that case, @WiktorStribiżew is right: use `\W*` between each word to accommodate any number of non-word chars between words. (Thank you again, @WiktorStribiżew.)

As @Jonathan Buelow pointed out, if it is just whitespace between words, you can use `\s+` or `\s*` instead: `what\s+a\s+great\s+sentence\s+this\s+is`
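
Since the comments mention that the word list is dynamic, here is a rough sketch of building such a pattern from a list. It is in Python, which is an assumption, and `build_pattern`, `words`, `text` and `tagged` are hypothetical names, not anything from the question. Each word is passed through `re.escape` so that regex metacharacters in a word are treated literally.

```python
import re

def build_pattern(words, gap=r"\W*"):
    # Hypothetical helper: escape every word and join them with the
    # gap sub-pattern (\W* by default, \s+ for whitespace only).
    return re.compile(gap.join(re.escape(word) for word in words))

words = ["what", "a", "great", "sentence", "this", "is"]

# Made-up text where only whitespace and punctuation separate the words.
text = "what is going on with you?\nwhat a great\n   sentence, this is\nHow is your family?"

loose = build_pattern(words).search(text)                # \W* gaps
strict = build_pattern(words, gap=r"\s+").search(text)   # \s+ gaps

print(repr(loose.group()))  # starts at the second "what"
print(strict)               # None: the comma is not whitespace

# Caveat: \W does not match letters, so it cannot cross a tag such as </p>.
tagged = "what a great\n</p>\n<p>\n  sentence this is"
tagged_result = build_pattern(words).search(tagged)
print(tagged_result)        # None
```

Note the last case: if tags can appear between the words, as in the sample from the question, neither `\W*` nor `\s+` will bridge them, because the `p` in the tag name is a word character. A wider gap, such as the negated class suggested in the comments, would be needed there.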

gkgkgkgk