I am trying to use a regular expression to split a string apart while keeping the delimiters. I want to split a very large and unpredictable string on anchor tags. I am using HTML Tidy to ensure the tags are correct; however, anything could come before or after the anchor tag I wish to match.

*PRECEDING-ANYTHING*<a *ANYTHING*>*ANYTHING*</a>*PROCEDING-ANYTHING*
*PRECEDING-ANYTHING*<a *ANYTHING*>*ANYTHING*</a>*PROCEDING-ANYTHING*

where the href URL could be anything and additional attributes such as 'target' could also be anything.

I've done a lot of searching and testing and either I am doing something wrong or the other answers on Stack Overflow do not apply.

Using

$parts = preg_split($pattern, $textWithAnchors, -1, PREG_SPLIT_DELIM_CAPTURE);

I was hoping to have $parts be similar to the following.

parts[0] is equal to *PRECEDING-ANYTHING*
parts[1] is equal to <a *ANYTHING*>*ANYTHING*</a>
and so forth

It is important that the regular expression capture the entire anchor tags and everything inside.

I would very much appreciate any help. I'm asking specifically for a regular expression that will accomplish this in PHP. I am aware that there are HTML parsers; however, using regex is optimal in this situation. Maybe it will be a learning experience, though.

  • Please look at the `DOMDocument` class, it's much more hands-on and supports loading partial code. You could easily load one of your lines and let it find all `a`-tags. – Cobra_Fast Nov 14 '13 at 15:26
  • See the first answer here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – elixenide Nov 14 '13 at 15:28
  • Show some of the regexen you've tried. – Andrew Cheong Nov 14 '13 at 15:30
  • Thanks for the comments. I have done a lot of reading on those other methods, and at the moment, to the best of my knowledge, regular expressions are still the best way to complete this task. I have seen this sort of operation done successfully with PHP regular expressions before and wish to complete it via regex. I understand that this may be a dirty way of doing it, but in that case it will be a learning experience. Also, I wanted to point out that the examples I used above are not just lines; the *ANYTHING* text could be paragraphs, other HTML, chars, line breaks etc. – user2992699 Nov 14 '13 at 15:36
  • You really don't want to use regular expressions for this. But here goes: /.*(<a.*</a>).*/i – Jesper Blaase Nov 14 '13 at 15:37
  • If it isn't too much work, I would appreciate a simple answer of what regular expression pattern would match the entire anchor tag. You may all then sit back and laugh at me as I attempt to get it to work if that helps ;p – user2992699 Nov 14 '13 at 15:37
  • Thank you very much, and sorry if I came off arrogant or pushy. I truly do appreciate the help, in a few minutes I got the answer. Cheers – user2992699 Nov 14 '13 at 15:44
  • Also, for future reference, /.*(<a.*</a>).*/i should be /.*(<a.*<\/a>).*/i; otherwise an "unknown modifier" error occurs in PHP. – user2992699 Nov 14 '13 at 16:07
  • make the `a.*` an `a.*?` too – OGHaza Nov 14 '13 at 16:51

1 Answer

Using PREG_SPLIT_DELIM_CAPTURE won't help you, because that returns text captured in group 1 of the delimiter regex as a separate element, but you want the delimiters to be included with the elements.
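
As a minimal illustration of what the flag does (the comma-separated input is made up for this example and is not from the question), a capturing group in the delimiter pattern makes each matched delimiter show up as its own array element:

```php
<?php
// Hypothetical input, chosen only to show the flag's behaviour.
$input = 'a,b,c';

// The parentheses capture the delimiter; PREG_SPLIT_DELIM_CAPTURE
// then returns each captured delimiter as a separate element.
$pieces = preg_split('/(,)/', $input, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($pieces); // elements in order: 'a', ',', 'b', ',', 'c'
```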

To specify delimiters that don't consume input, use regex lookarounds.
This code does the job:

$parts= preg_split('/(?=<a)|(?<=\/a>)/', $textWithAnchors);

It splits using a lookahead for the opening tag and a lookbehind for the closing tag.
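
A quick sketch of how this behaves on a sample string. The sample text and URLs are invented for illustration, and the PREG_SPLIT_NO_EMPTY flag is an addition for this sketch (an assumption, not part of the line above) so that no empty element appears when the string starts or ends with an anchor:

```php
<?php
// Sample input; the text and URLs are hypothetical.
$textWithAnchors = 'Intro text <a href="http://example.com" target="_blank">first</a>'
    . ' middle <a href="/two">second</a> outro';

// Same lookaround pattern as above: split before "<a" and after "/a>".
// Both alternatives are zero-width, so no characters are consumed.
// PREG_SPLIT_NO_EMPTY (an assumption added for this sketch) drops empty pieces.
$parts = preg_split('/(?=<a)|(?<=\/a>)/', $textWithAnchors, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

Note that `(?=<a)` also matches before any tag whose name merely starts with "a" (for example `<abbr>`); tightening it to `(?=<a\b)` is one possible way to avoid that.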

See a live demo of this code splitting your example as required.

Bohemian