Determine if position in html string is within an anchor tag

Question

I am currently using preg_match_all() to find all words that begin with a specific prefix. For example, if the prefix is cat, then catsup would be considered a match whereas housecat would not.

Once these instances and their offsets are found, I am cycling through them and essentially encapsulating them with an anchor tag.

(Question Continued Below Code)

//Escape all non-standard characters
$preffix = sanitizePreffix($part['modlnoPreffix']);

//All Words Starting with preffix string
$pattern = "/".$preffix.'/'; 

//Find Matches
preg_match_all($pattern , $item['body'], $matches,PREG_OFFSET_CAPTURE);
$matches = array_reverse($matches[0]);

if (count($matches)>0){
    foreach ($matches as $match){
        $text = $match[0];
        $offset = (int)$match[1];
        $endOffset = $offset + strlen($text);
        $url = "/specsheet_getPreffixParts.php?m=".urlencode($text);

        //Insert ending </a> Tag                    
        $item['body'] = str_insert('</a>', $item['body'], $endOffset);

        //Insert Starting <a ...> Tag
        $item['body'] = str_insert("<a rel='".$url."' href='javascript:void(0);'>", $item['body'], $offset);
    }
}
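Note that str_insert() is not a PHP built-in, and its definition is not shown in the question. A minimal sketch of what such a helper might look like, assuming the argument order used above (string to insert, subject, byte offset) and built on the standard substr_replace():

```php
<?php
// Hypothetical helper: the real str_insert() is not shown in the question.
// Inserts $insert into $subject at byte offset $offset.
function str_insert(string $insert, string $subject, int $offset): string
{
    // A length of 0 makes substr_replace() insert without removing anything.
    return substr_replace($subject, $insert, $offset, 0);
}

echo str_insert('</a>', 'catsup and fries', 6), "\n";
```

Because each insertion shifts everything after it, the loop above walks the matches in reverse (array_reverse) so that the offsets of earlier matches stay valid.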

The one catch is that I need to check each resulting index to make sure that:

The result is not already linked like <a href='...'>catsup</a>
The result is not within the starting <a> tag itself like <a href='/part/catsup'> ... </a>

I'm sure I could easily create a function that would step backwards one character at a time searching for <a and then step forward one character at a time looking for </a>, but this seems a bit silly to me.

My question is: Is there a better way to do this? My initial instinct is to make this part of the initial search pattern used by preg_match_all - in other words ....

How would I find all words that start with 'cat' but are not located between a '<a' and a '</a>'

@Fred I noticed that as well. I am modifying someone else's (usually stellar) code. I'll do a Search + Replace later on. Thanks! :) – Dutchie432 May 22 '13 at 15:10
@STTLCU I actually came across that in my searches. I'm not really trying to "Parse HTML" - This is a basic text search that does not need to "Understand the complexities of HTML" - since I already do :) Plus, the HTML is super basic, since I am in control of it the entire time. Thanks for the info though. – Dutchie432 May 22 '13 at 15:11
Yeah, I understood, that's why I added the notice in brackets :) It always makes me laugh to read that answer, anyway :) – STT LCU May 22 '13 at 15:12

Ro Yo Mi · Answer 1 · 2013-05-22T16:06:46.437

1

Description

This will look for all words with the prefix 'cat' outside an anchor tag.

You'll need to use a case insensitive option on the regex search command.

(?<=^|<[\/]a>)[^<]*\b(cat\w*|[^<]*?\s\bcat\w*)\b


PHP example of the regex

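<?php
// Sample text with linked and unlinked words that start with 'cat'
$sourcestring = "CatSoup<a href='...'>catsup</a>catfish tag itself like <a href='/part/catsup'> ... </a>";
// The i flag makes the search case insensitive
preg_match_all('/(?<=^|<[\/]a>)[^<]*\b(cat\w*|[^<]*?\s\bcat\w*)\b/i', $sourcestring, $matches);
echo "<pre>" . print_r($matches, true);
?>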

$matches Array:
(
    [0] => Array
        (
            [0] => CatSoup
            [1] => catfish
        )

    [1] => Array
        (
            [0] => CatSoup
            [1] => catfish
        )

)

To capture the location within the string you'd use the flag PREG_OFFSET_CAPTURE, but I'm not sure how to pull that value from the array.

preg_match_all('/<a\b[^>]*>(cat\w*|[^<]*?\s\bcat\w*)/i',$sourcestring,$matches, PREG_OFFSET_CAPTURE);
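On pulling the offset out of the array: with PREG_OFFSET_CAPTURE, each entry of $matches[0] becomes a two-element array of [matched text, byte offset]. A minimal sketch of reading those pairs follows. The function name findWithOffsets and the simplified pattern /\bcat\w*/i are assumptions for illustration only, and this pattern does not exclude linked words.

```php
<?php
// Sketch: reading match text and byte offsets from PREG_OFFSET_CAPTURE results.
// findWithOffsets() is a hypothetical name, not part of the original answer.
function findWithOffsets(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);
    $found = [];
    // $matches[0] holds the full-pattern matches; each entry is [text, offset].
    foreach ($matches[0] as [$text, $offset]) {
        $found[] = ['text' => $text, 'offset' => $offset];
    }
    return $found;
}

$source = "CatSoup<a href='...'>catsup</a>catfish";
foreach (findWithOffsets('/\bcat\w*/i', $source) as $hit) {
    echo $hit['offset'], ': ', $hit['text'], "\n";
}
```

The offsets are byte positions, which is what strlen() and substr_replace() also work with, so they can be used directly for the insertion step in the question.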

Disclaimer

The inner text should really be pulled out using an HTML parsing engine first; this avoids the problematic edge cases where a regex parsing HTML text will fail. However, I see in the comments on the OP that you're in control of the HTML and that it's rather basic, so this disclaimer may not really apply.

edited May 22 '13 at 16:06

answered May 22 '13 at 15:24

Ro Yo Mi


This is really great, but if I'm understanding correctly, this returns instances that ARE linked - or ARE within the tag. Am I wrong? I hope not because I am not getting any results. `Pattern = /<a\b[^>]*>(TTM\w*|[^<]*?\s\bTTM\w*) ... none` – Dutchie432 May 22 '13 at 15:30
My intention is to find the UNLINKED items. – Dutchie432 May 22 '13 at 15:36
Oh ok... I misunderstood that bit... let's see – Ro Yo Mi May 22 '13 at 15:38
PS - what did you use to generate that nifty flowchart? – Dutchie432 May 22 '13 at 15:41
I'm using http://www.debuggex.com/ . Although it doesn't support lookbehinds like the one I just updated the answer for here. It's still handy for understanding the expression flow. There is also http://www.regexper.com/. They do a pretty good job too, but it's not real time as you're typing. – Ro Yo Mi May 22 '13 at 16:11

femtoRgon · Accepted Answer · 2013-05-22T15:44:11.837

1

I disagree that the recommendation to use a parser doesn't necessarily apply to this question. I would say it certainly does; it looks likely that you are dealing with enough structural complexity to make the regex approach infeasible.

However, assuming you actually are dealing with a basic enough subset of HTML syntax to be parsed by a regex, then I notice that in the examples given, you could just look for </a> to follow the matched string somewhere, and reject the match if it appears, which can be done with a simple enough lookahead, like:

$pattern = "/".$preffix.'(?!.*<\/a>)/';

or perhaps, to ensure the lookahead only looks at the very next tag seen,

$pattern = "/".$preffix.'(?![^<]*<\/a>)/';
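To show the second pattern end to end, here is a rough sketch that finds unlinked prefix words and wraps them, under several assumptions: linkPrefixWords is a hypothetical name, preg_quote() stands in for the question's sanitizePreffix() (which is not shown), \b and \w* are added so the whole word is matched, and substr_replace() is used in place of str_insert(). The URL and anchor markup are copied from the question.

```php
<?php
// Hypothetical wrapper around the negative-lookahead pattern.
function linkPrefixWords(string $body, string $prefix): string
{
    // preg_quote() stands in for sanitizePreffix(), which is not shown in the question.
    // The lookahead rejects a match when the next tag after it is </a>.
    $pattern = '/\b' . preg_quote($prefix, '/') . '\w*(?![^<]*<\/a>)/';

    preg_match_all($pattern, $body, $matches, PREG_OFFSET_CAPTURE);

    // Work from the last match to the first so earlier offsets stay valid.
    foreach (array_reverse($matches[0]) as [$text, $offset]) {
        $url  = '/specsheet_getPreffixParts.php?m=' . urlencode($text);
        $link = "<a rel='" . $url . "' href='javascript:void(0);'>" . $text . '</a>';
        $body = substr_replace($body, $link, $offset, strlen($text));
    }
    return $body;
}

echo linkPrefixWords("catsup <a href='/part/catsup'>catfish</a> housecat", 'cat'), "\n";
```

In this input only the leading catsup gets wrapped: the catsup inside the href and the linked catfish are both followed by </a> before any other tag, and housecat has no word boundary before cat. Running the function again on its own output should leave it unchanged for the same reason, since the new rel attribute and link text are also followed by </a>.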

edited May 22 '13 at 15:44

answered May 22 '13 at 15:38

femtoRgon


I already thought of this approach - but it fails if I have two similar prefixes.. for example if my first prefix is `TTM` and my second prefix is `TT`, the Link that has `TTM...` will also be linked when `TT` comes around, since `TT` is not found. Make sense? Keep in mind, the words BEGIN with the prefix, but may have several letters between the prefix and the `</a>` – Dutchie432 May 22 '13 at 15:44
Unless I misunderstand something, that is what the provided regex does. It will look for the `</a>` tag anywhere following the matched prefix (note the `.*` preceding `<\/a>` in the lookahead). – femtoRgon May 22 '13 at 15:46
By the way, I assume that `$preffix`, by the time it's used, looks something like `cat\w*+`? Don't really know what `sanitizePreffix` is doing, but if it was working as described before, I'd guess something like that. – femtoRgon May 22 '13 at 15:56
This ended up being the solution that worked the best. `sanitizePreffix` just escapes the special chars in the string. – Dutchie432 May 23 '13 at 08:37
Just one other thing... how would I go about this if I wanted to match the entire word - and not just words that start with `X`? – Dutchie432 May 23 '13 at 19:43
Not sure I understand what you mean. You want to match any word that is not within tags in the ways mentioned above? I guess you could replace the prefix with `\w++`. Not sure that's what you are looking for at all though. – femtoRgon May 23 '13 at 21:13
What I mean is: Rather than finding a word that starts with "TT" - what if I wanted to find all occurrences of "TT" that are full words, and not partial words? `TT` would be found, whereas `TTBM` would not. – Dutchie432 May 24 '13 at 09:09
Ah, surround it with word breaks, `\b`. Like: "/\\b".$word.'\\b(?!.*<\/a>)/' – femtoRgon May 25 '13 at 00:20
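
A small sketch of the word-break suggestion, for illustration. The name findWholeWord is hypothetical, preg_quote() again stands in for sanitizePreffix(), and the lookahead uses the `[^<]*` form from the answer rather than `.*`, because `.*` would also reject an unlinked word that merely appears earlier on the same line as some later link.

```php
<?php
// Hypothetical helper: byte offsets of whole-word, unlinked occurrences of $word.
function findWholeWord(string $body, string $word): array
{
    // \b on both sides means TT matches but TTBM does not.
    $pattern = '/\b' . preg_quote($word, '/') . '\b(?![^<]*<\/a>)/';
    preg_match_all($pattern, $body, $matches, PREG_OFFSET_CAPTURE);
    // Each entry of $matches[0] is [text, offset]; keep only the offset.
    return array_column($matches[0], 1);
}

print_r(findWholeWord("TT TTBM <a href='#'>TT</a> TT", 'TT'));
```

Here the first and last TT are reported (offsets 0 and 27), while TTBM and the linked TT are skipped.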
