
I am currently using preg_match_all() to find all words that begin with a specific prefix. For example, if the prefix is cat, then catsup would be considered a match whereas housecat would not.

Once these instances and their offsets are found, I am cycling through them and essentially encapsulating them with an anchor tag.

(Question Continued Below Code)


// Escape all non-standard characters
$preffix = sanitizePreffix($part['modlnoPreffix']);

// All words starting with the preffix string
$pattern = '/' . $preffix . '/';

// Find matches together with their offsets
preg_match_all($pattern, $item['body'], $matches, PREG_OFFSET_CAPTURE);

// Process matches last-to-first so earlier offsets stay valid after each insert
$matches = array_reverse($matches[0]);

if (count($matches) > 0) {
    foreach ($matches as $match) {
        $text = $match[0];
        $offset = (int) $match[1];
        $endOffset = $offset + strlen($text);
        $url = '/specsheet_getPreffixParts.php?m=' . urlencode($text);

        // Insert the closing </a> tag
        $item['body'] = str_insert('</a>', $item['body'], $endOffset);

        // Insert the opening <a ...> tag
        $item['body'] = str_insert("<a rel='" . $url . "' href='javascript:void(0);'>", $item['body'], $offset);
    }
}
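
Note that str_insert() is not a PHP built-in, so the snippet above presumably relies on a project helper that is not shown. A minimal sketch of what such a helper might look like, assuming the argument order used above (string to insert, target string, offset) and byte offsets as reported by PREG_OFFSET_CAPTURE:

```php
<?php
// Hypothetical helper: the real str_insert() is not shown in the question.
// Inserts $insert into $subject at byte position $offset without removing anything.
function str_insert(string $insert, string $subject, int $offset): string
{
    // A replacement length of 0 turns substr_replace() into a pure insertion.
    return substr_replace($subject, $insert, $offset, 0);
}

echo str_insert('</a>', 'catsup and mustard', 6), "\n"; // catsup</a> and mustard
```

Because both arguments are byte-based, this pairs with strlen() rather than mb_strlen() when computing the end offset.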

The one catch is that I need to check each resulting index to make sure that

  1. The result is not already linked like <a href='...'>catsup</a>
  2. The result is not within the starting <a> tag itself like <a href='/part/catsup'> ... </a>

I'm sure I could easily create a function that would step backwards one character at a time searching for <a and then step forward one character at a time looking for </a>, but this seems a bit silly to me.

My question is: is there a better way to do this? My initial instinct is to make this part of the initial search pattern used by preg_match_all(). In other words:

How would I find all words that start with 'cat' but are not located between a '<a' and a '</a>'?

Dutchie432
  • FYI: It's spelled `prefix`, save you a keystroke ;-) – Funk Forty Niner May 22 '13 at 15:07
  • @Fred I noticed that as well. I am modifying someone else's (usually stellar) code. I'll do a Search + Replace later on. Thanks! :) – Dutchie432 May 22 '13 at 15:10
  • @STTLCU I actually came across that in my searches. I'm not really trying to "Parse HTML" - This is a basic text search that does not need to "Understand the complexities of HTML" - since I already do :) Plus, the HTML is super basic, since I am in control of it the entire time. Thanks for the info though. – Dutchie432 May 22 '13 at 15:11
  • Yeah, I understood, that's why I added the notice in brackets :) It always makes me laugh to read that answer, anyway :) – STT LCU May 22 '13 at 15:12

2 Answers


Description

This will look for all words with the prefix 'cat' that sit outside an anchor tag.

You'll need to use the case-insensitive option (the i modifier) on the regex search command.

(?<=^|<[\/]a>)[^<]*\b(cat\w*|[^<]*?\s\bcat\w*)\b


PHP example of the regex

 <?php
$sourcestring="CatSoup<a href='...'>catsup</a>catfish tag itself like <a href='/part/catsup'> ... </a>";
preg_match_all('/(?<=^|<[\/]a>)[^<]*\b(cat\w*|[^<]*?\s\bcat\w*)\b/i',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>

$matches Array:
(
    [0] => Array
        (
            [0] => CatSoup
            [1] => catfish
        )

    [1] => Array
        (
            [0] => CatSoup
            [1] => catfish
        )

)

To capture each match's location within the string you'd add the PREG_OFFSET_CAPTURE flag, but I'm not sure how to pull that value from the array:

preg_match_all('/(?<=^|<[\/]a>)[^<]*\b(cat\w*|[^<]*?\s\bcat\w*)\b/i',$sourcestring,$matches, PREG_OFFSET_CAPTURE);
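
For reference, with PREG_OFFSET_CAPTURE every entry in $matches becomes a two-element array: index 0 is the matched text and index 1 is its byte offset in the subject string. A short sketch that reads both values, reusing the sample string and pattern from the PHP example above (the $found variable is only there for illustration):

```php
<?php
$sourcestring = "CatSoup<a href='...'>catsup</a>catfish tag itself like <a href='/part/catsup'> ... </a>";
$pattern = '/(?<=^|<[\/]a>)[^<]*\b(cat\w*|[^<]*?\s\bcat\w*)\b/i';

preg_match_all($pattern, $sourcestring, $matches, PREG_OFFSET_CAPTURE);

// $matches[1] holds capture group 1; each element is [matched text, byte offset].
$found = [];
foreach ($matches[1] as [$text, $offset]) {
    $found[] = [$text, $offset];
    echo $text, ' starts at byte ', $offset, "\n";
}
```

Using group 1 rather than $matches[0] matters here, since the full match can include leading text consumed by [^<]* before the word itself.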

Disclaimer

The inner text should really be pulled out using an HTML parsing engine first; this avoids the edge cases where a regex applied to HTML will fail. However, I see in the comments on the OP that you're in control of the HTML and it's rather basic, so this disclaimer may not really apply.
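
As an illustration of the parser-based route, here is a minimal sketch using PHP's DOM extension (DOMDocument and DOMXPath). The sample markup and the wrapping <div> are assumptions made for the demo. The XPath expression selects only text nodes that have no <a> ancestor, so linked words and attribute values are never examined.

```php
<?php
// Hypothetical sample input; the wrapping <div> just gives the fragment a single root.
$html = "CatSoup <a href='/part/catsup'>catsup</a> catfish";

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML('<div>' . $html . '</div>');
$xpath = new DOMXPath($doc);

// Collect prefix words from text nodes that are not inside an anchor.
$words = [];
foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
    if (preg_match_all('/\bcat\w*/i', $node->nodeValue, $m)) {
        $words = array_merge($words, $m[0]);
    }
}

print_r($words);
```

The regex now only has to find words inside plain text, which is a job it is well suited for.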

Ro Yo Mi

I disagree that the recommendation to use a parser doesn't necessarily apply to this question. I would say it certainly does; it looks likely that you are dealing with enough structural complexity to make the regex approach infeasible.

However, assuming you actually are dealing with a basic enough subset of HTML syntax to be parsed by a regex, then I notice that in the examples given, you could just look for </a> following the matched string somewhere and reject the match if it appears. That can be done with a simple lookahead, like:

$pattern = "/".$preffix.'(?!.*<\/a>)/';

or perhaps, to ensure the lookahead only looks at the very next tag seen,

$pattern = "/".$preffix.'(?![^<]*<\/a>)/';
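
A rough end-to-end sketch of how the second lookahead might be combined with the linking step from the question. Several things here are assumptions rather than part of the original code: preg_quote() stands in for sanitizePreffix(), \b and \w* are added so that whole words starting with the prefix are matched, and preg_replace_callback() replaces the manual offset bookkeeping. It also assumes the markup contains nothing but simple, non-nested anchors.

```php
<?php
// Assumption: preg_quote() stands in for the question's sanitizePreffix(), which is not shown.
$prefix = 'cat';
$body   = "CatSoup<a href='...'>catsup</a>catfish tag itself like <a href='/part/catsup'> ... </a>";

// \b anchors the prefix to the start of a word, \w* consumes the rest of the word,
// and the lookahead rejects any match whose next tag is a closing </a>.
$pattern = '/\b' . preg_quote($prefix, '/') . '\w*(?![^<]*<\/a>)/i';

$linked = preg_replace_callback($pattern, function (array $m): string {
    $url = '/specsheet_getPreffixParts.php?m=' . urlencode($m[0]);
    return "<a rel='" . $url . "' href='javascript:void(0);'>" . $m[0] . '</a>';
}, $body);

echo $linked, "\n";
```

On this sample, CatSoup and catfish get wrapped, while the already linked catsup and the catsup inside the href attribute are left alone. A prefix word inside the attribute of some other tag would still be matched, which is where the parser-based approach becomes the safer choice.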
femtoRgon