I want to put a space after each closing anchor tag so that the next word is separated from it. The problem is that some anchor tags are followed by an &nbsp; entity or by the opening of another HTML tag. In those cases we do not want to add a space, as it would break our records.

I only want to put a space after the anchor if there is no space yet and a word follows directly.

Right now I have come up with a regex, but I am not sure it does exactly what I want:

 preg_replace("/\<\/a\>([^\s<&nbsp;])/", '</a> $1', $text, -1, $count);
 print "Number of occurence in type $type = $count \n";
 $this->count += $count;

I tried to check the number of occurrences before actually saving the replaced string, but it shows a much higher count than I think is plausible.

Please help me fix this regex.

Scenarios:

<a href="blah.com">Hello</a>World // Here we need to put space between Hello and World

<a href="blah.com">Hello</a>&nbsp;World // Do not touch this

<a href="blah.com">Hello</a><b>World</b> // do not touch this

There could be many other cases that have to be ignored, but specifically we need the first scenario to be handled.

Raheel
  • There are mistakes there, but they don't explain a higher count. Could you provide a sample text (HTML) of limited size, but enough to show the problem? – trincot Aug 17 '16 at 13:50
  • Actually I do not have the actual text right now, because this is all based on our assumptions. Let's put it this way: if there is a word in any human language (it could be English or Thai) and there is no space between the closing anchor tag and that word, we have to put a space there so that they become separate words. Anything else should just be ignored. – Raheel Aug 17 '16 at 13:54
  • Just realise that the next word can be wrapped in a tag like `span` and still stick to the previous word, so excluding `<` will not always work. Secondly, in a regex character class you cannot test for strings, only for individual characters. So the test on `&nbsp;` will also exclude `n`, `b`, etc. – trincot Aug 17 '16 at 13:57
  • @trincot please see the updated part. I understand the `span` part, but for now we can skip it because we know there would be hardly any cases like that. – Raheel Aug 17 '16 at 13:59
  • What is wrong with `<\/\w+>(\w+)`? – revo Aug 17 '16 at 14:01
  • @revo this is the first regex I have ever written. Can you please post this as an answer, with a little explanation, so that I can try it? It could be helpful for me and for other folks too. Thanks – Raheel Aug 17 '16 at 14:03
  • You can use /(?<=<\/a>)(\w+)/g and replace with space$1. https://regex101.com/r/iM8eO1/1 – Shekhar Khairnar Aug 17 '16 at 14:09

3 Answers

As @trincot pointed out, [^\s<&nbsp;] does not mean "not a space and not a non-breaking space". It is a character class, and whatever is between the brackets stands for a single character only. So it means "not whitespace, not <, not &, not n, not b, ...".

You need to check whether the very next character is a word character \w, which denotes [a-zA-Z0-9_], and if so insert a space at the zero-width position asserted by a positive lookahead:
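To make the character-class point concrete, here is a minimal sketch that runs the question's pattern on a few made-up inputs (the sample strings are assumptions chosen for illustration, not data from the question). Words that happen to start with one of the letters n, b, s or p are skipped by accident:

```php
<?php
// The question's pattern: the class excludes whitespace, '<', '&',
// and also the single letters 'n', 'b', 's', 'p' and the ';' character.
$pattern = '/<\/a>([^\s<&nbsp;])/';

// Hypothetical inputs, chosen to start with letters inside/outside the class.
$samples = [
    '<a href="blah.com">Hello</a>World',   // 'W' is not in the class: a space is inserted
    '<a href="blah.com">Hello</a>people',  // 'p' is in the class: left untouched
    '<a href="blah.com">Hello</a>now',     // 'n' is in the class: left untouched
];

$results = [];
foreach ($samples as $sample) {
    $results[] = preg_replace($pattern, '</a> $1', $sample);
}

print_r($results);
```

Only the first sample gets a space; the other two come back unchanged even though they are exactly the case the question wants to fix.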

 preg_replace("~</a>\K(?=\w)~", ' ', $text, -1, $count);
 echo "Number of occurrences in type $type is $count \n";

What does this RegEx mean?

</a>    # Match closing anchor tag
\K      # Reset match
(?=\w)  # Look if next character is a word character
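As a quick check, the pattern can be run against the three scenarios from the question. This is only a sketch: the wrapper function name `spaceAfterAnchor` is hypothetical and introduced here for illustration.

```php
<?php
// Hypothetical helper wrapping the lookahead-based replacement.
function spaceAfterAnchor(string $text): string
{
    // Insert a space right after </a> only when a word character follows.
    return preg_replace('~</a>\K(?=\w)~', ' ', $text);
}

// The three scenarios listed in the question.
$cases = [
    '<a href="blah.com">Hello</a>World',         // should get a space
    '<a href="blah.com">Hello</a>&nbsp;World',   // should stay as-is
    '<a href="blah.com">Hello</a><b>World</b>',  // should stay as-is
];

foreach ($cases as $case) {
    echo spaceAfterAnchor($case), "\n";
}
```

Only the first scenario is changed, because `&` and `<` are not word characters.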

Update: Another solution to cover all HTML-problematic cases:

preg_replace("~</a>\K(?!&nbsp;)~", '&nbsp;', $text, -1, $count);

This adds a non-breaking space when there is no non-breaking space after closing anchor tag.
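It may be worth checking this variant against the question's scenarios before using it. The sketch below does that; the fourth input (an ordinary space after the anchor) is an assumption added for illustration. Note that the negative lookahead only rules out a following &nbsp;, so an opening tag or an ordinary space after the anchor also receives an entity:

```php
<?php
// Insert &nbsp; after </a> unless an &nbsp; entity already follows.
$pattern = '~</a>\K(?!&nbsp;)~';

$inputs = [
    '<a href="blah.com">Hello</a>World',         // word follows
    '<a href="blah.com">Hello</a>&nbsp;World',   // &nbsp; already present
    '<a href="blah.com">Hello</a><b>World</b>',  // opening tag follows
    '<a href="blah.com">Hello</a> World',        // ordinary space follows (hypothetical case)
];

$outputs = [];
foreach ($inputs as $input) {
    $outputs[] = preg_replace($pattern, '&nbsp;', $input);
}

print_r($outputs);
```

Whether the extra entity in the third and fourth cases is acceptable depends on the records being processed.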

revo

As you will probably find out, the regex solution will sooner or later prove insufficient. For example, it will not detect that in this HTML snippet the two words are displayed without white space between them:

<a>test</a><span>hello</span>

There are numerous other cases where a regex solution would have a hard time detecting adjacent words like that, as the rendering of HTML is not as straightforward as it may seem.

Although you have already accepted a solution, here is one that uses the DOMDocument interface available in PHP to detect where a link's text would stick to the text that follows it, even if the two are far apart in the DOM node hierarchy:

function separateAnchors($html) {
    // Define a character sequence that 
    // will certainly not occur in your document,
    // and is interpreted as literal in regular expressions:
    $magicChar = "²³²"; 
    $doc = new DOMDocument();
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($doc);
    $anchors = $xpath->query("//a");
    foreach (array_reverse(iterator_to_array($anchors)) as $anchor) {
        $parent = $anchor->parentNode;
        $origAnchor = $anchor->cloneNode(true);
        // temporarily put the special text in the anchor
        $anchor->textContent = $magicChar;
        // and then take the document's text content
        $txt = $doc->textContent;
        // If that contains the special text with a non-space following it:
        if (preg_match("/{$magicChar}\S/u", $txt)) {
            // ... then add a single space node after it, after
            // any closing parent nodes
            $elem = $anchor;
            while (!$elem->nextSibling) $elem = $elem->parentNode;
            $elem->parentNode->insertBefore($doc->createTextNode(" "), 
                                            $elem->nextSibling);
        }
        // Put original anchor back in place
        $parent->replaceChild($origAnchor, $anchor);
    }
    return $doc->saveHTML();
}

// sample data
$html = "<p><a>first link</a>&nbsp;<a>second link</a>this word is too close</p>\n
         <table><tr><td><a>table cell</a></td></tr></table><span>end</span>\n
         <span><a>link</a></span><span><a>too close</a></span>";

// inject spaces
$html = separateAnchors($html);

// Show result
echo $html;

See it run on ideone.com

trincot
  • Great. Is it faster than the regex approach? And in the output I see some discrepancy, like it places a space before `` & `` – Raheel Aug 17 '16 at 15:35
  • It might add a space before closing tags like `` and `` which are block elements and already generate a natural break (unless their default style is overridden with CSS!), but I thought it would not hurt to add the space anyway. Alternatively, the space could be inserted after the last closing tag that immediately follows it, but that would require a bit more code. And no, I don't think this will run faster than a simple regex solution, but regex for HTML parsing is known to be insufficient in general. See [this famous answer on that](http://stackoverflow.com/a/1732454/5459839). – trincot Aug 17 '16 at 15:39
  • Got your point. It's a bit of a learning curve, but good to know, thanks :) – Raheel Aug 17 '16 at 15:42
  • *regex for HTML parsing is known to be insufficient* again this is not called parsing. – revo Aug 17 '16 at 15:52
  • @RaheelKhan, I updated the code to inject the space after any closing tags, in answer to your comment on `` and ``. – trincot Aug 17 '16 at 15:53

You can use the regex /(?<=<\/a>)(\w+)/g

Meaning: find a word preceded by a closing anchor tag and replace it with a space followed by the first capture group reference ($1).

Demo and Meaning of each construct used
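In PHP the same idea might look like the sketch below. Two things are assumptions on my part rather than part of the answer: the sample text is made up, and the trailing `g` is dropped because PCRE patterns in PHP have no `g` modifier (preg_replace already replaces every match unless a limit is passed).

```php
<?php
// Hypothetical sample text: one anchor glued to a word, one followed by &nbsp;.
$text = '<a href="blah.com">Hello</a>World and <a href="x.com">more</a>&nbsp;text';

// Lookbehind: match a run of word characters only if </a> sits directly before it.
$result = preg_replace('/(?<=<\/a>)(\w+)/', ' $1', $text, -1, $count);

echo $result, "\n";
echo "Replacements: $count\n";
```

The fifth argument receives the number of replacements, which is the same counter the question uses to sanity-check the pattern before saving anything.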

Shekhar Khairnar