I'm trying to count words in a sample string like this:

<p>&nbsp;<p>hello world!</p><p>&nbsp;</p></p>

After reading the documentation, I found a function that is supposed to do exactly this, but the result is not what I expect.

This is the code I'm using:

function rip_tags($string) {

    // ----- remove HTML TAGs -----
    $string = preg_replace ('/<[^>]*>/', ' ', $string);

    // ----- remove control characters -----
    $string = str_replace("\r", '', $string);    // --- remove entirely
    $string = str_replace("\n", ' ', $string);   // --- replace with space
    $string = str_replace("\t", ' ', $string);   // --- replace with space

    // ----- remove multiple spaces -----
    $string = trim(preg_replace('/ {2,}/', ' ', $string));

    return $string; 
}
$str = '<p>&nbsp;<p>hello world!</p><p>&nbsp;</p></p>';
$str = trim(html_entity_decode($str));
$str = rip_tags($str);
$c = str_word_count($str);
echo $c;

The result should be 2, but the code returns 4. What am I missing?

dapidmini
  • I get 2: https://ideone.com/kbhfsq – Sebastian Brosch Mar 30 '20 at 08:46
  • Use an HTML-to-text conversion tool first, as described here: https://stackoverflow.com/questions/1884550/converting-html-to-plain-text-in-php-for-e-mail. After that, the rest should be easy. – mzedeler Mar 30 '20 at 08:46
  • Invalid HTML in the first place might not help. (Or I'm reading nesting where there shouldn't be any.) – Progrock Mar 30 '20 at 08:57

2 Answers

Your function is still counting the &nbsp; entities as words. You can see this by asking str_word_count() to return the words it found instead of the count:

$c = str_word_count($str,1);
var_dump($c);

If you decode the &nbsp; entities with html_entity_decode(), you will see the proper count. (html_entity_decode() just converts each HTML entity to the character it represents; for &nbsp; that is a non-breaking space.)

$string = html_entity_decode($string);
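One thing worth checking, offered as a sketch and not a confirmed diagnosis: when decoded as UTF-8, &nbsp; becomes the non-breaking space U+00A0 (bytes 0xC2 0xA0), not an ordinary space. Neither trim() nor a regex that matches only ' ' will touch it, and since str_word_count() is locale dependent, some locales may treat those bytes as letters. The explicit flags and 'UTF-8' argument below are assumptions made so the result does not depend on the server's defaults.

```php
<?php
// Decode the entity explicitly as UTF-8 (assumed encoding) and inspect the raw bytes.
$decoded = html_entity_decode('&nbsp;', ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo bin2hex($decoded), PHP_EOL;  // c2a0, the UTF-8 encoding of U+00A0
var_dump($decoded === ' ');       // bool(false): it is not an ASCII space
```

If bin2hex() on your real string shows c2a0 where you expected 20, the non-breaking spaces are surviving the clean-up steps.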

Glenn
  • That's part of the problem. I'm already using `html_entity_decode`, but somehow the `&nbsp;` is still there. – dapidmini Mar 30 '20 at 09:17
  • Ah sorry I overlooked your second part. I just copy pasted your code and it works fine on my local machine (returns 2). Try executing str_word_count($str,1) instead of str_word_count($str) and var_dump your result, so you/we know what your string actually looks like. – Glenn Mar 30 '20 at 12:02

Your mileage may vary, but try the following.

Strip the html, decode the entities, then count the words.

<?php

$html =<<<HTML
<p>&nbsp;<p>hello world!</p><p>&nbsp;</p></p>
HTML;

$stripped = strip_tags($html);
$decoded  = html_entity_decode($stripped);
echo str_word_count($decoded);

Output:

2
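If the count still comes out wrong on your machine, a variation like the one below might help. It is only a sketch: count_words_in_html() is a hypothetical helper name, and it assumes the input is UTF-8. It replaces tags with a space (so adjacent elements do not fuse into one word), decodes the entities, and then converts non-breaking spaces into ordinary spaces before counting.

```php
<?php
// Hypothetical helper; the name and the exact steps are illustrative only.
function count_words_in_html(string $html): int
{
    // Replace each tag with a space so "<p>a</p><p>b</p>" does not become "ab".
    $text = preg_replace('/<[^>]*>/', ' ', $html);

    // Decode entities as UTF-8 (assumed); &nbsp; becomes the bytes 0xC2 0xA0.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Turn non-breaking spaces into ordinary spaces before counting.
    $text = str_replace("\xC2\xA0", ' ', $text);

    return str_word_count($text);
}

echo count_words_in_html('<p>&nbsp;<p>hello world!</p><p>&nbsp;</p></p>'), PHP_EOL; // 2
```

Note that str_word_count() is locale dependent and not multibyte aware, so text with non-ASCII letters may need a different counting approach.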
Progrock