Problem trying to extract words from string in PHP

Question

I'm trying to extract all words from a string into an array, but I am having some problems with spaces (&nbsp;).

This is what I do:

//Clean data to text only
$data = strip_tags($data);
$data = htmlentities($data, ENT_QUOTES, 'UTF-8');
$data = html_entity_decode($data, ENT_QUOTES, 'UTF-8');
$data = htmlspecialchars_decode($data);
$data = mb_strtolower($data, 'UTF-8');

//Clean up text from special chrs I don't want as words
$data = str_replace(',', '', $data);
$data = str_replace('.', '', $data);
$data = str_replace(':', '', $data);
$data = str_replace(';', '', $data);
$data = str_replace('*', '', $data);
$data = str_replace('?', '', $data);
$data = str_replace('!', '', $data);
$data = str_replace('-', ' ', $data);
$data = str_replace("\n", ' ', $data);
$data = str_replace("\r", ' ', $data);
$data = str_replace("\t", ' ', $data);
$data = str_replace("\0", ' ', $data);
$data = str_replace("\x0B", ' ', $data);
$data = str_replace("&nbsp;", ' ', $data);

//Clean up duplicated spaces
do {
   $data = str_replace('  ', ' ', $data);
} while(strpos($data, '  ') !== false);

//Make array
$clean_data = explode(' ', $data);

echo "<pre>";
var_dump($clean_data);
echo "</pre>";

This outputs:

array(58) {
  [0]=>
  string(5) " "
  [1]=>
  string(5) " "
  [2]=>
  string(11) "anläggning"
  [3]=>
  string(3) "med"
  [4]=>
  string(3) "den"
  [5]=>
  string(10) "erfarenhet"
  [6]=>
  string(3) "som"
}

If I check the source of the output, I see that the first two array values are &nbsp;.
No matter what I try, I can't remove this from the string. Any ideas?

UPDATE:
After some tweaking of the code, I managed to get the following output:

array(56) {
  [0]=>
  string(1) "�" //Notice the change: instead of string length 5 it now says 1, but it's still garbage.
  [1]=>
  string(1) "�"
  [2]=>
  string(11) "anläggning"
  [3]=>
  string(3) "med"
  [4]=>
  string(3) "den"
  [5]=>
  string(10) "erfarenhet"
  [6]=>
  string(3) "som"
  [7]=>
  string(5) "finns"
  [8]=>
  string(4) "inom"

Thanks!

ANSWER (for lazy people):

Even though this is a slightly different approach to the problem, and it never really answers why I had the problems above (like leftover &nbsp; and other weird extra spaces), I like it, and it is a lot better than my original code.

Thanks to all who contributed to this!

//Clean data to text only
$data = strip_tags($data);
$data = html_entity_decode($data, ENT_QUOTES, 'UTF-8');
$data = htmlspecialchars_decode($data);
$data = mb_strtolower($data, 'UTF-8');

//Clean up text from special chrs
$data = str_replace(array("-"), ' ', $data);    

$clean_data = str_word_count($data, 1, 'äöå');

echo "<pre>";
var_dump($clean_data);
echo "</pre>";

Are you sure the actual data contains the &nbsp; (and not only the output generated by var_dump)? — JohnSmith, Dec 15 '10 at 13:19
Someone is going to post a version of this that does the same thing correctly in about 6 lines :) — thirtydot, Dec 15 '10 at 13:22
JohnSmith: Yes. I get the same if I do echo $data[0] or echo $data[1]. — lejahmie, Dec 15 '10 at 13:48
thirtydot: I would be very happy if someone did so. :) But this is harder than one might expect, especially since I am working with UTF-8 and characters such as ÅÄÖ. — lejahmie, Dec 15 '10 at 13:48

score 2 · Accepted Answer · edited May 23 '17 at 09:58

OK, the only thing you have to do is replace &nbsp; with a space, as you already do (only if the string really still contains &nbsp;; check @Andy E's answer to make sure that your data does not contain any HTML entities):

$data = str_replace("&nbsp;", ' ', $data);

Then you can use str_word_count to get the words:

$words = str_word_count($data, 1, 'äöåÄÖÅ');

P.S.: What is the point of calling htmlentities first and then reverting it with html_entity_decode anyway?

Update: Example:

$str = '      anläggning med den      erfahrenhet som åååÅ ÅÅ';
print_r(str_word_count($str, 1, 'äöåÄÖÅ'));

prints

Array
(
    [0] => anläggning
    [1] => med
    [2] => den
    [3] => erfahrenhet
    [4] => som
    [5] => åååÅ
    [6] => ÅÅ
)

Reading documentation helps :)

edited May 23 '17 at 09:58

Community

answered Dec 15 '10 at 13:24

Felix Kling

Unfortunately str_word_count doesn't work with åäö. It cuts all words as two words when it hits å, ä or ö. – lejahmie Dec 15 '10 at 13:41
You can pass in a custom `$charlist` as the third argument to `str_word_count` - "A list of additional characters which will be considered as 'word'" – thirtydot Dec 15 '10 at 13:47
@jamietelin: As @thirtydot says, pass a string with the extra characters that should be considered part of a word. It is described in the documentation I linked to. Believe me, reading documentation helps! – Felix Kling Dec 15 '10 at 13:48
@jamietelin: Depends, not every documentation is good and not every problem is addressed in documentation... but stackoverflow should not be a replacement for reading documentation. They should be consulted first imo. – Felix Kling Dec 15 '10 at 15:13
@Felix Kling: Thanks for pointing out how `str_word_count` works :) If we all read every piece of documentation, stackoverflow wouldn't be needed, I guess (that's a joke). I didn't even know when I started this that there was such a thing as `str_word_count`. Thanks a lot! – lejahmie Dec 15 '10 at 15:23
@jamietelin: No worries, I didn't mean to sound harsh... happy coding! :) – Felix Kling Dec 15 '10 at 15:33

Andy E · Answer 2 · 2010-12-15T13:34:29.973

Is it possible you're "double encoding" any existing &nbsp; parts of the string? You call htmlentities on the string before html_entity_decode, so any existing &nbsp; entities would become &amp;nbsp;. You can prevent htmlentities from double encoding by passing false as the fourth parameter.

$data = htmlentities($data, ENT_QUOTES, 'UTF-8', false);
$data = html_entity_decode($data, ENT_QUOTES, 'UTF-8');

Also, bear in mind that you can pass an array of search strings to str_replace:

$data = str_replace(array(',','.',':',';','*','?','!','-'), '', $data);
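A minimal sketch of the double-encoding effect described in this answer. The sample string is an assumption (literal &nbsp; text such as might survive strip_tags()); only the documented htmlentities and html_entity_decode signatures are used.

```php
<?php
// Assumed sample input: entity text that is still present after strip_tags().
$input = '&nbsp;anläggning';

// Default behaviour: the "&" of the existing entity is encoded a second time.
$double = htmlentities($input, ENT_QUOTES, 'UTF-8');

// Fourth argument false: entities that already exist are left untouched.
$single = htmlentities($input, ENT_QUOTES, 'UTF-8', false);

// Decoding once only undoes one level, so the double-encoded string
// still carries the literal text "&nbsp;" afterwards.
$afterDecode = html_entity_decode($double, ENT_QUOTES, 'UTF-8');

var_dump($double);      // &amp;nbsp;anl&auml;ggning
var_dump($single);      // &nbsp;anl&auml;ggning
var_dump($afterDecode); // &nbsp;anläggning
```

If the encode/decode round trip is dropped entirely, as the accepted answer suggests, the question of double encoding does not arise at all.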

Linus Kleen · Answer 3 · 2010-12-15T14:11:48.977

score 1

Instead of:

14x str_replace

do {
   $data = str_replace('  ', ' ', $data);
} while(strpos($data, '  ') !== false);

do:

$data = preg_replace('/[.*,:;?!]/', '', $data);
$data = preg_replace('/(?:\xC2\xA0|\s{2,}|-)/', ' ', $data);

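Here \xC2\xA0 is the UTF-8 byte sequence of the non-breaking space (&nbsp;, U+00A0), and \s matches any whitespace character, which covers the repeated str_replace calls. Note that \s{2,} only collapses runs of two or more whitespace characters; a single tab or newline is left as it is.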
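A short sketch of why the entity replacement stops matching once the string has been decoded. The sample data is an assumption, and the pattern here is a variation on the one above that uses the /u modifier so the non-breaking space is matched as one code point rather than as two separate bytes.

```php
<?php
// Decoding the entity yields the real U+00A0 character, not the text "&nbsp;".
$nbsp = html_entity_decode('&nbsp;', ENT_QUOTES, 'UTF-8');
$hex  = bin2hex($nbsp); // c2a0

// Assumed sample input: two leading non-breaking spaces, a double space, a hyphen.
$data = $nbsp . $nbsp . 'anläggning  med-den';

// Searching for the entity text finds nothing in the decoded string.
$unchanged = (str_replace('&nbsp;', ' ', $data) === $data);

// Collapse any run of non-breaking spaces, whitespace and hyphens into one space.
// With /u the whole code point is consumed, so no stray 0xC2 or 0xA0 byte remains.
$data  = preg_replace('/[\x{00A0}\s-]+/u', ' ', $data);
$words = explode(' ', trim($data));

var_dump($hex, $unchanged);
print_r($words);
```

Replacing only one byte of the pair would leave the other behind as an invalid UTF-8 fragment, which is one plausible source of "�" characters in var_dump output.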

edited Dec 15 '10 at 14:11

answered Dec 15 '10 at 13:25

Linus Kleen

This actually did something. But instead of a space, I get some weird character that I can't seem to post here. – lejahmie Dec 15 '10 at 13:47
Is the "weird character" a `0xC2`? – Linus Kleen Dec 15 '10 at 13:57
goreSplatter: I want to kiss you right now! I tested "your character" `$data = str_replace(array("\xC2"), ' ', $data);` and it worked!!! What the hell is that character `0xC2`?! – lejahmie Dec 15 '10 at 14:07
See my edit. `0xC2A0` is the UTF-8 encoding of the [Non-breaking space](http://en.wikipedia.org/wiki/Non-breaking_space) (U+00A0). – Linus Kleen Dec 15 '10 at 14:12

Manu · Answer 4 · 2010-12-15T14:18:48.600

1

print_r( explode(" ", $data));

Update

define("WORD_COUNT_MASK", "/\p{L}[\p{L}\p{Mn}\p{Pd}'\x{2019}]*/u");

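function str_word_count_utf8($str)
{
     // Collect every run that starts with a letter; /u makes PCRE read $str as UTF-8
     preg_match_all(WORD_COUNT_MASK, $str, $matches);
     // $matches[0] holds the full matches, i.e. the words
     return $matches[0];
}
// $str is the text to split, assumed to be defined by the caller
print_r(str_word_count_utf8($str));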
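As a usage sketch of the same idea: matching on Unicode letters means non-breaking spaces and punctuation never end up in the result, so no separate clean-up pass is needed. The helper name extract_words_utf8 and the sample text are hypothetical, and the pattern is a simplified version of the mask above (letters and combining marks only).

```php
<?php
// Hypothetical helper: returns every run of Unicode letters in $text.
function extract_words_utf8(string $text): array
{
    // \p{L} = any letter, \p{Mn} = combining mark; /u treats input as UTF-8.
    preg_match_all('/\p{L}[\p{L}\p{Mn}]*/u', $text, $matches);
    return $matches[0];
}

// Assumed sample: two leading non-breaking spaces (bytes C2 A0) plus punctuation.
$sample = "\xC2\xA0\xC2\xA0Anläggning med den, erfarenhet som finns!";
$words  = extract_words_utf8($sample);

print_r($words);
```

Unlike str_word_count with a character list, this does not need every accented letter to be listed by hand, at the cost of requiring valid UTF-8 input.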

edited Dec 15 '10 at 14:18

answered Dec 15 '10 at 13:56

Manu

Thanks for your effort, but this didn't do anything. It is exactly what I already do. – lejahmie Dec 15 '10 at 14:05

score 0 · Answer 5 · answered Dec 15 '10 at 13:23

$data = '&nbsp; cesadasdsadas <br /> &nbsp; dsadsadas';
$data = preg_replace('/&nbsp;/', ' ', $data);
var_dump($data);

answered Dec 15 '10 at 13:23

Poelinca Dorin

Doesn't make any difference :/ – lejahmie Dec 15 '10 at 13:43

score 0 · Answer 6 · answered Dec 15 '10 at 13:28

Maybe you should try this: http://php.net/manual/en/function.str-word-count.php

I made something close to your goal recently:

    $words = array_unique(str_word_count($CONTENT." ".$TITLE, 1));
    sort($words);
    $words = addslashes (implode(" ", array_values($words)));

Bye.

answered Dec 15 '10 at 13:28

Django

Unfortunately str_word_count doesn't work with åäö. It cuts all words as two words when it hits å, ä or ö. – lejahmie Dec 15 '10 at 13:39
