
I have a file of "words" that is about 5.8 MB in size and contains 560,000 words. I'm using it to extract real words from strings of words that have been joined together.

E.g. greenbananatruck could be such a string.

I wrote this function to be called at a very fast pace, but I can't get it to run faster than 0.5 s. I'm using a server with an 8-core processor and 8 GB of RAM. Actually, the CPU is not the problem; the problem is RAM. I need to be able to run this process quickly and efficiently in multiple instances at once.

public function wordSplitReal( $str ){

    // Keep only the dictionary words that occur in $str. Each match
    // is cut out of $str so its characters cannot be matched again.
    $words = array_filter( $this->dict, function ($word) use (&$str) {
        $pos = strpos( $str, $word );
        if ( $pos !== false ){
            // Remove the matched word from the remaining string.
            $str = substr_replace( $str, "", $pos, strlen($word) );
            return true;
        }
        return false;
    } );

    return $words;
}

It's very simple: what I'm actually doing is filtering the array dict down to only the words that occur in the given string. (I'm not interested in multiple occurrences of the same word.) dict is presorted from the longest word to the shortest, all in lowercase. This function is part of a bigger class that uses the singleton pattern.
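
For illustration, a hypothetical call (the $splitter variable stands in for the singleton instance; the example assumes the dictionary contains all three words):

// Hypothetical usage; $splitter stands for the singleton instance.
$words = $splitter->wordSplitReal( 'greenbananatruck' );
// Assuming 'green', 'banana' and 'truck' are all in the dictionary,
// $words now contains them, in dictionary order and with their
// original keys, since array_filter preserves both.
print_r( $words );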

Any help would be appreciated.

Martin Šajna

1 Answer

Arrays are the wrong tool for the job, since filtering one means scanning every entry, which takes time linear in the dictionary size (and, as you're finding, that is too slow for a 560,000-word dictionary). You probably want a trie; there are several PHP implementations if you search for them. (I don't have experience with any of the PHP trie libraries, so I can't recommend one.)
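
If you want to avoid a library, a trie can be sketched with plain nested arrays. This is only an illustrative implementation, not any particular library's API; buildTrie and the '$' end-of-word marker are names I made up:

// Build a trie as nested arrays: each node maps a character to its
// child node; the key '$' marks that a complete word ends here.
function buildTrie( array $dict ){
    $trie = [];
    foreach ( $dict as $word ){
        $node = &$trie;
        for ( $i = 0, $n = strlen($word); $i < $n; $i++ ){
            $ch = $word[$i];
            if ( !isset($node[$ch]) ){
                $node[$ch] = [];
            }
            $node = &$node[$ch];
        }
        $node['$'] = true;  // end of a complete word
        unset($node);       // break the reference before the next word
    }
    return $trie;
}

Build it once at startup (for example in your singleton's constructor) and reuse it; a lookup then costs time proportional to the word's length, independent of the dictionary size.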

The outline of the algorithm might be:

While the string is non-empty:
  For each prefix of str, in decreasing order of length:
    If the prefix is in the trie:
      Drop the prefix from str
      Add it to the result array
      Continue with the next iteration of the outer loop
  Return failure
Return the result array

(The algorithm is not very sophisticated, as it does not implement backtracking; left as an exercise for the reader :p )
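
A sketch of that outline in PHP 7+, using the nested-array trie above. It does greedy longest-prefix matching (the "decreasing order" loop is folded into one forward walk that remembers the longest complete word seen), and splitWords is again just an illustrative name:

function splitWords( string $str, array $trie ){
    $result = [];
    while ( $str !== '' ){
        $node = $trie;
        $longest = 0;
        // Walk the trie along the string, remembering the longest
        // prefix that is a complete dictionary word.
        for ( $i = 0, $len = strlen($str); $i < $len; $i++ ){
            $ch = $str[$i];
            if ( !isset($node[$ch]) ){
                break;              // no dictionary word continues here
            }
            $node = $node[$ch];
            if ( isset($node['$']) ){
                $longest = $i + 1;  // a complete word ends at $i
            }
        }
        if ( $longest === 0 ){
            return null;            // failure: no word starts here
        }
        $result[] = substr( $str, 0, $longest );
        $str = substr( $str, $longest );
    }
    return $result;
}

With 'green', 'banana' and 'truck' in the dictionary, splitWords('greenbananatruck', $trie) returns ['green', 'banana', 'truck']; it returns null when it gets stuck, which is where the backtracking mentioned above would come in.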

Amadan