
I want to take a long string (hundreds of thousands of characters) and compare it against an array of keywords to determine which keyword is mentioned most often.

This seems pretty easy, but I am a bit worried about `strstr` underperforming for this task.

Should I do it in a different way?

Thanks,

Or Weinberger
  • Can the array of keywords contain phrases, or will the keywords always be a single word? – Andrew Clark Apr 26 '11 at 22:36
  • Can contain phrases, can you elaborate on the difference/performance change between a single keyword and a phrase? – Or Weinberger Apr 26 '11 at 22:37
  • Read some answers and you'll discover why. – gd1 Apr 26 '11 at 22:38
  • @Andrew: yeah, you need a more complex PDA-like tool – gd1 Apr 26 '11 at 22:41
  • Be careful when using `substr_count()` or similar functions, assuming you want to match on the whole word and not partial words. E.g. If "day" is a keyword, `substr_count()` will count matches for "day", "Saturday", "hey-day" etc. – White Elephant Apr 26 '11 at 22:45
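
To illustrate the partial-match caveat from the last comment, here is a minimal sketch comparing `substr_count()` with a word-boundary regex. The sample text and keyword are made up for illustration, and the optional-`$matches` form of `preg_match_all()` assumes PHP 5.4 or later.

```php
<?php
// Hypothetical sample text and keyword, for illustration only.
$text = "Saturday was a good day; every day has its hey-day.";
$keyword = "day";

// Substring count: also counts the "day" inside "Saturday" and "hey-day".
$substringCount = substr_count($text, $keyword);

// Whole-word count: \b anchors the match at word boundaries.
// preg_quote() escapes any regex metacharacters in the keyword.
// Note: a hyphen is a non-word character, so "hey-day" still matches here.
$pattern = '/\b' . preg_quote($keyword, '/') . '\b/i';
$wholeWordCount = preg_match_all($pattern, $text);

echo "substr_count: $substringCount\n";
echo "whole words:  $wholeWordCount\n";
```

The regex drops the match inside "Saturday" but keeps the one in "hey-day"; whether hyphenated forms should count is a decision to make for your own keyword list.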

2 Answers


I think you can do it differently, with a single scan of the string, and done right it can give you a dramatic performance improvement.

Create an associative array, where keys are the keywords and values are the occurrences.

Read the string word by word: take one word at a time and put it in a variable. Then check it against the keywords (there are several ways to do this; for example, you can query the associative array with `isset`). When the word is a keyword, increment its counter.

I hope PHP implements associative arrays with some hashmap-like thingie...
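
A minimal sketch of this single-scan idea, assuming PHP 7.3+ and single-word, ASCII keywords. `countKeywords()` and the delimiter list are hypothetical choices for illustration, not built-ins; phrases would need a different approach.

```php
<?php
/**
 * Count how often each keyword occurs as a whole word in $text.
 * countKeywords() is a hypothetical helper name, not a PHP built-in.
 */
function countKeywords(string $text, array $keywords): array
{
    // Keys are the lowercased keywords, values are the running counts.
    $counts = array_fill_keys(array_map('strtolower', $keywords), 0);

    // strtok() hands back one token at a time, so no array of all words is built.
    // The delimiter set is an assumption; extend it to fit your text.
    $delimiters = " \t\r\n.,;:!?\"()[]";
    for ($word = strtok($text, $delimiters); $word !== false; $word = strtok($delimiters)) {
        $word = strtolower($word);
        if (isset($counts[$word])) {   // array key lookup, not a loop over keywords
            $counts[$word]++;
        }
    }
    return $counts;
}

$counts = countKeywords("The cat sat. The dog sat, the cat ran!", ["cat", "dog", "bird"]);
arsort($counts);                       // highest count first, keys preserved
$mostMentioned = array_key_first($counts);
echo $mostMentioned, "\n";
```

Each word costs one `isset` lookup, so the work grows with the length of the text rather than with text length times the number of keywords.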

gd1

Parse the words out in linear fashion. For each word you encounter, increment its count in the associative array of words you are looking for (skipping those you aren't interested in, of course). This will be much faster than strstr.

Marcelo Cantos
  • So get the string, `explode` it into an array using a space as the separator, then loop through the keywords array comparing each of the exploded words to the keyword? Sounds good. – Or Weinberger Apr 26 '11 at 22:39
  • Not to me. I don't like `explode` here, both for performance reasons and because words can be separated from one another in several ways, not only by spaces. – gd1 Apr 26 '11 at 22:43
  • @Giacomo - so how do you suggest to do it? – Or Weinberger Apr 26 '11 at 22:50
  • There are many ways, and Marcelo pointed out pretty much the same one I suggested. Just read a word at a time: you can feed one character at a time into a word buffer, and stop when you hit something that is not part of a word (period, space, etc.). Then do as I suggested, for example `if (isset($keywords[$this_keyword])) $keywords[$this_keyword]++`. I won't post the whole code. – gd1 Apr 26 '11 at 22:51
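
The word-buffer approach from the last comment could be sketched roughly as follows. This is only one possible reading of it: `scanKeywords()` is a hypothetical name, matching is case-sensitive, and treating only ASCII letters and digits as word characters (via `ctype_alnum()`) is an assumption.

```php
<?php
// Hypothetical helper illustrating the character-at-a-time scan.
function scanKeywords(string $text, array $keywords): array
{
    $counts = array_fill_keys($keywords, 0);
    $buffer = '';
    $length = strlen($text);

    // Loop one step past the end so the final word is flushed too.
    for ($i = 0; $i <= $length; $i++) {
        $char = $i < $length ? $text[$i] : ' ';
        if (ctype_alnum($char)) {
            $buffer .= $char;          // still inside a word
            continue;
        }
        // Non-word character: the buffered word (if any) is complete.
        if ($buffer !== '' && isset($counts[$buffer])) {
            $counts[$buffer]++;
        }
        $buffer = '';
    }
    return $counts;
}

$result = scanKeywords("red, green. red blue;red", ["red", "blue"]);
print_r($result);
```

Because any non-alphanumeric character ends a word, commas, periods and semicolons all act as separators without needing an explicit delimiter list.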