3

I need help to create a script for finding keywords in a string, and inserting them into a database for use in a tag cloud.

  • The script would obviously need to discard punctuation and other non-word characters, as well as common words like 'I', 'at', 'and', etc.
  • Get the frequency of each keyword it finds, then insert the keyword into the database if it's new, or update the existing row by adding the string's count for that keyword.
  • The string is unformatted text from a database row.

I'm not new to PHP, but I haven't attempted anything like this before, so any help is appreciated.

Thanks, Lea

Lea

5 Answers

2

Google + php keywords from text = http://www.hashbangcode.com/blog/extract-keywords-text-string-php-412.html

Dejan Marjanović
1

Well, the answer is already there, but I'll still post my code for the little work that has gone into it.

I don't think a MySQL database is ideal for storing this kind of data. I would suggest a key-value store such as MemcacheDB, so you can use the keyword itself as the key and fetch its count directly. Persisting these keywords in a high-load environment may cause problems with MySQL.

$keyWords = extractKeyWords($text);

saveWords($keyWords);

function extractKeyWords($text) {
    $result = array();

    if(preg_match_all('#([\w]+)\b#i', $text, $matches)) {
        foreach($matches[1] as $key => $match) {

            // encode found word to safely use as key in array
            $encodedKey = base64_encode(strtolower($match));

            if(wordIsValid($match)) {
                if(array_key_exists($encodedKey, $result)) {
                    $result[$encodedKey]++;
                } else {
                    $result[$encodedKey] = 1;
                }
            }
        }
    }

    return $result;
}

function wordIsValid($word) {
    $wordsToIgnore = array("to", "and", "if", "or", "by", "me", "you", "it", "as", "be", "the", "in");
    // skip single-character words and anything on the ignore list
    return strlen($word) > 1 && !in_array(strtolower($word), $wordsToIgnore);
}

// not implemented yet ;)
function saveWords($arrayOfWords) {
    foreach($arrayOfWords as $word => $count) {
        echo base64_decode($word).":".$count."\n";
    }
}
Nick Weaver
0

You could approach this with either a dictionary of keywords or a dictionary of words to ignore. If you make a dictionary of keywords, count each time one is used and then update a database table with those counts. If you make a dictionary of words to ignore, strip those words from the posts and insert or update a count for each of the remaining words in the keyword table.
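
The "words to ignore" variant could look roughly like the sketch below. It is only a starting point: the function name `countKeywords`, the `$minLength` parameter, and the sample stop-word list are made up for illustration, and it assumes mostly English/ASCII text because `strtolower` only handles ASCII letters reliably.

```php
<?php
// Hypothetical helper (not from the thread): counts keywords in free text,
// skipping stop words and very short tokens.
function countKeywords(string $text, array $stopWords, int $minLength = 2): array
{
    // Lowercase first so "Cubs" and "cubs" are counted together, then split
    // on every run of characters that is not an ASCII letter or digit.
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // Flip the list so each stop-word lookup is a cheap isset() check.
    $ignore = array_flip(array_map('strtolower', $stopWords));

    $counts = array();
    foreach ($tokens as $token) {
        if (strlen($token) < $minLength || isset($ignore[$token])) {
            continue; // too short, or on the ignore list
        }
        $counts[$token] = ($counts[$token] ?? 0) + 1;
    }

    arsort($counts); // most frequent keywords first
    return $counts;
}

// Sample stop words are an assumption; a real list would be much longer.
$counts = countKeywords(
    'I saw the Cubs at Wrigley, and the Cubs won!',
    array('i', 'at', 'and', 'the')
);
print_r($counts);
```

Flipping the ignore list into array keys is a design choice: `isset()` on a key stays fast even when the list grows to a few hundred words, whereas `in_array()` scans the whole list for every token.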

Andrew Jackman
  • Ok I see. In theory yes, but I don't know how to approach it practically. I can gather that I should create two arrays and use them as the "dictionaries".. but how do I do the counting, and ignoring? I'm new to using arrays so a practical example would help. – Lea Apr 03 '11 at 13:50
0

The way some forum software does it is by storing every word entered in every post in a table. When people search the forum, the result is the set of post IDs the words came from. I suggest something like this.

Compare a user submission with your array of blacklisted (obvious) words, which would come from a database table. The words that survive are your keywords. Enter those keywords into your database table. Then use a SELECT * statement on your table to return a result set. Use array_count_values(), as demonstrated, to get your counts.

Perhaps a better way is to do what most sites do (Stack Overflow, Delicious, etc.) and have the user enter their own keywords. That way you can skip all the parsing up front.

Jason Strimpel
  • Yes, I will make users enter their own keywords, but I am working with existing data and "upgrading" it, I guess, to add this functionality, because tags/keywords will be implemented in the "new" system. – Lea Apr 03 '11 at 14:27
-1

If the string is not too long and you won't have memory issues with storing the string in arrays, how about this?

# string to parse, comes from the database as you suggested
$string = 'I at and Cubs PHP Cubs';

# string is now an array
$stringArray = explode(" ", $string);

# list of "obvious" words to exclude, this would probably come from a database table
$wordsToExclude = array('I', 'at', 'and');

# array that contains your "keywords"
# Array('Cubs', 'PHP', 'Cubs')
$keywordArray = array_diff($stringArray, $wordsToExclude);

# array with the keyword as the key and the count as the value
# Array('Cubs' => 2, 'PHP' => 1)
$countedValues = array_count_values($keywordArray);

Now you need to "search" the database for the keys in the $countedValues array. What does your table look like?
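
Since the table doesn't exist yet, here is one possible sketch of the insert-or-update step using PDO. The table and column names (`keywords`, `word`, `total`) and the function name `saveKeywordCounts` are assumptions, not anything from the question. It tries an UPDATE first and falls back to an INSERT when no row was touched, and it uses an in-memory SQLite database only so the example runs on its own.

```php
<?php
// Hypothetical helper: adds each keyword's count to a `keywords` table,
// updating the row if the word already exists and inserting it otherwise.
// Table and column names are assumptions made for this sketch.
function saveKeywordCounts(PDO $pdo, array $countedValues): void
{
    $update = $pdo->prepare('UPDATE keywords SET total = total + :n WHERE word = :word');
    $insert = $pdo->prepare('INSERT INTO keywords (word, total) VALUES (:word, :n)');

    $pdo->beginTransaction();
    try {
        foreach ($countedValues as $word => $n) {
            $params = array(':word' => (string) $word, ':n' => $n);
            $update->execute($params);
            if ($update->rowCount() === 0) {
                // No existing row was updated, so this keyword is new.
                $insert->execute($params);
            }
        }
        $pdo->commit();
    } catch (Exception $e) {
        $pdo->rollBack();
        throw $e;
    }
}

// In-memory SQLite keeps the sketch self-contained; swap the DSN for MySQL.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE keywords (word VARCHAR(64) PRIMARY KEY, total INTEGER NOT NULL)');

saveKeywordCounts($pdo, array('Cubs' => 2, 'PHP' => 1)); // first string
saveKeywordCounts($pdo, array('Cubs' => 3));             // a later string

$rows = $pdo->query('SELECT word, total FROM keywords ORDER BY word')
            ->fetchAll(PDO::FETCH_KEY_PAIR);
print_r($rows);
```

On MySQL specifically, a single `INSERT ... ON DUPLICATE KEY UPDATE total = total + VALUES(total)` statement against a unique `word` column would probably be the better choice, since it avoids the gap between the UPDATE and the INSERT when several requests write at once.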

Or of course you could avoid reinventing the wheel and Google "php tag cloud"...

Reference: PHP array functions

Jason Strimpel
  • this is unusable in the real life example. – Your Common Sense Apr 03 '11 at 13:52
  • Google "php tag cloud" will give you either HTML formatting or selecting from DB. In fact, the question has nothing to do with cloud and, possibly - with tags. – Your Common Sense Apr 03 '11 at 13:54
  • Can you help me understand why? Given the information provided, it seems a simple solution to start with. – Jason Strimpel Apr 03 '11 at 13:55
  • Thanks. I haven't created the table to insert the tags yet, but the string will be considerably long. But I can shorten it. What would be a character limit per string, you might suggest so I don't have memory issues? – Lea Apr 03 '11 at 13:55
  • Why don't you try it on some real blog post? or just on the opening post? – Your Common Sense Apr 03 '11 at 13:57
  • @Lea: I think what @Col. Shrapnel is vaguely implying is that my proposed solution is not robust. It will not parse HTML, special characters, etc. The idea was to get you started. You don't want to shorten the string. If the strings are long, we'll need another solution. – Jason Strimpel Apr 03 '11 at 14:01
  • 1
    I am ignoring him, he is being unconstructive, I suggest you do the same. I appreciate ANY help, as I did say in the OP. – Lea Apr 03 '11 at 14:05
  • 1
    @Lea in fact I am being constructive. Sometimes we need some criticism too, not only code to copy/paste. – Your Common Sense Apr 03 '11 at 14:12
  • 3
    Of course. But that isn't what the question is about. Maybe you should go and ask your own question about that topic. And you could call it "Should this website be about giving out copy & paste examples, or help a poster learn through theory"... no doubt it would be a great question. But it's off-topic here. None of your input has helped aside from being critical, which again, isn't the point of the post, or this forum. People don't ask questions for you to criticize them. And people don't give answers for you to prod them. You haven't helped a bit, therefore you haven't been constructive. – Lea Apr 03 '11 at 14:18
  • @Col. Shrapnel, I get that a lot :) http://stackoverflow.com/questions/5526457/php-convert-string-to-htmlentities/5526701#5526701 – Dejan Marjanović Apr 03 '11 at 14:19
  • @Lea Oh. I thought my first comment did make you think, which way you gonna choose - extract keywords automatically or compare them to the list. But it seems you still don't know what you want. I am sorry then. That's indeed my fault – Your Common Sense Apr 03 '11 at 14:45