8

I've Collective Intelligence book, but I'm not sure how it can be apply in practical.

Let say I have a PHP website with mySQL database. User can insert articles with title and content in the database. For the sake of simplicity, we just compare the title.

  • How to Make Coffee?
  • 15 Things About Coffee.
  • The Big Question.
  • How to Sharpen A Pencil?
  • Guy Getting Hit in Balls

We open 'How to Make Coffee?' article and because there are similarity in words with the second and fourth title, they will be displayed in Related Article section.

How can I implement this using PHP and mySQL? It's ok if I have to use Python. Thanks in advance.

Azam
  • 797
  • 2
  • 8
  • 23

3 Answers3

12

Store a set of keywords alongside each product, which should essentially be everything in the title besides a set of stop words. When a title is displayed, you find any other products which share keywords in common (with those with one or more in common given priority).

You could further enhance this by assigning a score to each keyword based on its scarcity (with more scarce words being given a higher score, as a match on 'PHP', for instance, is going to be more relevant than a match on 'programming'), or by tracking the number of times a user navigates manually between a set of products.

Regardless you'd best start off by making it simple, and then enhance it as you go on. Depending on the size of your database more advanced techniques may not be all that fruitful.

Justin Simon
  • 1,133
  • 8
  • 8
  • What I understand here is that I have to explode all the words and store it in a table (ignore stop words) each time user insert an article including the article content (which may have thousands of words) and then attach the keyword with content id to each word. At the same time, the program will find similar words from other articles which can have many words again and find which article has more similar keywords and then store the related article in a table. I think it cam be heavy or is it not what I am thinking? – Azam Jun 10 '11 at 13:11
  • 1
    It needn't be "heavy". The simplest approach would be a many-to-many table with 2 columns - article ID and keyword. User selects article #1 which has keywords A, B, and C. You can do a simple COUNT like this: `SELECT articleID, COUNT(keyword) FROM keyword WHERE keyword IN (A, B, C) GROUP BY articleID ORDER BY COUNT(keyword) DESC` -- obviously this is very simple, just giving you the article which matches the MOST keywords, regardless of individual keyword value as **Justin Simon** said. But it'd be a start, and not at all difficult to get up and running. With a good index, also quite fast! – Drew Jun 10 '11 at 13:52
5

You're best off using a set of tags which are parsed and stored in the db when the title is inserted, and then querying based on that.

If you have to parse the title though, you'd basically be doing a LIKE query:

SELECT * FROM ENTRIES WHERE TITLE LIKE '%<keyword>%';

For a more verbose answer though:

// You need some test to see if the word is valid. 
// "is" should not be considered a valid match.
// This is a simple one based on length, a 
// "blacklist" would be better, but that's up to you.
function isValidEntry( $word )
{
    return strlen( $word ) >= 4;
}

//to hold all relevant search strings:
$terms = array();
$postTitleWords = explode( ' ' , strtolower( 'How to Make Coffee' ) );

for( $postTitleWords as $index => $word )
{
    if( isValidEntry( $word ) ) $terms[] = $word;
    else
    {
        $bef = @$postTitleWords[ $index - 1 ];
        if( $bef && !isValidEntry( $bef ) ) $terms[] = "$bef $word";
        $aft = @$postTitleWords[ $index + 1 ];
        if( $aft && !isValidEntry( $aft ) ) $terms[] = "$word $aft";
    }
}
$terms = array_unique( $terms );
if( !count( $terms ) ) 
{
    //This is a completely unique title!
}
$search = 'SELECT * FROM ENTRIES WHERE lower( TITLE ) LIKE \'%' . implode( '%\' OR lower( TITLE ) LIKE \'%' $terms ) . '\'%';
// either pump that through your mysql_search or PDO.
cwallenpoole
  • 79,954
  • 26
  • 128
  • 166
0

This can be simply achieved by using wildcards in SQL queries. If you have larger texts and the wildcard seems to be unable to capture the middle part of text then check if the substring of one matches the other. I hope this helps. BTW, your question title asks about implementing recommendation system and the question description just asks about matching a field among database records. Recommendation system is a broad topic and comes with many interesting algorithms (e.g, Collaborative filtering, content-based method, matrix factorization, neural networks, etc.). Please feel free to explore these advanced topics if your project is to that scale.

Dan
  • 1,100
  • 2
  • 13
  • 36
Ramesh
  • 11
  • 2