Best clustering algorithm? (simply explained)

Question

Imagine the following problem:

You have a database containing about 20,000 texts in a table called "articles"
You want to connect the related ones using a clustering algorithm in order to display related articles together
The algorithm should do flat clustering (not hierarchical)
The related articles should be inserted into the table "related"
The clustering algorithm should decide whether two or more articles are related or not based on the texts
I want to code in PHP but examples with pseudo code or other programming languages are ok, too

I've coded a first draft with a function check() which gives "true" if the two input articles are related and "false" if not. The rest of the code (selecting the articles from the database, selecting articles to compare with, inserting the related ones) is complete, too. Maybe you can improve the rest, too. But the main point which is important to me is the function check(). So it would be great if you could post some improvements or completely different approaches.

APPROACH 1

<?php
$zeit = time();
function check($str1, $str2){
    $minprozent = 60;
    similar_text($str1, $str2, $prozent);
    $prozent = sprintf("%01.2f", $prozent);
    if ($prozent > $minprozent) {
        return TRUE;
    }
    else {
        return FALSE;
    }
}
$sql1 = "SELECT id, text FROM articles ORDER BY RAND() LIMIT 0, 20";
$sql2 = mysql_query($sql1);
while ($sql3 = mysql_fetch_assoc($sql2)) {
    $rel1 = "SELECT id, text, MATCH (text) AGAINST ('".$sql3['text']."') AS score FROM articles WHERE MATCH (text) AGAINST ('".$sql3['text']."') AND id NOT LIKE ".$sql3['id']." LIMIT 0, 20";
    $rel2 = mysql_query($rel1);
    $rel2a = mysql_num_rows($rel2);
    if ($rel2a > 0) {
        while ($rel3 = mysql_fetch_assoc($rel2)) {
            if (check($sql3['text'], $rel3['text']) == TRUE) {
                $id_a = $sql3['id'];
                $id_b = $rel3['id'];
                $rein1 = "INSERT INTO related (article1, article2) VALUES ('".$id_a."', '".$id_b."')";
                $rein2 = mysql_query($rein1);
                $rein3 = "INSERT INTO related (article1, article2) VALUES ('".$id_b."', '".$id_a."')";
                $rein4 = mysql_query($rein3);
            }
        }
    }
}
?>

APPROACH 2 [only check()]

<?php
function square($number) {
    $square = pow($number, 2);
    return $square;
}
function check($text1, $text2) {
    $words_sub = text_splitter($text2); // splits the text into single words
    $words = text_splitter($text1); // splits the text into single words
    // document 1 start
    $document1 = array();
    foreach ($words as $word) {
        if (in_array($word, $words)) {
            if (isset($document1[$word])) { $document1[$word]++; } else { $document1[$word] = 1; }
        }
    }
    $rating1 = 0;
    foreach ($document1 as $temp) {
        $rating1 = $rating1+square($temp);
    }
    $rating1 = sqrt($rating1);
    // document 1 end
    // document 2 start
    $document2 = array();
    foreach ($words_sub as $word_sub) {
        if (in_array($word_sub, $words)) {
            if (isset($document2[$word_sub])) { $document2[$word_sub]++; } else { $document2[$word_sub] = 1; }
        }
    }
    $rating2 = 0;
    foreach ($document2 as $temp) {
        $rating2 = $rating2+square($temp);
    }
    $rating2 = sqrt($rating2);
    // document 2 end
    $skalarprodukt = 0;
    for ($m=0; $m<count($words)-1; $m++) {
        $skalarprodukt = $skalarprodukt+(array_shift($document1)*array_shift($document2));
    }
    if (($rating1*$rating2) == 0) { continue; }
    $kosinusmass = $skalarprodukt/($rating1*$rating2);
    if ($kosinusmass < 0.7) {
        return FALSE;
    }
    else {
        return TRUE;
    }
}
?>

I would also like to say that I know that there are lots of algorithms for clustering but on every site there is only the mathematical description which is a bit difficult to understand for me. So coding examples in (pseudo) code would be great.

I hope you can help me. Thanks in advance!

There are WordPress plugins (yes, yuck, I know, spare me that) that do a surprisingly good job at this, they actually perform reasonable clustering (typically they do TF-IDF with back of words with k-means or something like that) and you can use them for inspirations (some of them are open source under MIT). — Benjamin Gruenbaum, May 02 '15 at 10:50
I think Anony-Mousse is right: clustering is not the ideal tool here. If each document belongs to just 1 cluster, then you have the problem of documents near the boundaries of a cluster being *more similar* to documents in other nearby clusters than to most of the documents in their own cluster. — j_random_hacker, May 05 '15 at 13:23

Albinofrenchy · Accepted Answer · 2009-05-12T19:35:52.810

15

The most standard way I know of to do this on text data like you have, is to use the 'bag of words' technique.

First, create a 'histogram' of words for each article. Lets say between all your articles, you only have 500 unique words between them. Then this histogram is going to be a vector(Array, List, Whatever) of size 500, where the data is the number of times each word appears in the article. So if the first spot in the vector represented the word 'asked', and that word appeared 5 times in the article, vector[0] would be 5:

for word in article.text
    article.histogram[indexLookup[word]]++

Now, to compare any two articles, it is pretty straightforward. We simply multiply the two vectors:

def check(articleA, articleB)
    rtn = 0
    for a,b in zip(articleA.histogram, articleB.histogram)
        rtn += a*b
    return rtn > threshold

(Sorry for using python instead of PHP, my PHP is rusty and the use of zip makes that bit easier)

This is the basic idea. Notice the threshold value is semi-arbitrary; you'll probably want to find a good way to normalize the dot product of your histograms (this will almost have to factor in the article length somewhere) and decide what you consider 'related'.

Also, you should not just put every word into your histogram. You'll, in general, want to include the ones that are used semi-frequently: Not in every article nor in only one article. This saves you a bit of overhead on your histogram, and increases the value of your relations.

By the way, this technique is described in more detail here

edited May 12 '09 at 19:35

answered May 12 '09 at 15:29

Albinofrenchy

424
5
10

Thank you very much! I've tried to code your approach in PHP and here's the result: http://paste.bradleygill.com/index.php?paste_id=9290 I hope your PHP is still good enough to say if it's correct or not. – caw May 12 '09 at 18:23
1

It seems to me to be correct, however, depending on your application, you seriously want to consider persisting the state of the term vectors. Also, consider dividing the score by the length of article a times the length of article b. Otherwise you will see a bias for long articles that are only marginally related. – Albinofrenchy May 12 '09 at 19:31
Sorry, certainly a stupid question, but what do you exactly mean with "consider persisting the state of the term vectors." On the second point: Do you mean "$score = $score/$length_a*$length_b" or "$score = $score/($length_a*$length_b)"? Probably the first one, right? – caw May 12 '09 at 20:12
I mean, instead of creating that vector whenever you are about to compare two articles, generate that vector whenever anyone saves an article and store it in a database. The second point: You want '$score = $score/($length_a*$length_b)'. If you check out the link I put up above, it has more about why you should do that(You are finding the 'angle' between the two vectors basically) – Albinofrenchy May 12 '09 at 20:43
Thank you for the quick replies. Now it should be finally correct: http://paste.bradleygill.com/index.php?paste_id=9326 – caw May 14 '09 at 18:56

score 6 · Answer 2 · answered May 02 '15 at 10:42

Maybe clustering is the wrong strategy here?

If you want to display similar articles, use similarity search instead.

For text articles, this is well understood. Just insert your articles in a text search database like Lucene, and use your current article as search query. In Lucene, there exists a query called MoreLikeThis that performs exactly this: find similar articles.

Clustering is the wrong tool, because (in particular with your requirements), every article must be put into some cluster; and the related items would be the same for every object in the cluster. If there are outliers in the database - a very likely case - they could ruin your clustering. Furthermore, clusters may be very big. There is no size constraint, the clustering algorithm may decide to put half of your data set into the same cluster. So you have 10000 related articles for each article in your database. With similarity search, you can just get the top-10 similar items for each document!

Last but not least: forget PHP for clustering. It's not designed for this, and not performant enough. But you can probably access a lucene index from PHP well enough.

score 1 · Answer 3 · answered May 13 '09 at 08:17

I believe you need to make some design decisions about clustering, and continue from there:

Why are you clustering texts? Do you want to display related documents together? Do you want to explore your document corpus via clusters?
As a result, do you want flat or hierarchical clustering?
Now we have the complexity issue, in two dimensions: first, the number and type of features you create from the text - individual words may number in the tens of thousands. You may want to try some feature selection - such as taking the N most informative words, or the N words appearing the most times, after ignoring stop words.
Second, you want to minimize the number of times you measure similarity between documents. As bubaker correctly points out, checking similarity between all pairs of documents may be too much. If clustering into a small number of clusters is enough, you may consider K-means clustering, which is basically: choose an initial K documents as cluster centers, assign every document to the closest cluster, recalculate cluster centers by finding document vector means, and iterate. This only costs K*number of documents per iteration. I believe there are also heuristics for reducing the needed number of computations for hierarchical clustering as well.

Thanks, good questions! 1) I want to display related documents together. 2) The algorithm should do flat clustering. 3) This would be useful if the texts were long, but in my case the articles contain at most 510 characters. So it isn't really necessary, is it? 4) The approach with k-means sounds good but I need lots of clusters and new clusters must be continuously created. Can I use k-means, though? — caw, May 13 '09 at 19:19
You can use K-means with K being very large. The cost is having to check the similarity of each document with every one of the clusters' centers. 'Continouously create new clusters' sounds like a top-down hierarchical clustering to me, but you can work in several epochs - start with a small K, run K-means until it converges, and use these clusters. Later, increase K, rerun K-means from the start, and use the resulting clusters, etc. — Yuval F, May 14 '09 at 05:57
Oh, I didn't know how k-means exactly works. If it works like that, I can't use it since I don't know the number of cluster centers. I have a database of news articles and all articles about the same topic should be grouped. — caw, May 14 '09 at 18:52

score 0 · Answer 4 · answered May 12 '09 at 17:26

What does the similar_text function called in Approach #1 look like? I think what you're referring to isn't clustering, but a similarity metric. I can't really improve on the White Walloun's :-) histogram approach - an interesting problem to do some reading on.

However you implement check(), you've got to use it to make at least 200M comparisons (half of 20000^2). The cutoff for "related" articles may limit what you store in the database, but seems too arbitrary to catch all useful clustering of texts,

My approach would be to modify check() to return the "similarity" metric ($prozent or rtn). Write the 20K x 20K matrix to a file and use an external program to perform a clustering to identify nearest neighbors for each article, which you could load into the related table. I would do the clustering in R - there's a nice tutorial for clustering data in a file running R from php.

The function similar_text() "calculates the similarity between two strings as described in Oliver [1993]". Yes, you're right, it's rather a similarity metric. But you need similarity checks for clustering, don't you? — caw, May 12 '09 at 17:56

Best clustering algorithm? (simply explained)

4 Answers4