31

I have a website with stories in it. I can have multiple types of stories within multiple categories like:

  • children
  • romance
  • scifi
  • action
  • thriler
  • quests

The stories are accessible using urls like:

www.example.com/action/story-name-action/
www.example.com/romance/story-name-romance/

and the first param (action) and the second (story-name-action) are redirected with .htaccess using rules. This part works just fine.

Lately, I get few dozen of 404 from different sites and here's what I want to do but I dont know how:

If someone types, for example: /action/story-nme-ction, I want to redirect to: action/story-name-action/

Is there an efficient way to implement this?

cdeszaq
  • 30,869
  • 25
  • 117
  • 173
natalia
  • 839
  • 5
  • 12

6 Answers6

37

Oh man, oh man!

What you're asking for is not simple and need you to have a powerful computer, but the results are simply amazing.

Here's what I'd suggest to do:

  • For 404 proper handling, you have the ErrorDocument redirection in vhost configuration. Mine looks like this: ErrorDocument 404 /404.php ;
  • When having a 404, Apache will call /404.php with all the arguments (which bad URL and so on, dump $_SERVER to see this). You have to test if there are only two expressions in the URL / i.e. http://mysite.com/(expr1)/(expr2)/
  • If not, then do a classical 404.
  • If yes then do a SOUNDEX search with MySQL (in your 404 Php file). See query sample here.
  • Then, in this "special" 404 case, do a suggestion, like google does, i.e.: "did you mean /action/story-name-action/? if so, click on the link".

This a hard work, but it's both interesting and shows your skill. Very few websites do this (I just know google actually).

Here's a demo on my French table that could give you an overview of how it works:

mysql> SELECT * FROM job WHERE
SOUNDEX( description ) LIKE SOUNDEX('Machiniste cinéma');
+-------+--------------------+
| id    | description        |
+-------+--------------------+
| 14018 | Machiniste cinéma  |
+-------+--------------------+
1 row in set (0.06 sec)

mysql> SELECT * FROM job WHERE
SOUNDEX( description ) LIKE SOUNDEX('Mchiniste cinéma');
+-------+--------------------+
| id    | description        |
+-------+--------------------+
| 14018 | Machiniste cinéma  |
+-------+--------------------+
1 row in set (0.06 sec)

mysql> SELECT * FROM job WHERE
SOUNDEX( description ) LIKE SOUNDEX('Machnste cinema');
+-------+--------------------+
| id    | description        |
+-------+--------------------+
| 14018 | Machiniste cinéma  |
+-------+--------------------+
1 row in set (0.06 sec)

mysql> 
Olivier Pons
  • 15,363
  • 26
  • 117
  • 213
  • Soundex is intended for natural language parsing - it does have the advantage of coming with mysql out of the box, however numbers have very specific significance in soundex, and non-alphanumerics are not handled by it (what does "@?/!" sound like?) – symcbean Jan 22 '12 at 00:54
  • 2
    You're right. I've just suggested the only solution that is the closest to what natalia wants. And what she wants is almost impossible to do. – Olivier Pons Jan 22 '12 at 06:51
  • Soundex is the _wrong_ algo here because it aims to achieve phonetic invariance. What you need is Typex ;-? -- that is a qwerty typo invariance, and that is entirely a different measure: i.e. is a word in the URL one letter or one QWERTY transposition different from one of a list of known words. There may be some open source algos related to spell-check functions out there (e.g. in Openoffice.org) that might be a starting point, but this is going to be a many-month effort for trivial impact on hits -- because the delta will be the % of typo-URLs where the user doesn't retry correctly. – TerryE Jan 31 '12 at 19:10
  • 1
    Thank you for Typex, never heard about it, and no valuable link on google. I do agree that it's a lot of work for a few % errors. Anyway... did you try to do a query in MySQL like `SELECT * FROM table WHERE SOUNDEX( table.column ) LIKE SOUNDEX('story-nme-ction')`? I've edited my question with a real-life demo =) – Olivier Pons Jan 31 '12 at 19:55
  • 1
    If you have some links for `Typex` I'm interested! – Olivier Pons Jan 31 '12 at 19:58
  • 1
    Because you query `WHERE SOUNDEX(description) LIKE SOUNDEX('Machnste cinema')` you ought to store `SOUNDEX(description)` in it's own column, say, `descriptionex`. Then you can go `WHERE descriptionex LIKE SOUNDEX('Machnste cinema')`. `descriptionex` is indexable and great for performance. – Stephen Quan Feb 04 '12 at 06:07
  • You're perfeclty right. I wasn't talking about performance, I was just trying to put natalia on the right path. Thank you for this optimization suggestion! – Olivier Pons Feb 04 '12 at 16:43
18

Unless you are very sure of the URL the user really wanted to navigate to, using rewrite / redirecting to a specific URL is a very bad idea.

Taking your example, suppose you want to handle every case where two letters may have been dropped, with 17 characters in the last part of the URL, that's 17*16 = 272 combinations, while it may be possible to match multiple 'false' urls with one regex, you're stil going to need a lot of rewrite rules.

A better solution would be, to implement 404 handler using PHP (since you included that tag in your q), to generate a list of (say) the top 10 URLs whose paths have the shortest levenstein distance from the requested path, along with a default link and supporting text. (There are mysql based implementations - try Google for URLs). NB handler should still return a 404 status - NB HTML content must be more than a minimum length to suppress MSIE's 'friendly' error message.

symcbean
  • 47,736
  • 6
  • 59
  • 94
  • +1.. I was just about to write that myself.. Maybe something like `SOUNDEX` on the MySQL level could be used to find similar URLs. Or `SELECT * FROM pages WHERE pageUrl SOUNDS LIKE userEnteredUrl`. – bummzack Jan 20 '12 at 12:37
7

If you know what the possible correct URLs could be, you can use:

levenshtein($givenURL, $possibleURL)

Example from PHP docs, comments removed for brevity:

$input = 'carrrot';

$words  = array('apple','pineapple','banana','orange',
                'radish','carrot','pea','bean','potato');

$shortest = -1;

foreach ($words as $word) {
    $lev = levenshtein($input, $word);
    if ($lev == 0) {
        $closest = $word;
        $shortest = 0;
        break;
    }
    if ($lev <= $shortest || $shortest < 0) {
        $closest  = $word;
        $shortest = $lev;
    }
}

echo $shortest == 0 ? "Exact match found: $closest\n" : "Did you mean: $closest?\n";

Outputs:

Input word: carrrot
Did you mean: carrot?

This is good when you think people may have omitted a letter or put an extra one in, but it may fall short when people genuinely don't know how to spell a word and came up with something creative!

If you prefer the soundex() route, take a look at the metaphone() function.

I like the idea of using metaphone() alongside levenshtein() or similar_text(), as it returns a phonetic representation of the word, and you still want to see how similar it is to your original.

Examples:

metaphone('name') = NM
metaphone('naaaaaameeeeeeee') = NM
metaphone('naiym') = NM
metaphone('naiyem') = NYM

While a lot of misspellings will return an identical match, the last example shows that you really still want to find the closest match with something like levenshtein()

For efficiency, if you use a different 404 file where the rewrites tried to match this pattern and failed, than you use for the rest of the site, it really shouldn't really be a massive overhead.

If you're getting the same 404 from the same referrer a lot, (and can't get them to change the link) it might be worth just putting a static rewrite in for that case.

Leigh
  • 12,859
  • 3
  • 39
  • 60
4

There are a few solutions:

  • Identify the source of the faulty URL's. This simply should not happen, and I can't imagine why it does. Are other people linking from elsewhere and they made a typo (ignoring the existence of copy&paste)? Can you see where this comes from (referer) and contact them?
  • Add an ID to the URL, so /action/123/story-name-action, where you look the article up on ID and not on its title (bonus: adds the possibility to create multiple stories of the same category with the same title)
  • Do a fuzzy search on the title, using something like soundex, and redirect the user to the most fitting title or show an overview page with look-alike titles like @symcbean suggests.

I prefer the ID one though.

CodeCaster
  • 147,647
  • 23
  • 218
  • 272
1

We set redirects to the search page for url's like this - our search has "suggested" feature.

Timur
  • 6,668
  • 1
  • 28
  • 37
0

Since the values are (presumably) being pulled from MySQL based on the title, you could put a FULLTEXT index on the title column, and use the MySQL MATCH() function to find the most relevant match, and redirect the user to that.

Although this will be by no means a perfect solution - the type of intelligence required to do this with any real degree of accuracy is veering dangerously close to Turing test territory.

DaveRandom
  • 87,921
  • 11
  • 154
  • 174