Transforming sequential integers to non-sequential 4-character tokens is fairly easy. If you use a reversible algorithm, then you can also easily transform these tokens back into sequential integers that could be used to retrieve URLs from a database.
Note: If you're planning to open up a public URL shortening service, then tokens of just 4 alphanumeric characters could be exhausted rather quickly. But for a personal or company website, they should be more than adequate. The method described below will also work for longer tokens, but if you really need a system that can store billions (or trillions) of URLs then you'll need to think more carefully about how you're going to organize all this data.
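To put numbers on that note, here is a small sketch that counts the available tokens for a few lengths. It also checks whether the product of two values below the token count still fits in a native PHP integer, which matters because the method below multiplies such values before reducing them. The helper names `token_capacity()` and `product_fits_in_int()` are made up for this illustration, and a 64-bit PHP build is assumed.

```php
<?php
// Hypothetical helper: number of distinct tokens of a given length over a
// 62-symbol alphabet (a-z, A-Z, 0-9). Assumes a 64-bit PHP build.
function token_capacity(int $length): int {
    return 62 ** $length;
}

// True when the product of two values below $m cannot overflow a PHP integer,
// so plain `*` and `%` stay exact.
function product_fits_in_int(int $m): bool {
    return $m <= intdiv(PHP_INT_MAX, $m);
}

foreach ([4, 5, 6] as $len) {
    $m = token_capacity($len);
    printf("%d chars: %d tokens, plain integer math %s\n",
        $len, $m, product_fits_in_int($m) ? 'is safe' : 'would overflow');
}
// 4 chars: 14776336 tokens, plain integer math is safe
// 5 chars: 916132832 tokens, plain integer math is safe
// 6 chars: 56800235584 tokens, plain integer math would overflow
```

So under these assumptions, tokens of six or more characters would need an overflow-safe modular multiplication (for example via an arbitrary-precision extension) rather than the plain `*` and `%` operators.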
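A linear congruential generator (LCG) is a good way of obfuscating numbers. If you're working with the range from 0 to 62⁴−1, then your modulus m will be 62⁴. For the generator to visit every value in that range exactly once, a−1 must be divisible by every prime factor of m, and also by 4 if m is divisible by 4 (these are the Hull-Dobell conditions for a full-period LCG). Since the prime factors of m are 2 and 31, and m is divisible by 4, the multiplier a has to be one more than a multiple of 124. The value of c can be any non-zero value that is relatively prime to m. For example: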
function lcg($n) {
    # (10345073 - 1) % 124 == 0
    $m = 14776336; # = 62**4
    $a = 10345073;
    $c = 8912423;
    $n = ($n * $a + $c) % $m;
    return $n;
}
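If you want to pick your own constants, the conditions on a and c can be checked mechanically. The following is a minimal sketch, not part of the method itself; the function names `gcd_int()`, `prime_factors()` and `has_full_period()` are invented for this example, and trial division is only reasonable because m is small.

```php
<?php
// Greatest common divisor (Euclid's algorithm).
function gcd_int(int $x, int $y): int {
    while ($y !== 0) {
        [$x, $y] = [$y, $x % $y];
    }
    return $x;
}

// Distinct prime factors of $n by trial division (fine for small $n).
function prime_factors(int $n): array {
    $factors = [];
    for ($p = 2; $p * $p <= $n; $p++) {
        if ($n % $p === 0) {
            $factors[] = $p;
            while ($n % $p === 0) {
                $n = intdiv($n, $p);
            }
        }
    }
    if ($n > 1) {
        $factors[] = $n;
    }
    return $factors;
}

// Hull-Dobell conditions for n -> (a*n + c) mod m to have period m.
function has_full_period(int $m, int $a, int $c): bool {
    if (gcd_int($c, $m) !== 1) {
        return false;           // c must be relatively prime to m
    }
    foreach (prime_factors($m) as $p) {
        if (($a - 1) % $p !== 0) {
            return false;       // a - 1 must be divisible by each prime factor of m
        }
    }
    if ($m % 4 === 0 && ($a - 1) % 4 !== 0) {
        return false;           // and by 4 when m is divisible by 4
    }
    return true;
}

var_dump(has_full_period(62 ** 4, 10345073, 8912423)); // bool(true)
```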
The inverse function is fairly similar. Instead of a, it uses its modular multiplicative inverse (mod m), and instead of c, it uses m−c:
function lcg_inv($n) {
    # (10345073 * 5661345) % (62**4) == 1
    $m = 14776336; # = 62**4
    $a_ = 5661345;
    $c_ = 5863913; # = $m-8912423
    $n = (($n + $c_) * $a_) % $m;
    return $n;
}
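The inverse constants do not have to be found by trial and error. As a sketch, the modular multiplicative inverse can be computed with the extended Euclidean algorithm; `mod_inverse()` is a name chosen for this example, not something from a library.

```php
<?php
// Modular multiplicative inverse of $a modulo $m via the extended Euclidean
// algorithm. Returns null when gcd(a, m) != 1, i.e. when no inverse exists.
function mod_inverse(int $a, int $m): ?int {
    [$r0, $r1] = [$m, $a % $m];
    [$t0, $t1] = [0, 1];
    // Invariant: $t0 * $a == $r0 (mod $m) and $t1 * $a == $r1 (mod $m).
    while ($r1 !== 0) {
        $q = intdiv($r0, $r1);
        [$r0, $r1] = [$r1, $r0 - $q * $r1];
        [$t0, $t1] = [$t1, $t0 - $q * $t1];
    }
    if ($r0 !== 1) {
        return null;
    }
    return ($t0 % $m + $m) % $m; // normalize into 0..m-1
}

$m = 62 ** 4;
echo mod_inverse(10345073, $m), "\n"; // 5661345
echo $m - 8912423, "\n";              // 5863913
```

Applying the forward step and then the inverse step with these two values should return the original number for every n in range.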
Since LCGs are quite easy to predict from just a few output values, you can add another layer of obfuscation by randomizing the order of symbols used to represent these numbers in base 62 (e.g., W3qVL... instead of abcde...):
function int_2_token($n) {
    $alf = 'W3qVLpEKDxn8vzG0SQPfIX2yO51JsHBYCRbouTatZ4hMdlmF67UcNiAgwke9jr';
    $tok = '';
    if ($n < 0 || $n >= 62**4) return ''; # Value out of range
    $n = lcg($n);
    for ($i=0; $i<4; $i++) {
        $r = $n % 62;
        $tok .= $alf[$r];
        $n = ($n - $r) / 62;
    }
    return $tok;
}
function token_2_int($tok) {
    $t = [ '0'=>15, '1'=>26, '2'=>22, '3'=>1, '4'=>41, '5'=>25, '6'=>48, '7'=>49,
           '8'=>11, '9'=>59, 'A'=>54, 'B'=>30, 'C'=>32, 'D'=>8, 'E'=>6, 'F'=>47,
           'G'=>14, 'H'=>29, 'I'=>20, 'J'=>27, 'K'=>7, 'L'=>4, 'M'=>43, 'N'=>52,
           'O'=>24, 'P'=>18, 'Q'=>17, 'R'=>33, 'S'=>16, 'T'=>37, 'U'=>50, 'V'=>3,
           'W'=>0, 'X'=>21, 'Y'=>31, 'Z'=>40, 'a'=>38, 'b'=>34, 'c'=>51, 'd'=>44,
           'e'=>58, 'f'=>19, 'g'=>55, 'h'=>42, 'i'=>53, 'j'=>60, 'k'=>57, 'l'=>45,
           'm'=>46, 'n'=>10, 'o'=>35, 'p'=>5, 'q'=>2, 'r'=>61, 's'=>28, 't'=>39,
           'u'=>36, 'v'=>12, 'w'=>56, 'x'=>9, 'y'=>23, 'z'=>13 ];
    $n = 0;
    if (!preg_match('/^[a-z0-9]{4}$/i', $tok)) return -1; # Invalid token
    for ($i=3; $i>=0; --$i) {
        $n = $n * 62 + $t[$tok[$i]];
    }
    return lcg_inv($n);
}
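The hand-written lookup table has to mirror the alphabet string exactly, which is easy to get wrong if you change the symbol order. One possible alternative, sketched here for the base-62 layer only (without the LCG step), derives the table from the alphabet at run time. The names `token_alphabet()`, `symbol_values()`, `encode_base62()` and `decode_base62()` are assumptions made for this example.

```php
<?php
// Single source of truth for the symbol order (same alphabet as above).
function token_alphabet(): string {
    return 'W3qVLpEKDxn8vzG0SQPfIX2yO51JsHBYCRbouTatZ4hMdlmF67UcNiAgwke9jr';
}

// Map each symbol to its position: 'W' => 0, '3' => 1, ..., 'r' => 61.
function symbol_values(): array {
    return array_flip(str_split(token_alphabet()));
}

// Encode $n (0 <= $n < 62**4) as 4 symbols, least significant digit first.
function encode_base62(int $n): string {
    $alf = token_alphabet();
    $tok = '';
    for ($i = 0; $i < 4; $i++) {
        $tok .= $alf[$n % 62];
        $n = intdiv($n, 62);
    }
    return $tok;
}

// Decode a 4-symbol token back to its integer; assumes the token was
// already validated, as token_2_int() does with preg_match().
function decode_base62(string $tok): int {
    $values = symbol_values();
    $n = 0;
    for ($i = 3; $i >= 0; $i--) {
        $n = $n * 62 + $values[$tok[$i]];
    }
    return $n;
}

echo encode_base62(0), "\n";                     // WWWW
echo decode_base62(encode_base62(123456)), "\n"; // 123456
```

To pick a different symbol order, shuffle the 62 characters once and hard-code the result; the order must never change after tokens have been handed out.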
So when you get a new URL to shorten, insert it into your database with an auto-incrementing ID value, and pass this ID value to int_2_token() to obtain a four-character token to use in the shortened URL. When this shortened URL is requested, pass the token to token_2_int() to recover this ID so you can fetch the original URL.
Note: Don't forget that the set of all four-character tokens includes the entire set of all four-letter words. You will probably want to make sure that your URL shortener doesn't output anything vulgar or offensive.
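One simple way to handle this, sketched under the assumption that you keep your own blocklist, is to test each candidate token and skip any ID whose token matches. The function names, the `$toToken` callback (which stands in for int_2_token()) and the sample blocklist are all hypothetical.

```php
<?php
// True when the token contains none of the blocked words, ignoring case.
// Assumes every entry in $blockedWords is a non-empty string.
function is_clean_token(string $token, array $blockedWords): bool {
    $lower = strtolower($token);
    foreach ($blockedWords as $word) {
        if (strpos($lower, strtolower($word)) !== false) {
            return false;
        }
    }
    return true;
}

// Return the first ID >= $id whose token passes the filter. $toToken stands in
// for int_2_token(); skipped IDs are simply never assigned to a URL.
function next_clean_id(int $id, callable $toToken, array $blockedWords): int {
    while (!is_clean_token($toToken($id), $blockedWords)) {
        $id++;
    }
    return $id;
}

// Placeholder blocklist; a real deployment would load a much longer list.
var_dump(is_clean_token('HeCk', ['heck', 'darn'])); // bool(false)
var_dump(is_clean_token('x7Qp', ['heck', 'darn'])); // bool(true)
```

With a database-assigned auto-increment ID, the equivalent would be to check the token right after the insert and, if it is rejected, leave that row unused and insert again.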