
I have been trying to build a URL shortener, but I am unable to generate strings that are both unique and random. I have looked everywhere for a solution but couldn't find one, so I am posting here.

I have a table that assigns an auto-generated sequential primary key (ID) to each inserted record. I take that ID and run a bijective function on it that maps:

0  → a
1  → b
...
25 → z
...
52 → 0
61 → 9
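
For reference, a mapping like this is usually plain positional base-62 encoding. The Python sketch below assumes the alphabet a-z, A-Z, 0-9 (the placement of A-Z at 26 to 51 is an assumption, since the list above elides it), and the helper name `encode_base62` is made up for illustration.

```python
import string

# Assumed alphabet: a-z (0-25), A-Z (26-51), 0-9 (52-61).
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits

def encode_base62(n: int) -> str:
    """Encode a non-negative integer in base 62, most significant digit first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

# Consecutive IDs produce strings that differ only at the end.
for i in (1000, 1001, 1002):
    print(i, encode_base62(i))
```

With any encoding of this kind, neighbouring IDs share a prefix, which is what makes the result easy to guess.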

The issue is that the generated string is not random. For example:

63 --> b1
64 --> b2
...
1836 --> bpa
1837 --> bpb

That is very guessable. I have even tried encoding the ID as Base64, but the resulting string is again guessable. If I use a GUID instead and encode it as Base64, the resulting string is far too long. The string should be at most 7 or 8 characters, ideally 3 or 4.

I am wondering how bit.ly does it. Their generated short URLs are always unique and random.

NewbieProgrammer
  • I see. I was looking at CRC32, which generates an 8-character string. Will that be good enough if no better solution is available? Since each ID is unique, the CRC32-encoded string will also be unique, right? – NewbieProgrammer Apr 28 '20 at 08:27
  • No, CRC encoded string will not be unique. The CRC you're looking at probably generates a 32-bit number, which is displayed as 8 hex digits. So there are 2^32 possible CRC values. If you have more than that many possible IDs, you will have collisions. – Jim Mischel Apr 28 '20 at 13:57

2 Answers


What you do is generate sequential keys. The first URL is 1, next is 2, etc. But you obfuscate those keys by uniquely mapping each one to a different number. So 1 might become 76839427, and 2 might become 9935. Then you base64 encode the number.

The nice thing is that all you have to keep track of is the next sequential number. The process is reversible, so you can turn 9935 back into 2.

I give an example of the mapping in "Efficient algorithm for generating unique (non-repeating) random numbers".
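
As a rough illustration of such a reversible mapping (not necessarily the one used in the linked answer), here is a Python sketch that multiplies the sequential key by an odd constant modulo 2^32 and undoes it with the modular inverse. The modulus, the constant `MULT`, and the function names are arbitrary choices for this example.

```python
MOD = 2 ** 32                  # size of the key space (assumed for this sketch)
MULT = 2654435761              # arbitrary odd multiplier; odd means coprime to 2**32
MULT_INV = pow(MULT, -1, MOD)  # modular multiplicative inverse (Python 3.8+)

def obfuscate(key: int) -> int:
    """Map a sequential key to a scattered value in [0, MOD)."""
    return (key * MULT) % MOD

def deobfuscate(value: int) -> int:
    """Recover the sequential key from an obfuscated value."""
    return (value * MULT_INV) % MOD

for key in (1, 2, 3):
    print(key, obfuscate(key))
```

Because the multiplier is coprime to the modulus, distinct keys always map to distinct values. A bare multiplication is easy to reverse-engineer (0 always maps to 0, for instance), so treat it as obfuscation only.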

Another possibility is to use a Linear Feedback Shift Register with a long period. A maximal-length 64-bit LFSR has a period of 2^64 - 1 (every state except all zeros), so it is guaranteed not to repeat until you've generated that many numbers.
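
To make the idea concrete, here is a small Python sketch of a Galois LFSR. It uses 16 bits rather than 64 so that the full period can be checked quickly. The feedback mask 0xB400 is a commonly cited maximal-length choice for 16 bits, and the function names are illustrative.

```python
def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Galois LFSR by one step (feedback mask 0xB400)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state

def lfsr16_period(start: int = 0xACE1) -> int:
    """Count the steps until the register returns to its non-zero start state."""
    state = lfsr16_step(start)
    steps = 1
    while state != start:
        state = lfsr16_step(state)
        steps += 1
    return steps

print(lfsr16_period())
```

The register visits every non-zero 16-bit state once before repeating. The all-zero state maps to itself, so it must never be used as a seed.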

Note that neither of these is truly "random." They're methods of obfuscation, but given enough effort, somebody could crack the algorithm. But then, they could crack a pseudo-random number generator, too.

Jim Mischel

Transforming sequential integers to non-sequential 4-character tokens is fairly easy. If you use a reversible algorithm, then you can also easily transform these tokens back into sequential integers that could be used to retrieve URLs from a database.

Note: If you're planning to open up a public URL shortening service, then tokens of just 4 alphanumeric characters could be exhausted rather quickly. But for a personal or company website, they should be more than adequate. The method described below will also work for longer tokens, but if you really need a system that can store billions (or trillions) of URLs then you'll need to think more carefully about how you're going to organize all this data.
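
To put numbers on how quickly tokens run out, this short Python sketch counts the distinct tokens available at each length, assuming a 62-symbol alphabet. The function name `token_capacity` is made up for illustration.

```python
def token_capacity(length: int, alphabet_size: int = 62) -> int:
    """Number of distinct tokens of the given length."""
    return alphabet_size ** length

for length in range(3, 9):
    print(f"{length} chars: {token_capacity(length):,} tokens")
```

Four characters give 62^4 = 14,776,336 tokens, which is plenty for a private site but small for a public service.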

A linear congruential generator is a good way of obfuscating numbers. If you're working with the range from 0 to 62^4 − 1, then obviously your modulus m will be 62^4. And since the prime factors of m are 2 and 31, the multiplier a will have to be one more than a multiple of 124 (as explained here). The value of c can be any non-zero value that is relatively prime to m. For example:

function lcg($n) {
    # (10345073 - 1) % 124 == 0
    $m = 14776336; # = 62**4
    $a = 10345073;
    $c = 8912423;
    $n = ($n * $a + $c) % $m;
    return $n;
}
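
The conditions on a and c described above (known as the Hull-Dobell conditions) can be checked mechanically. The following Python sketch is one way to do that; the helper names are hypothetical and the factorisation uses simple trial division, which is fine for a modulus of this size.

```python
from math import gcd

def prime_factors(n: int) -> set:
    """Return the set of prime factors of n by trial division."""
    factors, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors

def has_full_period(a: int, c: int, m: int) -> bool:
    """Check the Hull-Dobell conditions for x -> (a*x + c) % m."""
    if gcd(c, m) != 1:                      # c must be coprime to m
        return False
    if any((a - 1) % p for p in prime_factors(m)):
        return False                        # a-1 must be divisible by every prime factor of m
    if m % 4 == 0 and (a - 1) % 4 != 0:
        return False                        # and by 4 when m is divisible by 4
    return True

print(has_full_period(10345073, 8912423, 62 ** 4))
```

When all three conditions hold, the generator visits every value in 0 to m − 1 exactly once per cycle, which is what makes the mapping a bijection.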

The inverse function is fairly similar. Instead of a, it uses its modular multiplicative inverse (mod m), and instead of c, it uses m − c:

function lcg_inv($n) {
    # (10345073 * 5661345) % (62**4) == 1
    $m = 14776336; # = 62**4
    $a_ = 5661345;
    $c_ = 5863913; # = $m-8912423
    $n = (($n + $c_) * $a_) % $m;
    return $n;
}
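
If you pick different parameters, the two inverse constants have to be recomputed. As a sketch, Python's built-in three-argument `pow` (version 3.8 or later) can derive them; the variable names below are illustrative.

```python
M = 62 ** 4
A = 10345073
C = 8912423

A_INV = pow(A, -1, M)   # modular multiplicative inverse of A (Python 3.8+)
C_NEG = M - C           # adding this is the same as subtracting C modulo M

def lcg(n: int) -> int:
    return (n * A + C) % M

def lcg_inv(n: int) -> int:
    return ((n + C_NEG) * A_INV) % M

print(A_INV, C_NEG)
```

The printed values should match the constants hard-coded in lcg_inv() above, and lcg_inv(lcg(n)) should return n for any n in range.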

Since LCGs are quite easy to predict from just a few output values, you can add another layer of obfuscation by randomizing the order of symbols used to represent these numbers in base 62 (e.g., W3qVL... instead of abcde...):

function int_2_token($n) {
    $alf = 'W3qVLpEKDxn8vzG0SQPfIX2yO51JsHBYCRbouTatZ4hMdlmF67UcNiAgwke9jr';
    $tok = '';
    if ($n < 0 || $n >= 62**4) return ''; # Value out of range
    $n = lcg($n);
    for ($i=0; $i<4; $i++) {
        $r = $n % 62;
        $tok .= $alf[$r];
        $n = ($n - $r) / 62;
    }
    return $tok;
}

function token_2_int($tok) {
    $t = [ '0'=>15, '1'=>26, '2'=>22, '3'=>1,  '4'=>41, '5'=>25, '6'=>48, '7'=>49,
           '8'=>11, '9'=>59, 'A'=>54, 'B'=>30, 'C'=>32, 'D'=>8,  'E'=>6,  'F'=>47,
           'G'=>14, 'H'=>29, 'I'=>20, 'J'=>27, 'K'=>7,  'L'=>4,  'M'=>43, 'N'=>52,
           'O'=>24, 'P'=>18, 'Q'=>17, 'R'=>33, 'S'=>16, 'T'=>37, 'U'=>50, 'V'=>3, 
           'W'=>0,  'X'=>21, 'Y'=>31, 'Z'=>40, 'a'=>38, 'b'=>34, 'c'=>51, 'd'=>44,
           'e'=>58, 'f'=>19, 'g'=>55, 'h'=>42, 'i'=>53, 'j'=>60, 'k'=>57, 'l'=>45,
           'm'=>46, 'n'=>10, 'o'=>35, 'p'=>5,  'q'=>2,  'r'=>61, 's'=>28, 't'=>39,
           'u'=>36, 'v'=>12, 'w'=>56, 'x'=>9,  'y'=>23, 'z'=>13 ];
    $n = 0;
    if (!preg_match('/^[a-z0-9]{4}$/i', $tok)) return -1; # Invalid token
    for ($i=3; $i>=0; --$i) {
        $n = $n * 62 + $t[$tok[$i]];
    }
    return lcg_inv($n);
}
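
The hand-written lookup table in token_2_int() does not have to be maintained separately; it can be derived from the alphabet string. The Python sketch below shows that, along with one possible way to generate a fresh shuffled alphabet. It covers only the base-62 digit step, not the LCG, and the function names are made up for illustration.

```python
import random
import string

ALF = "W3qVLpEKDxn8vzG0SQPfIX2yO51JsHBYCRbouTatZ4hMdlmF67UcNiAgwke9jr"

# Reverse lookup (symbol -> digit value), built from the alphabet itself.
LOOKUP = {ch: i for i, ch in enumerate(ALF)}

def digits_to_token(n: int, length: int = 4) -> str:
    """Write n in base 62 using ALF, least significant digit first."""
    out = []
    for _ in range(length):
        n, r = divmod(n, 62)
        out.append(ALF[r])
    return "".join(out)

def token_to_digits(tok: str) -> int:
    """Inverse of digits_to_token()."""
    n = 0
    for ch in reversed(tok):
        n = n * 62 + LOOKUP[ch]
    return n

def make_alphabet(seed=None) -> str:
    """Return the 62 alphanumeric symbols in a shuffled order."""
    symbols = list(string.ascii_letters + string.digits)
    random.Random(seed).shuffle(symbols)
    return "".join(symbols)
```

Whatever alphabet you generate, store it with the application: changing it later would invalidate every token already handed out.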

So when you get a new URL to shorten, insert it into your database with an auto-incrementing ID value, and pass this ID value to int_2_token() to obtain a four-character token to use in the shortened URL. When this shortened URL is requested, pass the token to token_2_int() to recover this ID so you can fetch the original URL.
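
The whole flow can be sketched end to end. The Python example below ports the two PHP functions above and wires them to an in-memory SQLite table. The table name `urls`, the helpers `shorten()` and `resolve()`, and the use of SQLite are assumptions for illustration, not part of the original answer.

```python
import sqlite3

M, A, C = 62 ** 4, 10345073, 8912423
A_INV = pow(A, -1, M)   # modular inverse of A (Python 3.8+)
ALF = "W3qVLpEKDxn8vzG0SQPfIX2yO51JsHBYCRbouTatZ4hMdlmF67UcNiAgwke9jr"
LOOKUP = {ch: i for i, ch in enumerate(ALF)}

def int_2_token(n: int) -> str:
    if not 0 <= n < M:
        raise ValueError("ID out of range for a four-character token")
    n = (n * A + C) % M                 # obfuscate with the LCG
    out = []
    for _ in range(4):                  # base 62, least significant digit first
        n, r = divmod(n, 62)
        out.append(ALF[r])
    return "".join(out)

def token_2_int(tok: str) -> int:
    if len(tok) != 4 or any(ch not in LOOKUP for ch in tok):
        return -1                       # invalid token
    n = 0
    for ch in reversed(tok):
        n = n * 62 + LOOKUP[ch]
    return ((n - C) * A_INV) % M        # undo the LCG

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL)")

def shorten(url: str) -> str:
    """Store the URL and return the token derived from its row ID."""
    cur = db.execute("INSERT INTO urls (url) VALUES (?)", (url,))
    db.commit()
    return int_2_token(cur.lastrowid)

def resolve(token: str):
    """Return the original URL for a token, or None if there is no match."""
    row = db.execute("SELECT url FROM urls WHERE id = ?", (token_2_int(token),)).fetchone()
    return row[0] if row else None

token = shorten("https://example.com/some/long/path")
print(token, resolve(token))
```

Only the auto-incrementing ID is stored; the token is recomputed on demand, so no uniqueness check against existing tokens is needed.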


Note: Don't forget that the set of all four-character tokens includes the entire set of all four-letter words. You will probably want to make sure that your URL shortener doesn't output anything vulgar or offensive.
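
One simple way to handle this is to screen each token against a blocklist before handing it out. The Python sketch below assumes a hypothetical `BLOCKLIST` with placeholder entries; a real deployment would need a proper word list and its own policy for what to do with a rejected token.

```python
BLOCKLIST = {"darn", "heck"}   # placeholder entries; supply a real list

def is_acceptable(token: str, blocklist=BLOCKLIST) -> bool:
    """Reject tokens that contain any blocked word, ignoring case."""
    lowered = token.lower()
    return not any(word in lowered for word in blocklist)

def first_acceptable(candidates, blocklist=BLOCKLIST):
    """Return the first candidate token that passes the filter, or None."""
    for tok in candidates:
        if is_acceptable(tok, blocklist):
            return tok
    return None

print(is_acceptable("W3qV"), is_acceptable("DaRn"))
```

Since each token comes from a database ID, one option (an assumption, not something the answer specifies) is to leave the offending ID unused and move on to the next one.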

r3mainer