0

I have read and watched many youtube videos & links which all provide same solution which is:

  1. Use a distributed counter like zookeeper
  2. Counter max limit can be 3.5 trillion
  3. Convert the Counter value to Base62

which is all fine when the counter value is small. e.g. generated counter value: 120001 => base62 value FMJQmhBR

but when the counter provides large counter value like below the base62 value length also increases. generated counter value: 120003658=> base62 value HRGZF8RiHC6y

So how can this be a solution for exact tiny url with exact 8 length.

https://www.linqz.io/2018/10/how-to-build-a-tiny-url-service-that-scales-to-billions.html https://www.youtube.com/watch?v=eCLqmPBIEYs https://www.youtube.com/watch?v=JQDHz72OA3c&t=1862s

Navin
  • 684
  • 1
  • 11
  • 24

2 Answers2

0

Given that you are not hashing every url, but a vaguely-predictable number, you could hash the result and take the first N bits

However, there are many solutions for what to do for collisions

  • ignore them - they will be rare (ideally)
  • choose the next value
  • hash the result again (with your input)
  • increment the size of the returned string
  • ...

Here's a great video on cuckoo hashing (which is a structure of hashes relevant here):
https://www.youtube.com/watch?v=HRzg0SzFLQQ


Here's an example in Python which finds an 8-character string from the hash which should be fairly unique (this could then be collected into a sorted data structure mapping it into a URL)

This works by first hashing the value with an avalanching hash (SHA-265) and then loops to find the minimum amount of it (slice from the front of the hex string) to form an 8-char base62 string

This could be made much more efficient (even, for example by bisecting), but may be clearer as-is and depends hugely on unspecified algorithm requirements

import hashlib

BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

m = hashlib.sha256()
m.update("https://stackoverflow.com/questions/65714033/tiny-url-system-design".encode())
digest  = m.digest()    # hash as bytes       b',\xdb3\x8c\x98g\xd6\x8b\x99\xb6\x98#.\\\xd1\x07\xa0\x8f\x1e\xb4\xab\x1eg\xdd\xda\xd6\xa3\x1d\xb0\xb2`9'
hex_str = digest.hex()  # string of hex chars 2cdb338c9867d68b99b698232e5cd107a08f1eb4ab1e67dddad6a31db0b26039

for hashlen in range(100, 1, -1):
    number = int(hex_str[:hashlen], 16)  # first_n_chars(str(hex)) -> decimal
    val = ""
    while number != 0:
        val = "{}{}".format(BASE62[number % 62], val)  # append new chars to front
        number = number // 62  # integer division
    if len(val) <= 8:
        break

print(val)  # E0IxW0zn

base62 logic from How to fix the code for base62 encoding with Python3?

ti7
  • 16,375
  • 6
  • 40
  • 68
  • Understood your solution. but I am trying to understand the solution which all those article & system design video's says. e.g. https://dev.to/rajdeepc/how-to-design-a-url-shortener-web-app-29gp I just says pass the counter to base62 service and get 7 length unique value. How the 7 character length value is generated for a long counter number like e.g. 1200004567 – Navin Jan 14 '21 at 06:05
  • 1
    If you pass the counter, then you're not passing a 7 character-long string, you're just passing a number, and that's really short https://dotnetfiddle.net/3X0DfW – Martheen Jan 14 '21 at 06:07
  • @Martheen nice example. but are you saying is that what the base62 service mentioned in dev.to/rajdeepc/how-to-design-a-url-shortener-web-app-29gp does? convert to byte and then encode to base62? – Navin Jan 14 '21 at 06:35
  • 1
    Ah, sorry, the GetBytes is an artefact from my previous sanity check. That line not actually used, the code simply take counter as is and run a while loop on it, so there's no explicit "convert to byte" needed. – Martheen Jan 14 '21 at 06:48
  • @Martheen got it. My bad. Thanks! – Navin Jan 14 '21 at 06:54
  • I added a Python example which hashes a url and then finds the least hash information needed to make an 8-char encoding – ti7 Jan 14 '21 at 07:03
0

First: there absolutely is a compression limit. If your chosen representation has a maximum length, that imposes a hard limit on your key space.

Let's unpack that a little. Let's say you've got 80 guests at a party, and you want to give each guest a unique label (for their drink cup or something). If you've decided that each label will be a single letter from the English alphabet, then you only have enough unique labels for 26 guests.

Second: FMJQmhBR is not the most efficient way to represent the number 120001. It takes 17 bits in binary: 11101010011000001 (not sure which endianness that is). 16 bits is just two ASCII characters, and three ASCII characters can accommodate nearly 17 million unique values. And that's without any kind of special, ZIP-like compression.

--

I think most URL shorteners work essentially by assigning a counting number to each URL that someone shortens. So, the very first URL that gets submitted will be given ID=1: they save the whole URL in the database and associate it with that number. The second URL gets ID=2, etc.

That's pretty crude, though. For a variety of reasons, they don't want to hand those IDs out in order. But if they know how long they want the identifiers to be, it's not hard hand those IDs out in random order:

  • When someone submits a URL, the system picks a random number between 0 and the highest-possible ID. If the URL identifiers are all supposed to be 8 ASCII characters, that means they pick a random number between 0 and 2^(8*8) = 1.844674407e19.
  • Then they check their DB to see if they've handed out that ID. If they have, they pick a different random number. They repeat this until they pick an ID that hasn't been handed out. (I think there are more efficient algorithms for this, but the effect is the same and this is easiest to understand.)
Tom
  • 8,509
  • 7
  • 49
  • 78
  • 1
    Arbitrary ASCII characters are not suitable for making short URLs; some will have to be %-encoded making the URLs longer anyway, and many others cannot be typed on all standard keyboards (in case the URL is being read aloud over the phone or copied from another screen). – kaya3 Jan 14 '21 at 06:14