I have written a bijective encode/decode function in PHP that simply takes a number and converts it to and from base-58 using a custom alphabet.
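For readers following along, here is a minimal sketch of such an encode/decode pair. It is written in Python rather than PHP for brevity, and the 58-character `ALPHABET` is a hypothetical stand-in, since the actual custom alphabet is not shown in the question.

```python
# Hypothetical 58-character alphabet: 6 digits plus all 52 ASCII letters.
# The asker's real alphabet is not shown; this one is only for illustration.
ALPHABET = "234567abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 58


def encode(n: int) -> str:
    """Convert a non-negative integer to its base-58 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, remainder = divmod(n, BASE)
        chars.append(ALPHABET[remainder])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(chars))


def decode(s: str) -> int:
    """Convert a base-58 string back to the integer it represents."""
    n = 0
    for ch in s:
        # str.index raises ValueError for characters outside the alphabet.
        n = n * BASE + ALPHABET.index(ch)
    return n
```

With this sketch, `decode(encode(n)) == n` holds for every non-negative `n`. The reverse direction only round-trips for strings that do not start with the alphabet's zero digit (`"2"` here), which matters once arbitrary words are fed into `decode`.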

This shortener works fine, but I want to be able to restrict certain words and to create custom vanity URLs.

This should mean that a user will never have their link resolved to domain.com/boobs or something similar.

I also want to be able to have domain.com/stackoverflow resolve to domain.com/12342 without disrupting the bijective function.

PROPOSED SOLUTION

I had a couple of proposals, but they don't seem optimal. One way I thought of doing it is to store the custom URLs in the database (e.g. 1234 => mycoolurl) and then, upon encode/decode, look up whether an entry already exists. If it does, offset the number by something like 10,000,000 (so 1234 would become 10,001,234) and then encode/decode it. This makes some links much longer than others and sets a hard limit of 10,000,000 links (which is practically OK, but still not very elegant). To solve the curse-word issue, I can insert dummy links into the DB.

I'd love to hear your input!

mrBorna

1 Answer

The way I see it, the bijective function is only one part of a shortener, and both of your issues are outside of that function's responsibility.

I think you can address the curse-word issue by excluding all vowels from your custom alphabet (thus changing base-58 to base-48 and sacrificing some URL shortness), but that's probably all you can do inside the function itself.
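A rough illustration of the vowel-free alphabet and what it costs, sketched in Python. The starting alphabet is hypothetical and assumed to contain all ten vowels (upper and lower case), which is what makes the result come out at exactly 48 symbols.

```python
# Hypothetical base-58 alphabet that contains every ASCII vowel.
ALPHABET = "234567abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
VOWELS = set("aeiouAEIOU")

# Drop the vowels, keeping the original symbol order.
SAFE_ALPHABET = "".join(ch for ch in ALPHABET if ch not in VOWELS)


def codes_up_to_length(base: int, length: int) -> int:
    """Count distinct integers representable with at most `length` digits."""
    return base ** length


full = codes_up_to_length(len(ALPHABET), 4)       # 58 ** 4
safe = codes_up_to_length(len(SAFE_ALPHABET), 4)  # 48 ** 4
```

Under these assumptions, four characters cover 11,316,496 keys in base-58 but only 5,308,416 in base-48, so the same number of links needs slightly longer paths.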

If we take the shortening algorithm as a whole, assuming the most obvious variant with a key-value table (or other storage):

  1. get the incoming URL
  2. generate a random key number and check the storage to see whether it is already used; regenerate and repeat if needed
  3. store the key and the incoming URL in the storage
  4. apply the bijection function to the key to get the short URL path

then the curse-words issue is easily solvable by checking the resulting path against a list of stop words/regexes and regenerating the random number if it matches.
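The steps above, together with the stop-list check, might look roughly like this in Python. Everything here is an assumption made for illustration: the alphabet, the `STOP_PATTERNS` list, the `shorten` function name, and the plain dict standing in for the key-value storage. The sketch also encodes the key before storing it, so a blocked path can be rejected without ever being written.

```python
import re
import secrets

# Hypothetical alphabet; the real custom alphabet is not shown in the thread.
ALPHABET = "234567abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)

# Placeholder stop list; a real deployment would maintain its own.
STOP_PATTERNS = [re.compile(p, re.IGNORECASE) for p in ("boobs", "damn")]


def encode(n: int) -> str:
    """Convert a non-negative integer to its base-58 string."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, remainder = divmod(n, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def is_blocked(path: str) -> bool:
    """True if the path contains any stop word or matches any stop regex."""
    return any(p.search(path) for p in STOP_PATTERNS)


def shorten(url: str, storage: dict, key_space: int = 2**32,
            max_attempts: int = 1000) -> str:
    """Pick a random unused key whose encoded path is not blocked."""
    for _ in range(max_attempts):
        key = secrets.randbelow(key_space)
        if key in storage:
            continue  # key already used: regenerate
        path = encode(key)
        if is_blocked(path):
            continue  # path hits the stop list: regenerate
        storage[key] = url
        return path
    raise RuntimeError("no free, unblocked key found")
```

A call such as `shorten("https://example.com/page", {})` returns the short path and records the key in the dict passed in. With a real database, the "check then store" pair would need to be atomic (for example via a unique constraint), which this sketch does not attempt.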

As for the vanity URLs, this is solvable by applying the bijection function in the decode direction to the desired path to get the key number, and using that key in step 2 instead of a random number, unless I'm missing something. Of course, you should be prepared for possible conflicts, or pre-reserve the list of vanity short URLs by pointing them at something like domain.com/reserved. Also, to get long enough vanity words you will obviously need a large enough key space: with 4-byte ints you get about 5 chars max.
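To make the vanity idea and the key-space limit concrete, here is a small Python sketch. The alphabet and the helper names `vanity_key` and `max_full_length` are hypothetical. The round-trip check guards against a detail of positional encodings: a word that starts with the alphabet's zero digit decodes to the same key as the word without it, so it cannot be used as a vanity path.

```python
# Hypothetical alphabet; it must contain every character a vanity word uses.
ALPHABET = "234567abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)


def encode(n: int) -> str:
    """Convert a non-negative integer to its base-58 string."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, remainder = divmod(n, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode(s: str) -> int:
    """Convert a base-58 string back to the integer it represents."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n


def vanity_key(path: str) -> int:
    """Return the key that encodes to `path`, or raise ValueError."""
    if not path or any(ch not in ALPHABET for ch in path):
        raise ValueError("path uses characters outside the alphabet")
    key = decode(path)
    if encode(key) != path:
        raise ValueError("path starts with the zero digit; no key maps to it")
    return key


def max_full_length(bits: int) -> int:
    """Longest path length for which every string fits in a `bits`-bit key."""
    length = 0
    while BASE ** (length + 1) <= 2 ** bits:
        length += 1
    return length
```

Here `max_full_length(32)` is 5 and `max_full_length(64)` is 10, so a 13-character word like `stackoverflow` decodes to a key larger than an unsigned 64-bit integer can hold. The storage would need an arbitrary-precision or string key column for such words, which is one argument for the separate alias table.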

Another option is to remove vanity words from the shortener (by adding them to the stop list) and implement a separate aliasing function that does not use the bijection function at all, but only stores (vanity URL, short URL) pairs.
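A minimal Python sketch of that aliasing approach, assuming two plain dicts stand in for the database tables: `aliases` maps a vanity word to an existing short path, and `storage` maps key numbers to long URLs. The names and the alphabet are hypothetical.

```python
from typing import Optional

# Hypothetical alphabet; the real custom alphabet is not shown in the thread.
ALPHABET = "234567abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)


def encode(n: int) -> str:
    """Convert a non-negative integer to its base-58 string."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, remainder = divmod(n, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode(s: str) -> int:
    """Convert a base-58 string back to the integer it represents."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n


def resolve(path: str, aliases: dict, storage: dict) -> Optional[str]:
    """Look up a request path: alias table first, then the bijection."""
    short_path = aliases.get(path, path)
    try:
        key = decode(short_path)
    except ValueError:
        return None  # path contains characters outside the alphabet
    return storage.get(key)
```

For example, with `storage = {12342: "https://stackoverflow.com"}` and `aliases = {"stackoverflow": encode(12342)}`, both the vanity word and the generated short path resolve to the same long URL. Because the vanity words are also on the stop list, the random generator never produces a path that shadows an alias.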

vokhot