I'm looking for an algorithm that will take a vector of strings v1 and return a similar vector of strings v2 where each string is less than x characters long and is unique. The strings in v1 may not be unique.
While I need to accept ASCII in v1, I'd prefer to only insert alphanumeric characters ([A-Za-z0-9]) when insertion of new characters is required.
Obviously there are three caveats here:
- For some values of v1 and x, there is no possible unique v2, for example when v1 has 37 elements and x == 1.
- "Similar" as specified in the question is subjective. The strings will be user facing, and presumably short natural language phrases (e.g. "number of colours"). I want a human to be able to map the original to the shortened string as easily as possible. This probably means taking advantage of heuristics such as disemvoweling. Because there is probably no objective measure of my notion of similarity (string distance probably won't be the most useful here, although it might), my judgement on what is good will be arbitrary.
- The method should be suitable for English; other languages are irrelevant.
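The first caveat can at least be checked up front. A rough sketch, assuming output strings must be non-empty and may contain any of the 95 printable ASCII characters (both assumptions are mine, and the function names are made up for illustration):

```python
def max_unique_strings(x, alphabet_size=95):
    """Count the distinct non-empty strings shorter than x characters.

    alphabet_size=95 assumes printable ASCII; this is an assumption,
    not part of the problem statement.
    """
    return sum(alphabet_size ** n for n in range(1, x))


def is_feasible(v1, x, alphabet_size=95):
    """Necessary (not sufficient) condition for a unique v2 to exist."""
    return len(v1) <= max_unique_strings(x, alphabet_size)
```

This only rules out inputs that cannot work at all; it says nothing about whether the resulting strings would still be recognisable.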
Obviously this is a (programming) language-agnostic problem, but I'd look favourably on an implementation in Python (because I find its string processing straightforward).
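To make the requirement concrete, here is a naive sketch of the kind of behaviour I have in mind, not a solution I'm happy with. It assumes a fixed order of heuristics (keep the string if it fits, then disemvowel, then drop spaces, then truncate) and resolves collisions by overwriting the tail with an alphanumeric counter. All names (disemvowel, shorten, unique_short) are hypothetical.

```python
import re
from itertools import count

ALPHANUM = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def disemvowel(word):
    """Drop vowels after the first character of a word."""
    return word[:1] + re.sub(r"[aeiouAEIOU]", "", word[1:])


def shorten(text, max_len):
    """Shorten text to at most max_len characters, gentlest step first."""
    if len(text) <= max_len:
        return text
    candidate = " ".join(disemvowel(w) for w in text.split())
    if len(candidate) <= max_len:
        return candidate
    # Drop spaces, capitalising each word so boundaries stay visible.
    candidate = "".join(w[:1].upper() + w[1:] for w in candidate.split())
    return candidate[:max_len]


def to_suffix(n):
    """Encode a positive integer using only [A-Za-z0-9]."""
    digits = ""
    while n:
        n, r = divmod(n, len(ALPHANUM))
        digits = ALPHANUM[r] + digits
    return digits


def unique_short(v1, x):
    """Return v2: unique strings, each strictly shorter than x characters."""
    max_len = x - 1
    if max_len < 1:
        raise ValueError("x must allow at least one character")
    seen, v2 = set(), []
    for text in v1:
        base = candidate = shorten(text, max_len)
        if candidate in seen:
            # Overwrite the tail with a counter until the result is unused.
            for i in count(1):
                suffix = to_suffix(i)
                if len(suffix) > max_len:
                    raise ValueError("no unique shortening found")
                candidate = base[:max_len - len(suffix)] + suffix
                if candidate not in seen:
                    break
        seen.add(candidate)
        v2.append(candidate)
    return v2


v2 = unique_short(
    ["number of colours", "number of colours", "number of columns"], 12
)
```

The weakness is obvious: the order of the heuristics is arbitrary, and the counter suffix carries no meaning for a human reader, which is exactly the part I'd like a better answer for.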