
I'm looking for an algorithm that will take a vector of strings v1 and return a similar vector of strings v2 where each string is less than x characters long and is unique. The strings in v1 may not be unique.

While I need to accept ASCII in v1, I'd prefer to only insert alphanumeric characters ([A-Za-z0-9]) when insertion of new characters is required.

Obviously there are a couple of caveats here:

  1. For some values of v1 and x, there is no possible unique v2. For example, when v1 has 37 elements and x == 1.

  2. "Similar" as specified in the question is subjective. The strings will be user facing, and presumably short natural language phrases (e.g. "number of colours"). I want a human to be able to map the original to the shortened string as easily as possible. This probably means taking advantage of heuristics such as disemvoweling. Because there is probably no objective measure of my similarness construct (string distance probably won't be the most useful here, although it might) my judgement on what is good will be arbitrary. The method should be suitable for English - other languages are irrelevant.

Obviously this is a (programming-)language-agnostic problem, but I'd look favourably on an implementation in Python (because I find its string processing straightforward).

fmark
    What do you mean by ' I'd prefer to only insert alphanumeric characters ([A-Za-z0-9]) when insertion of new characters is required.' – jamylak Apr 02 '12 at 06:57
    This sounds like an interesting problem, but I'm struggling to see exactly what you're asking for. Do you think you could provide a very simple example of input and desired output? – Nolen Royalty Apr 02 '12 at 07:01
  • Also, what degree of "shortening" are we talking about? Number of colours -> nmbr of clrs, or number of colours -> n o c? – Nolen Royalty Apr 02 '12 at 12:04

3 Answers


Sketch -

Develop a list of functions that reduce the size of an English string. Order the functions from least to most obscuring.

For each string in v1, repeatedly apply an obscuring function until it can no longer reduce the size of the string, then move on to the next function.

When the desired size x is achieved, verify the reduced string is unique with respect to strings already in v2. If so, add it to v2; if not, continue to apply obscuring functions.
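In Python, that control loop might be sketched as follows (the shape of the obscuring functions is an assumption: each takes a string and returns it unchanged once it can no longer shorten it):

```python
def shorten_all(v1, x, obscurers):
    """Shorten each string in v1 to fewer than x characters, keeping results unique."""
    v2 = []
    seen = set()
    for s in v1:
        for obscure in obscurers:
            # Apply this function repeatedly until the string is short
            # enough and unique, or the function stops making progress.
            while len(s) >= x or s in seen:
                reduced = obscure(s)
                if reduced == s:  # no further reduction possible
                    break
                s = reduced
            if len(s) < x and s not in seen:
                break
        # Best effort: keep the string even if every function was exhausted.
        seen.add(s)
        v2.append(s)
    return v2
```

Note that if all functions are exhausted the best-effort string is kept anyway, which matches caveat 1 in the question: for some inputs no valid v2 exists.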

Following are some ideas for size reducing functions subjectively ordered from least to most obscuring. (The random selections are intended to increase the probability of the reduced string being unique.)

  1. Replace a random occurrence of two whitespace characters with a single space
  2. Replace a random occurrence of punctuation followed by space with a single space
  3. Remove a single character word at random that is also a member of a kill list (eg "I", "a")
  4. Remove a two character word at random that is also a member of a kill list (eg "an", "of")
  5. Remove a three character word at random that is also a member of a kill list (eg "the", "and")
  6. Replace a word of five or more characters with a word composed of its first three and last characters (eg "number" becomes "numr", "colours" becomes "cols")
  7. Remove a vowel at random
  8. Remove a word that occurs in a large number of strings in v1. The idea being that very common words have low value.
  9. Translate a word/phrase to a shorter "vanity license plate" word based on a dictionary (thesaurus), such as http://www.baac.net/michael/plates/index.html

(Note: the last two functions would require access to the initial unaltered string, and the correspondences between the unaltered and altered words.)
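As a rough illustration, rules 1, 4, and 6 might look like this in Python (the kill list here and the use of `random.choice` are my own assumptions; when only one candidate exists the choice is deterministic):

```python
import random
import re

KILL_2 = {"an", "of", "to", "in"}  # illustrative kill list for rule 4

def collapse_spaces(s):
    """Rule 1: replace a random run of two spaces with a single space."""
    spots = [m.start() for m in re.finditer("  ", s)]
    if not spots:
        return s
    i = random.choice(spots)
    return s[:i] + s[i + 1:]

def drop_short_word(s):
    """Rule 4: remove a random two-character word from the kill list."""
    words = s.split(" ")
    hits = [i for i, w in enumerate(words) if w.lower() in KILL_2]
    if not hits:
        return s
    del words[random.choice(hits)]
    return " ".join(words)

def abbreviate_word(s):
    """Rule 6: shorten a word of five or more letters to its first three plus last letter."""
    words = s.split(" ")
    for i, w in enumerate(words):
        if len(w) >= 5:
            words[i] = w[:3] + w[-1]
            return " ".join(words)
    return s
```

Each function returns the string unchanged when it cannot help, so a driver loop can detect when to move on to the next, more obscuring function.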

Brian Swift

A few notes / pointers about doing this in Python:

  1. Use the bisect module to keep your result list sorted, so you can easily spot potential non-uniques. This is helpful even if v1 is already sorted (e.g. name and enemy will collide after disemvoweling)
  2. Disemvoweling can be achieved by simply calling .translate(None, "aeiouyAEIOUY") on the string.
  3. In case of duplicates you could try to resolve the collisions first by lowercasing all results and using swapcase as a "bitmask", i.e. multiple occurrences of aaa become ["aaa", "aaA", "aAa", "aAA"] etc., and if this is not enough, by "incrementing" characters starting from the end until a non-colliding identifier is found, e.g. ["aa"]*7 would become ["aa", "aA", "Aa", "AA", "ab", "aB", "Ab"]
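One caveat: .translate(None, "aeiouyAEIOUY") is Python 2 syntax; in Python 3 the deletion table is built with str.maketrans. A sketch of points 2 and the swapcase part of point 3 under that assumption (`case_variants` and `uniquify` are illustrative names of my own):

```python
from itertools import product

# Python 3 spelling of the Python 2 .translate(None, "aeiouyAEIOUY") call.
DEVOWEL = str.maketrans("", "", "aeiouyAEIOUY")

def disemvowel(s):
    """Point 2: strip all vowels from the string."""
    return s.translate(DEVOWEL)

def case_variants(s):
    """Point 3: yield the lowercased string, then swapcase 'bitmask'
    variants with the last character varying fastest."""
    options = [(c, c.upper()) if c.isalpha() else (c,) for c in s.lower()]
    for combo in product(*options):
        yield "".join(combo)

def uniquify(s, taken):
    """Return the first case variant of s not already in taken."""
    for variant in case_variants(s):
        if variant not in taken:
            return variant
    raise ValueError("all case variants of %r are taken" % s)
```

The "incrementing characters" fallback for when all case variants are exhausted is left out here; it would kick in where `uniquify` currently raises.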
Kimvais
def split_len(seq, length):
    # Split seq into chunks of the given length.
    return [seq[i:i+length] for i in range(0, len(seq), length)]

newListOfString = []
for item in listOfStrings:
    # Keep only the first chunk, i.e. the first eight characters.
    newListOfString.append(split_len(item, 8)[0])

This returns the first eight characters of each string (note that it does nothing to guarantee the results are unique).

user850498
    I'm pretty sure that the OP was able to figure this much out by himself. Also `[_[:8] for _ in listOfStrings]` would do the same. – Kimvais Apr 02 '12 at 06:55