
OK, I'm sure somebody, somewhere must have come up with an algorithm for this already, so I figured I'd ask before I go off to (re)invent it myself.

I have a list of arbitrary (user-entered) non-empty text strings. Each string can be any length (except 0), and they're all unique. I want to display them to the user, but I want to trim them to some fixed length that I decide, and replace part of them with an ellipsis (...). The catch is that I want all of the output strings to be unique.

For example, if I have the strings:

  • Microsoft Internet Explorer 6
  • Microsoft Internet Explorer 7
  • Microsoft Internet Explorer 8
  • Mozilla Firefox 3
  • Mozilla Firefox 4
  • Google Chrome 14

then I wouldn't want to trim the ends of the strings, because that's the unique part (don't want to display "Microsoft Internet ..." 3 times), but it's OK to cut out the middle part:

  • Microsoft...rer 6
  • Microsoft...rer 7
  • Microsoft...rer 8
  • Mozilla Firefox 3
  • Mozilla Firefox 4
  • Google Chrome 14
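
The middle cut shown above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the question: the helper name `ellipsize_middle`, its `head` parameter (how many leading characters to keep), and the decision to count the three dots toward the length limit are all hypothetical.

```python
ELLIPSIS = "..."

def ellipsize_middle(text, max_len, head):
    """Keep the first `head` characters and as much of the end as still fits."""
    if len(text) <= max_len:
        return text  # already short enough, leave untouched
    tail = max_len - head - len(ELLIPSIS)
    if tail < 1:
        raise ValueError("max_len leaves no room for a tail")
    return text[:head] + ELLIPSIS + text[-tail:]

browsers = [
    "Microsoft Internet Explorer 6",
    "Microsoft Internet Explorer 7",
    "Microsoft Internet Explorer 8",
    "Mozilla Firefox 3",
    "Mozilla Firefox 4",
    "Google Chrome 14",
]
# Assumed limits: 17 characters in total, 9 of them kept at the start.
shortened = [ellipsize_middle(name, 17, 9) for name in browsers]
```

With these assumed limits the three Internet Explorer entries keep their distinguishing last characters, and the shorter names pass through unchanged.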

Other times, the middle part might be unique, and I'd want to trim the end:

  • Minutes of Company Meeting, 5/25/2010 -- Internal use only
  • Minutes of Company Meeting, 6/24/2010 -- Internal use only
  • Minutes of Company Meeting, 7/23/2010 -- Internal use only

could become:

  • Minutes of Company Meeting, 5/25/2010...
  • Minutes of Company Meeting, 6/24/2010...
  • Minutes of Company Meeting, 7/23/2010...
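
Trimming the end and then checking that the results are still distinct might look like the sketch below. The names `ellipsize_end` and `all_unique` are hypothetical, and counting the dots toward the limit is an assumption; the question does not say either way.

```python
ELLIPSIS = "..."

def ellipsize_end(text, max_len):
    """Cut the end of `text` so the result, dots included, fits in `max_len`."""
    if max_len <= len(ELLIPSIS):
        raise ValueError("max_len must be longer than the ellipsis")
    if len(text) <= max_len:
        return text
    return text[:max_len - len(ELLIPSIS)] + ELLIPSIS

def all_unique(strings):
    """True when no two strings in the list are equal."""
    return len(set(strings)) == len(strings)

minutes = [
    "Minutes of Company Meeting, 5/25/2010 -- Internal use only",
    "Minutes of Company Meeting, 6/24/2010 -- Internal use only",
    "Minutes of Company Meeting, 7/23/2010 -- Internal use only",
]
shortened = [ellipsize_end(m, 40) for m in minutes]
```

At an assumed limit of 40 the dates survive and the outputs stay distinct; at a limit of 30 the cut lands before the dates, so `all_unique` would report a collision and another strategy would be needed.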

I guess it should probably never ellipsize the very beginning of the strings, even if that would otherwise be allowed, since that would look weird. And I guess it could ellipsize more than one place in the string, but within reason -- maybe 2 times would be OK, but 3 or more seems excessive. Or maybe the number of times isn't as important as the size of the chunks that remain: less than about 5 characters between ellipses would be rather pointless.
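
These preferences can be written down as a small checker for candidate outputs. It is a rough sketch: the function name `looks_reasonable` and its defaults (at most 2 ellipses, chunks of at least 5 characters) are taken loosely from the paragraph above, and it assumes the original strings never contain "..." themselves.

```python
ELLIPSIS = "..."

def looks_reasonable(shortened, max_ellipses=2, min_chunk=5):
    """Check a candidate output against the stylistic preferences."""
    if ELLIPSIS not in shortened:
        return True  # nothing was cut, so nothing to object to
    if shortened.startswith(ELLIPSIS):
        return False  # never ellipsize the very beginning
    if shortened.count(ELLIPSIS) > max_ellipses:
        return False  # too many cuts
    # A trailing ellipsis leaves an empty final piece; ignore empty pieces.
    chunks = [c for c in shortened.split(ELLIPSIS) if c]
    return all(len(c) >= min_chunk for c in chunks)
```

A search over candidate shortenings could use a check like this to discard ugly results before testing the whole set for uniqueness.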

The inputs (both number and size) won't be terribly large, so performance is not a major concern (well, as long as the algorithm doesn't try something silly like enumerating all possible strings until it finds a set that works!).

I guess these requirements seem pretty specific, but I'm actually fairly lenient -- I'm just trying to describe what I have in mind.

Has something like this been done before? Is there some existing algorithm or library that does this? I've googled some but found nothing quite like this so far (but maybe I'm just bad at googling). I have to believe somebody somewhere has wanted to solve this problem already!

Ken

2 Answers


It sounds like an application of the longest common substring problem.

Replace the longest substring common to all strings with an ellipsis. If a string is still too long and you are allowed another ellipsis, repeat.

You have to realize that you might not be able to "ellipsize" a given set of strings enough to meet length requirements.
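
One pass of this idea could be sketched as follows. The brute-force search and the names `longest_common_substring` and `replace_common` are illustrative assumptions, not a reference implementation; the search is only suitable for small inputs like the ones described in the question.

```python
def longest_common_substring(strings):
    """Brute force: try every substring of the shortest input, longest first."""
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""

def replace_common(strings, ellipsis="..."):
    """Replace the first occurrence of the shared substring in each string."""
    common = longest_common_substring(strings)
    if len(common) <= len(ellipsis):
        return list(strings)  # replacing would not make anything shorter
    return [s.replace(common, ellipsis, 1) for s in strings]

explorers = [
    "Microsoft Internet Explorer 6",
    "Microsoft Internet Explorer 7",
    "Microsoft Internet Explorer 8",
]
result = replace_common(explorers)
```

For the three Internet Explorer names the shared part is the entire prefix, so `result` is "...6", "...7", "...8". That keeps the outputs unique but puts the ellipsis at the very beginning, which the question says it would rather avoid.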

erickson
  • Hmm, that's not a bad starting point, but I don't think it's quite what I was going for. Maybe my examples weren't chosen to make this clear, but I don't require that ellipses only replace equal substrings: only that the output strings are unique. For example, if given the two inputs "Herzkreislaufwiederbelebung" and "Geschwindigkeitsbegrenzung", and I wanted to trim to length=12 (including dots), it would be fine to return "Herzkreis..." and "Geschwind...". – Ken Feb 14 '11 at 20:37
  • @Ken Sounds like you could just sluggize them. – Orbling Feb 14 '11 at 20:56
  • @Ken - Right, your examples were clear but I guess my thinking was a little fuzzy. I got off track trying to come up with examples that couldn't be shortened enough and still retain uniqueness. – erickson Feb 14 '11 at 21:38

Sort the strings. Keep the first X characters of each string. If that prefix is shared with the neighboring string before or after it in sorted order, advance through the string until you find characters that differ from both neighbors. (If there are none, the string has no unique part; see the bottom of this post.) Add ellipses before and after those unique characters.

Note that this still might look funny:

Microsoft Office -> Micro...ffice
Microsoft Outlook -> Micro...utlook

I don't know what language you're looking to do this in, but here's a Python implementation.

def unique_index(before, current, after, size):
    '''Compare _current_ with _before_ and _after_ position by position and
        return the first index by which it has differed from both. If that never
        happens early enough to leave _size_ characters, return the index where
        the last _size_ characters of _current_ begin.'''
    before_unique = False
    after_unique = False
    for i in range(len(current)-size):
        #this will be incorrect in the case mentioned below
        if i > len(before)-1 or before[i] != current[i]:
            before_unique = True
        if i > len(after)-1 or after[i] != current[i]:
            after_unique = True
        if before_unique and after_unique:
            return i

    return len(current)-size

def ellipsize(entries, prefix_size, max_string_length):
    non_prefix_size = max_string_length - prefix_size #-len("...")? Post isn't clear about this.

    #Sort a copy so the caller's list is not modified. If you want to preserve the original order, keep a mapping from the sorted copy back to the original.
    entries = sorted(entries)

    ellipsized = []

    # you could probably remove all this indexing with something out of itertools
    for i in range(len(entries)):
        current = entries[i]

        #entry is already short enough, don't need to truncate
        if len(current) <= max_string_length:
            ellipsized.append(current)
            continue

        #grab empty strings if there's no string before/after
        if i == 0:
            before = ''
        else:
            before = entries[i-1]
        if i == len(entries)-1:
            after = ''
        else:
            after = entries[i+1]

        #Is the prefix unique? If so, we're done.
        current_prefix = entries[i][:prefix_size]    
        if not before.startswith(current_prefix) and not after.startswith(current_prefix):
            ellipsized.append(current[:max_string_length] + '...') #again, possibly -3

        #Otherwise find the unique part after the prefix if it exists.
        else:
            index = prefix_size + unique_index(before[prefix_size:], current[prefix_size:], after[prefix_size:], non_prefix_size)
            if index == prefix_size:
                header = ''
            else:
                header = '...'
            if index + non_prefix_size == len(current):
                trailer = ''
            else:
                trailer = '...'
            ellipsized.append(entries[i][:prefix_size] + header + entries[i][index:index+non_prefix_size] + trailer)
    return ellipsized

Also, you mention that the strings themselves are unique, but do they all have unique parts? For example, "Microsoft" and "Microsoft Internet Explorer 7" are two different strings, but the first has no part that distinguishes it from the second. If this can happen, you'll have to add something to your spec saying what to do in that case. (If you add "Xicrosoft", "MXcrosoft", "MiXrosoft", etc. to the mix with these two strings, there is no unique string shorter than the original that can represent "Microsoft".) Another way to think about it: if you have all possible X-letter strings, you can't map them all to distinct strings of X-1 or fewer letters. This is essentially a compression method, and no compression method can shrink every input.
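
Entries of this kind can be detected up front. The sketch below reads "unique part" as a contiguous substring, which is an assumption; under that reading a string has no unique part exactly when it appears whole inside another entry, because then every piece of it appears there too. The helper name `entries_without_unique_part` is hypothetical.

```python
def entries_without_unique_part(entries):
    """Return the entries that occur, in full, inside some other entry."""
    flagged = []
    for i, entry in enumerate(entries):
        others = entries[:i] + entries[i + 1:]
        if any(entry in other for other in others):
            flagged.append(entry)
    return flagged

problem = entries_without_unique_part(
    ["Microsoft", "Microsoft Internet Explorer 7"]
)
```

Here `problem` contains only "Microsoft". Any flagged entry has to be shown in full (or handled by some extra rule), since no excerpt of it can tell it apart from the longer string.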

Results from original post:

>>> for entry in ellipsize(["Microsoft Internet Explorer 6", "Microsoft Internet Explorer 7", "Microsoft Internet Explorer 8", "Mozilla Firefox 3", "Mozilla Firefox 4", "Google Chrome 14"], 7, 20):
    print(entry)

Google Chrome 14
Microso...et Explorer 6
Microso...et Explorer 7
Microso...et Explorer 8
Mozilla Firefox 3
Mozilla Firefox 4
>>> for entry in ellipsize(["Minutes of Company Meeting, 5/25/2010 -- Internal use only", "Minutes of Company Meeting, 6/24/2010 -- Internal use only", "Minutes of Company Meeting, 7/23/2010 -- Internal use only"], 15, 40):
    print(entry)

Minutes of Comp...5/25/2010 -- Internal use...
Minutes of Comp...6/24/2010 -- Internal use...
Minutes of Comp...7/23/2010 -- Internal use...
user470379
  • I don't understand. First X chars of which string? Unique chars where? How does this help with the case (above) where there are only 2 strings but every character is unique? – Ken Feb 14 '11 at 23:40
  • I just added a lot to my answer to flesh it out. – user470379 Feb 15 '11 at 00:04
  • 1
    I'm still working on the code, but the compression comment is an odd one. This is essentially a *lossy* compression method, and lossy compression can definitely compress all inputs. This case is a bit more complex because I want the outputs to be unique, but then the compression of an input token here depends entirely on the other tokens in the input, and given certain reasonable constraints (e.g., number of inputs will always be vanishingly small compared to number of possible strings), it doesn't seem inherently impossible. – Ken Feb 15 '11 at 00:14
  • @Ken That's true, I hadn't thought about the fact it was a lossy compression algorithm. But the fact still remains, you can't convert >2^n inputs to all unique values less than n bits. – user470379 Feb 15 '11 at 00:19
  • 1
    @Ken Thinking about it some more, it's not really a lossy compression algorithm. You want each string of the output to refer to a __single unique__ string in the original. I assert that for every shortened string you use in the output to refer to the original string from the input, I can add a single string to the input such that the output string refers to the original input and the added input. – user470379 Feb 15 '11 at 01:47
  • Sure, that assertion is true; that's basically part of the problem statement. (That's what is meant by "the compression of an input token here depends entirely on the other tokens in the input".) It won't be reversible. It doesn't even have to be streamable. I think it still qualifies as lossy compression, as (a) it's an encoding of data using fewer symbols than the original, and (b) it necessarily discards data to achieve this -- but if you don't like that name then we can call it something else. :-) – Ken Feb 15 '11 at 02:29
  • @Ken But you __do__ want it to be reversible. Otherwise I could give you "Micros..Int..." "Micros...nte..." and "Micros...ter...". Those are three unique strings formed by shortening the input strings. If your objection to that is that you can't tell which one corresponds to 6, which one corresponds to 7, and which one corresponds to 8, then you do in fact want it to be reversible. – user470379 Feb 15 '11 at 03:24
  • I'm not sure I'd call it "reversible" or "lossless" if the only way to recover the original requires having a copy of the original to compare against. :-) – Ken Feb 15 '11 at 17:08
  • @Ken I never called it lossless. And I'm pretty sure you want a reversible function, otherwise you would just be using a MD5 hash. Just trying to help you get a better feel for what you want this algorithm to do, as I sense based on your original post that you're pretty unsure and still trying to feel it out. – user470379 Feb 15 '11 at 17:55