Get the string that is the midpoint between two other strings

Question

Is there a library or code snippet available that can take two strings and return the exact or approximate mid-point string between the two strings?

Preferably the code would be in Python.

Background:

This seems like a simple problem on the surface, but I'm kind of struggling with it:

Clearly, the midpoint string between "A" and "C" would be "B".
With base64 encoding, the midpoint string between "A" and "B" would probably be "Ag"
With UTF-8 encoding, I'm not sure what the valid midpoint would be because the middle character seems to be a control character: U+0088 c2 88 <control>

Practical Application:

I am asking because I was hoping to write a map-reduce-type algorithm to read all of the entries out of our database and process them. The primary keys in the database are UTF-8 encoded strings with random distributions of characters. The database we are using is Cassandra.

I was hoping to get the lowest key and the highest key out of the database, break that range into two by finding the midpoint, then break those two ranges into smaller sections by finding each of their midpoints, and so on until I had a few thousand sections. I could then read each section asynchronously.

Example if the strings were base-16 encoded: (Some of the midpoints are approximate):

Starting highest and lowest keys:  '000'                'FFF'
                                   /   \              /       \
                              '000'     '8'         '8'       'FFF'
                              /   \     /  \       /  \       /   \
Result:                  '000'    '4' '4'  '8'   '8'  'B8'  'B8'  'FFF'
(After 3 levels of recursion)

This seems to be implemented in CassandraDB source code itself. See https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/dht/OrderPreservingPartitioner.java#L47 — , May 27 '13 at 04:50
Doesn't this look like what you're searching for : http://stackoverflow.com/a/2510816/624829 — Zeugma, May 29 '13 at 10:33

score 2 · Accepted Answer · answered Jun 01 '13 at 03:00

Unfortunately, not all sequences of bytes are valid UTF-8, so simply taking the midpoint of the UTF-8 byte values, as in the following code, does not always produce a valid string.

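def midpoint(s, e):
    '''Midpoint of start and end strings'''
    # Interpret each string's UTF-8 bytes as one big-endian integer.
    (sb, eb) = (int.from_bytes(bytes(x, 'utf-8'), byteorder='big') for x in (s, e))
    # Integer division avoids float rounding errors on long keys.
    midpoint = (eb - sb) // 2 + sb

    # Use the minimum number of bytes so no spurious leading zero byte appears.
    midpoint_bytes = midpoint.to_bytes((midpoint.bit_length() + 7) // 8, byteorder='big')
    # Raises UnicodeDecodeError when the bytes are not valid UTF-8.
    return midpoint_bytes.decode('utf-8')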

Basically this code converts each string into an integer represented by the sequence of bytes in memory, finds the midpoint of those two integers, and attempts to interpret the "midpoint" bytes as UTF-8 again.
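
To make the failure mode concrete, here is a small sketch of the same byte-to-integer round trip for one pair of keys. The helper names to_int and to_bytes and the example keys are chosen only for illustration; they are not part of any library.

```python
def to_int(s):
    """Interpret the UTF-8 bytes of s as one big-endian integer."""
    return int.from_bytes(s.encode('utf-8'), byteorder='big')


def to_bytes(n):
    """Convert a non-negative integer back to the shortest big-endian byte string."""
    return n.to_bytes(max(1, (n.bit_length() + 7) // 8), byteorder='big')


# U+00FF encodes as c3 bf and U+0100 encodes as c4 80.
lo, hi = to_int('\u00ff'), to_int('\u0100')
mid_bytes = to_bytes(lo + (hi - lo) // 2)

try:
    mid_bytes.decode('utf-8')
    is_valid = True
except UnicodeDecodeError:
    # c4 starts a two-byte sequence, but 1f is not a continuation byte.
    is_valid = False

print(mid_bytes, is_valid)
```

For this pair the midpoint bytes are c4 1f, which cannot be decoded, even though both endpoints are perfectly ordinary strings.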

Depending on exactly what behavior you would like, the next step could be to replace the invalid bytes in midpoint_bytes with some kind of replacement character to form a valid UTF-8 string. For your problem it might not matter much exactly which character you use for the replacement so long as you're consistent.
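
One possible way to do that replacement, sketched here as an assumption rather than a recommendation, is Python's built-in errors='replace' handler, which substitutes U+FFFD for each undecodable part. The function name decode_lenient is made up for this example.

```python
def decode_lenient(raw):
    """Decode bytes as UTF-8, replacing undecodable parts with U+FFFD."""
    return raw.decode('utf-8', errors='replace')


# c4 1f is not valid UTF-8: c4 needs a continuation byte after it.
result = decode_lenient(b'\xc4\x1f')
print(repr(result))
```

Be aware that the replaced string no longer corresponds to the midpoint bytes. If those bytes came from the midpoint of '\u00ff' and '\u0100', the result starts with U+FFFD and therefore sorts after both keys, so this only guarantees a valid string, not one that lies between the endpoints.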

However, since you're trying to partition the data and don't seem to care too much about the string representation of the midpoint, another option is to just leave the midpoint representation as an integer and convert the keys to integers while doing the partition. Depending on the scale of your problem this option may or may not be feasible.
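
A minimal sketch of that integer-only approach follows. It assumes the database orders keys by their raw bytes, that no key is longer than a chosen width, and that keys do not end in NUL bytes; key_to_int, split_range and width are hypothetical names introduced here, not part of any Cassandra client.

```python
def key_to_int(key, width):
    """Map a UTF-8 key to an integer, right-padding with zero bytes to `width`.

    Padding on the right keeps integer order consistent with byte-wise
    (lexicographic) order for keys of different lengths.
    """
    raw = key.encode('utf-8')
    if len(raw) > width:
        raise ValueError('key is longer than width bytes')
    return int.from_bytes(raw.ljust(width, b'\x00'), byteorder='big')


def split_range(lo, hi, levels):
    """Recursively bisect [lo, hi] and return the sorted list of boundaries."""
    if levels == 0 or hi - lo < 2:
        return [lo, hi]
    mid = lo + (hi - lo) // 2
    left = split_range(lo, mid, levels - 1)
    right = split_range(mid, hi, levels - 1)
    # Drop the duplicated midpoint when joining the two halves.
    return left + right[1:]


# Three levels of recursion over the base-16 example range '000'..'FFF'.
boundaries = split_range(0x000, 0xFFF, 3)
print([format(b, '03X') for b in boundaries])
```

Each consecutive pair of boundaries is one section. To decide which section a key belongs to, convert it with key_to_int and compare integers, so no midpoint ever has to be turned back into a string.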

score 2 · Answer 2 · answered Jun 01 '13 at 22:47

Here's a general solution that gives an approximate midpoint m between any two Unicode strings a and b, such that a < m < b if possible:

from os.path import commonprefix

# This should be set according to the range and frequency of
# characters used.
MIDCHAR = u'm'


def midpoint(a, b):
    prefix = commonprefix((a, b))
    p = len(prefix)
    # Find the code points at the position where the strings differ.
    # ca is -1 when a is a proper prefix of b, so it sorts below every
    # real code point.
    ca = ord(a[p]) if len(a) > p else -1
    cb = ord(b[p])
    # Find the approximate middle code point.
    cm = cb // 2 if ca < 0 else (ca + cb) // 2
    # If a middle code point was found, add it and return.
    if ca < cm < cb:
        return prefix + chr(cm)
    # If b still has more characters after this, then just use
    # b's code point and return.
    if len(b) > p + 1:
        return prefix + chr(cb)
    # Otherwise, if cb == 0, then a and b are consecutive so there
    # is no midpoint. Return a.
    if cb == 0:
        return a
    # Otherwise, use part of a and an extra character so that
    # the result is greater than a.
    i = p + 1
    while i < len(a) and a[i] >= MIDCHAR:
        i += 1
    return a[:i] + MIDCHAR

The function assumes that a < b. Other than that, it should work with arbitrary Unicode strings, even ones containing u'\x00' characters. Note also that it may return strings containing u'\x00' or other nonstandard code points. If there is no midpoint due to b == a + u'\x00' then a is returned.
