9

Does anyone know of a library for encoding a number of primitive types (like integers, floats, strings, etc) into a string but preserving the lexicographical order of the types?

Ideally, I'm looking for a C++ library, but other languages are fine too. Also, one can assume that the format does not need to be encoded in the string itself (that is, if it's int64/string/float then the encoded string does not need to encode this information, only encoding the data is enough).

nilton
  • 798
  • 1
  • 8
  • 11
  • Could you clarify what you want? –  Nov 19 '09 at 03:35
  • 1
    What do you mean by lexicographic order with respect to integers and floats? Their lexicographic sorting depends on how you encode them, e.g. binary, octal, decimal, hex etc. (assuming leading digits removed) all will give different lexicographical sorts for a given list of numbers. – Barry Kelly Nov 19 '09 at 04:14
  • By lexicographical order I mean, the original order of the primitive types (not the string, obviously). Say, encode "(a, b, c)" into a string "s", such that "(a, b, c) < (a', b', c')" implies that "s < s'" for all a, b, c. – nilton Nov 20 '09 at 13:09

4 Answers4

10

Take a look at this paper ("Efficient Lexicographic Encoding of Numbers") which shows how to represent any numeric type as a string such the lexicographic order of the strings is the same as the numerical order of the underlying numbers. It copes with arbitrary length numbers.

http://www.zanopha.com/docs/elen.pdf

Peter
  • 101
  • 1
  • 3
  • Interesting... I'm taking a look at the paper. – nilton Jul 30 '10 at 05:48
  • 3
    Just implemented this. Works provided a minor modification. The ASCII `'+'` character has integer value 43, which is lower and `'0'` (integer value 48). This provides incorrect sorting semantics. By using a character that is higher up in the ASCII plane, like `'='` (integer value 61) gives correct results even when comparing strings with a different number of prefix characters. – André Caron Sep 15 '14 at 17:56
3

I had the problem of converting integers and longs to strings which preserve ordering. And since I was working in Java, I only had signed types.

My algorithm was very simple:

  1. Flip the sign bit (toEncode ^ Long.MAX_VALUE for longs) otherwise negative numbers are greater than positive numbers.
  2. Do a modified base64 encoding of the bytes. Unfortunately, the normal base64 encoding doesn't preserve ordering; the special characters (+ and /) are after the numbers which are after the characters. This is completely backwards from ASCII. My modified encoding simply uses the ASCII ordering. (To make it clear it wasn't normal base64, I changed the special chars to - and _ with ~ as the padding. These are still useable within an URL, which was another a constraint I had.)
Michael Deardeuff
  • 10,386
  • 5
  • 51
  • 74
  • Is your modified Base 64 encoder part of any open-source Java library? – markvgti Jan 09 '19 at 11:00
  • @markvgti Looks like https://www.codeproject.com/Articles/5165340/Sortable-Base64-Encoding – tejasvi88 Sep 14 '21 at 13:57
  • Modifying the character mapping such that it follows ASCII ordering for https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/Base64.Encoder.html did not work for me but doing it for https://guava.dev/releases/17.0/api/docs/com/google/common/io/BaseEncoding.html worked and preserved the order. – dtc Sep 29 '21 at 22:48
2

BTW ... In Amazon Web Service's SimpleDB, all data are stored as strings. Its select comparators use lexicographic ordering. AWS provides utility functions to encode various types. For example, integers are encoded knowing the range of the integers apriori and adjusting via zero-padding and offsets (e.g. for negative integers). You could of course give it the worst possible range.

See "Query 201: Tips and Tricks for Amazon SimpleDB Query" - http://aws.amazon.com/articles/1232

http://typica.s3.amazonaws.com/com/xerox/amazonws/sdb/DataUtils.html

Ken Lin
  • 1,819
  • 21
  • 21
-1

Just write numeric values in a fixed column width with leading zeros, and strings as normal. So like this:

0.1 -> 0000000.1000000
123 -> 0000123.0000000
foo -> foo
X   -> X

Then you can sort as text (e.g. Unix sort without -n). How about that?

John Zwinck
  • 239,568
  • 38
  • 324
  • 436
  • I would like to avoid encoding numbers in fixed width. Also, encoding strings as themselves won't work give the right sorting order if the string has the same character that you're using as a separator. – nilton Nov 20 '09 at 13:14
  • Then write your own sort routine. – John Zwinck Nov 20 '09 at 13:38