0

Given an interval defined by two strings, [x, y], and third string s between them, is there a way to calculate the percentage of the whole interval from x to s. Preferably which honors collation (case matters vs not, for instance). An approximate answer is reasonable.

For example, given the strings 'a' and 'c', 'b' is halfway across, in the normal Latin-1 collation, so we'd expect an answer of 50%.

The obvious, and wrong, way is just to trust the encoding to carry the day. Unfortunately that ignores the fact the in a case insensitive collation, 'B' is in the interval ['a', 'c'], and is equivalent to 'b', even though 'B' is encoded as a higher number than 'c'. So the encoding doesn't have this information unless we go through some normalization, which might be expensive.

I'm hoping someone has thought of a better way. It seems like something that should come up in database implementation quite a bit, but I haven't seen anything in the literature, or online, alluding to this. To be fair, it's entirely possible I'm looking in the wrong places and under the wrong names. String distance questions seem to be dominated by edit distance, not this sort of collation related distance.

It's also possible that the question depends on the encoding, in addition to the collation. In that case, I'm most interested in the various UTF encodings.

Joel
  • 5,618
  • 1
  • 20
  • 19
  • 2
    Any chance of an [MCVE](https://stackoverflow.com/help/mcve)? – Bohemian Mar 03 '22 at 23:08
  • It's up there in the text, second paragraph --> In Latin-1, using the input 'b' and the interval ['a', 'c'], I'd expect to get 0.5 (or 50%) back because 'b' is halfway across the interval. – Joel Mar 03 '22 at 23:11
  • That's not an MCVE. That's a test case expressed in English. An MCVE is runnable code. Please also add the appropriate database tag to your question - DBs have different capabilities. – Bohemian Mar 03 '22 at 23:13
  • Since this is a database engine implementation question, it's not beholden to an existing database, nor am I asking for a database that solves the problem. Similarly, there is no runnable code because this is a general algorithm question, not something that is currently running in any code I have access to. I could create phony function names with no implementation, but that doesn't seem particularly useful. In short, an MCVE doesn't seem appropriate here under your definition. – Joel Mar 03 '22 at 23:18
  • 3
    maybe give more examples, like what's the expected answer for ["Hello", "world!"] and the input "database-agnostic" ? – Alois Christen Mar 04 '22 at 09:29
  • 4
    "Given an interval defined by two strings, [x, y]," then you should give explicitly how such interval is defined, since there is no standard definition of it. – Renzo Mar 04 '22 at 10:41
  • Why would a database need to calculate a percentage distance? I have never seen the need for this. – Dialecticus Mar 04 '22 at 16:30
  • The query optimizer is one. Column statistics are usually histograms which have bucket ranges defined by a couple hundred endpoints. When you want to estimate a range within a bucket, you need to decide how many of the rows stored in the bucket belong to the range. The usual solution is to assume uniformity within the bucket and return the fraction represented by the solution above. For numeric, this is straight forward. For strings, less so. – Joel Mar 05 '22 at 21:41
  • @Renzo I'm not sure I understand the question, beyond noting that the collation (sort order and equalities) and byte encoding should be set ahead of time -- in the example above, when the histogram is built. With the collation set, two endpoints should make the range and positioning of the search value unambiguous. – Joel Mar 05 '22 at 21:48

0 Answers0