Looking at how we handle superscripts (and subscripts), I see that on the one hand they are treated like a style.

i.e.

x<sup>y</sup>

becomes:

xʸ

But in Unicode we seem to have superscripts and subscripts instead as individual glyphs.

For example:

x U+207f

becomes:

xⁿ
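As a quick check of what that sequence contains, here is a small Python sketch using the standard `unicodedata` module. The variable name `s` is only for illustration.

```python
import unicodedata

# 'x' followed by U+207F, the two code points from the example above
s = "x\u207f"

for ch in s:
    # Print each code point with its official Unicode character name
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# U+0078 LATIN SMALL LETTER X
# U+207F SUPERSCRIPT LATIN SMALL LETTER N
```

So the superscript n is a separate character with its own code point, not an ordinary n with a style applied.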

I guess it makes sense to encode common uses this way, as it is more compact. Is there a combiner (if that's the correct term) in Unicode that means "treat the following symbols as superscripted", and if not, why not?

The context is https://langdev.stackexchange.com/a/1962/285 where we are talking about representing exponentiation in a programming language.

It would be nice to have a Unicode 'value' (a combiner rather than a character?) that can represent the exponentiation operation and render it as a superscript.

So that instead of writing:

 x**y

You could write

x &xSomeValue; y

and have it render as:

xʸ

Does such a thing exist in Unicode, and if not, what is the rationale behind Unicode using something else (such as superscripts for specific glyphs only) instead?

There is an existing question whose answer covers one part of this question:

"Unicode does not support making arbitrary characters into superscripts."

It does not address the rationale part. Also, the situation may have changed in the last three years.


Expand on rationale

It seems to me that a more rational design for Unicode would be to take one of the following choices:

  • provide super and subscript versions of all characters that could exist in that position

  • provide a "super" combiner that turns the next single symbol into a superscript version of itself.

  • treat superscripts like combining glyphs into ideograms using Ideographic Description Sequences, e.g.

    2^(a+b) -> 2ᵃ⁺ᵇ

    where ^( and ) would be special Unicode 'combiners'.

Why has Unicode chosen (if it has) not to take one or more of these approaches?

The first option requires many symbols. The second option is super simple, but it could make more symbols representable than intended (e.g. a superscript smiley), so you might have to add rules about that. The third option is more like encoding style than encoding symbols.
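To make the first option concrete, here is a rough Python sketch of what you can do with the superscript characters that are already encoded. The table `SUPERSCRIPTS` and the helper `to_superscript` are my own hypothetical names, and the table is deliberately partial (digits, a few operators, n and i).

```python
# Hypothetical, deliberately partial table of already-encoded superscripts
SUPERSCRIPTS = {
    "0": "\u2070", "1": "\u00b9", "2": "\u00b2", "3": "\u00b3",
    "4": "\u2074", "5": "\u2075", "6": "\u2076", "7": "\u2077",
    "8": "\u2078", "9": "\u2079",
    "+": "\u207a", "-": "\u207b", "=": "\u207c",
    "(": "\u207d", ")": "\u207e",
    "n": "\u207f", "i": "\u2071",
}


def to_superscript(text):
    """Return text as superscript characters, or None if any
    character is missing from the table above."""
    if all(ch in SUPERSCRIPTS for ch in text):
        return "".join(SUPERSCRIPTS[ch] for ch in text)
    return None


print("2" + to_superscript("n+1"))  # 2ⁿ⁺¹
print(to_superscript("n*2"))        # None, '*' is not in this table
```

Any exponent that uses a character outside the table cannot be written this way, which is the coverage problem with option one.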

What we currently have seems worse than all three. The Unicode designers are not stupid so they must be prioritising something else. What and why?


Slightly related: I cannot think of a maths symbol for exponentiation. Typically we use ^ in programming, i.e.

xʸ = x^y

An up arrow has also been suggested, but this doesn't look right to me:

x↑y

Another aside: xʸ (x^y) is how exponentiation is typically displayed on a calculator. Why is there no Unicode code point for this?

Bruce Adams
  • note that `x↑y` is already used in math for a different meaning: [Knuth's up-arrow notation](https://en.wikipedia.org/wiki/Knuth%27s_up-arrow_notation) – phuclv Jul 05 '23 at 06:45

1 Answer

The term is "combining character", as opposed to "precomposed character". No such superscript combining characters exist, because subscripting and superscripting are formatting features. Unicode is just a character set that maps characters to numbers (code points). It deals only with plain text and is not meant for formatting text:

Rich Text. Also known as styled text. The result of adding information to plain text. Examples of information that can be added include font data, color, formatting information, phonetic annotations, interlinear text, and so on. The Unicode Standard does not address the representation of rich text. It is expected that systems and applications will implement proprietary forms of rich text. Some public forms of rich text are available (for example, ODA, HTML, and SGML). When everything except primary content is removed from rich text, only plain text should remain.

https://unicode.org/glossary/#rich_text (emphasis mine)

You can't make a letter bold or italic, or move a letter above or below the baseline, purely with Unicode code points. Therefore Unicode has no way to format math expressions either (except for very simple ones).
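For comparison, this is what a real combining character looks like, as opposed to the superscript "combiner" asked about in the question. It is a minimal Python sketch using the standard `unicodedata` module; the variable names are only for illustration.

```python
import unicodedata

decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

print(unicodedata.name("\u0301"))       # COMBINING ACUTE ACCENT
print(unicodedata.combining("\u0301"))  # 230, non-zero marks a combining character

# NFC normalization composes the two-code-point sequence into one code point
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

The combining accent changes which character is being written; it does not change how an existing character is styled or positioned.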

You can find more of the rationale in the Unicode FAQ:

Q: What is the difference between “rich text” and “plain text”?

A: Rich text is text with all its formatting information: typeface, point size, weight, kerning, and so on. Plain text is the underlying content stream to which formatting is applied.

One key distinction between the two is that rich text breaks the text up into runs and applies uniform formatting to each run. As such, rich text is inherently stateful. Plain text is not stateful. It should be possible to lose the first half of a block of plain text without any impact on rendering.

Unicode, by design, only deals with plain text. It doesn't provide a generalized solution to rich text issues.

Q: Why doesn't Unicode have a full set of superscripts and subscripts?

A: The superscripted and subscripted characters encoded in Unicode are either compatibility characters encoded for roundtrip conversion of data from legacy standards, or are actually modifier letters used with particular meanings in technical transcriptional systems such as IPA and UPA. Those characters are not intended for general superscripting or subscripting of arbitrary text strings—for such textual effects, you should use text styles or markup in rich text, instead.

Q: I've spotted a sign which uses superscript text for a meaningful abbreviation. Doesn't that mean that all the superscripted letters should be encoded in Unicode?

A: No. It's common for specific formatting to be used to convey some of the semantic content—the meaning—of a text. As for italics, bold, or any other stylistic effect of this sort conveying meaning, the appropriate mechanism to use in such cases is style or markup in rich text.

https://www.unicode.org/faq/ligature_digraph.html
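You can see the "compatibility character" status from the FAQ in the character data itself. Here is a small Python sketch, again using only the standard `unicodedata` module; the sample string is just my own example.

```python
import unicodedata

# SUPERSCRIPT TWO and SUPERSCRIPT LATIN SMALL LETTER N both carry a
# <super> compatibility decomposition to the ordinary character
for ch in "\u00b2\u207f":
    print(f"U+{ord(ch):04X}", unicodedata.decomposition(ch))

# U+00B2 <super> 0032
# U+207F <super> 006E

# Compatibility normalization (NFKC) folds them back to plain characters
plain = unicodedata.normalize("NFKC", "x\u00b2 + y\u207f")
print(plain)  # x2 + yn
```

After NFKC the exponent is just an ordinary character next to the base, so any software that normalizes this way would lose the meaning of an expression written with these characters.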

That means you must use a math rendering tool or markup such as LaTeX, MS Equation Editor, MathType, MathML... If you don't like LaTeX, one of the simplest math renderers is AsciiMath, but LaTeX is typically the "standard".
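For example, the expressions from the question can be written in LaTeX roughly like this (a minimal document, just as a sketch):

```latex
\documentclass{article}
\begin{document}
% Superscripts use ^ and subscripts use _; braces group the whole argument
$x^{y}$ \quad $2^{a+b}$ \quad $x_{i}$
\end{document}
```

Here the superscript is markup applied to ordinary characters, which is the rich text approach the FAQ recommends.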

phuclv
  • See the update to my question. I want to know more about the rationale. You've answered most of it, but I seek additional insight. – Bruce Adams Jul 05 '23 at 08:22
  • Also https://stackoverflow.com/q/55331246/995714 is a dead link – Bruce Adams Jul 05 '23 at 08:30
  • _"It only deals with plain text"_ [An interesting talk about plain text at NDC Oslo 2022 by Dylan Beattie](https://youtu.be/gd5uJ7Nlvvo) – phuzi Jul 05 '23 at 08:38
  • I've edited the answer to add some more info. That question has unfortunately been deleted by the bot because it got no activity but you can access it once you have enough reputation – phuclv Jul 05 '23 at 17:33
  • I believe that requires 10K rep. I've been here 10 years, and I don't fancy my chances of gaining 5K more rep any time soon. Rep chasing is not a game I enjoy. – Bruce Adams Jul 05 '23 at 19:42
  • See also https://www.quora.com/Why-is-there-no-character-for-superscript-q-in-Unicode & https://www.quora.com/Why-does-Unicode-have-separate-characters-for-typographic-variants-of-some-character-like-smart-quotes-or-sub-and-superscript-numbers-and-letters-Wouldnt-it-make-more-sense-to-leave-those-features-to-the-display-in Apologies for the heresy of linking to Quora from SO. – Bruce Adams Jul 06 '23 at 16:20