Not sure if this fits here. If there is something like "Computer history and future" please direct me there.
Question
Since the rise of computers, have there been any character encodings (or markup languages on top of them) that differentiate between uppercase and lowercase letters not by defining the entire alphabet twice (once in capitals and once in lowercase), but by adding a modifier or keyword that specifies that a character is in a specific case?
Why Would Someone Do This?
Maybe to encode text in less space, or simply because the authors considered the choice between ABC and abc more cosmetic than meaningful, which brings me to a lengthy and philosophical background explanation, see the next section:
Skip everything from here if you are not interested in how I came up with this question.
Representation and Meaning
"Modern" encodings like ASCII and UTF-8 differentiate between uppercase and lowercase by assigning individual code points to each. This fundamental decision is so ubiquitous today, that concepts like case sensitivity appear rather natural to us. But when comparing Morse code, ASCII and Unicode, there are are a lot of distinctions that were traditionally stored in markup languages on top of the plain text encoding (e.g. rtf, tex, html, doc) but could be stored in plain text today:
- Letter casing ABC, abc
- Style ABC, 𝐀𝐁𝐂, 𝐴𝐵𝐶, 𝔸𝔹ℂ, 𝖠𝖡𝖢, 𝔄𝔅ℭ, 𝓐𝓑𝓒
- Decorations ABC, A̲B̲C̲, A̶B̶C̶, A̷B̷C̷
- Color
Very old encodings like Braille and Morse code do not encode letter casing, but ASCII does. In fact, it forces you to pick either a capital or a lowercase letter for every character: there is no case-neutral default if you don't care.
Unicode and its UTF encodings continued on that route, often forcing you to differentiate not only between letter cases, but also between regular, italic, bold; sans-serif, serif; script, Fraktur; and more. But Unicode also supports modifiers. Instead of defining the entire alphabet again just for underlined/colored/... letters, there are combining characters that behave similarly to keywords in markup languages: a special (sequence of) code point(s) indicates that the symbol it is attached to should be underlined / have a different color / ... .
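For illustration, here is a minimal Python sketch using two real combining characters, U+0332 (COMBINING LOW LINE) and U+0336 (COMBINING LONG STROKE OVERLAY): the base letters stay plain A, B and C, and the underlined or struck-through look comes entirely from the modifier code points that follow them.

```python
# Combining characters act like inline "modifiers": they follow a base
# character instead of requiring a separate underlined/struck alphabet.
UNDERLINE = "\u0332"  # COMBINING LOW LINE
STRIKE = "\u0336"     # COMBINING LONG STROKE OVERLAY

def decorate(text: str, modifier: str) -> str:
    """Append the combining modifier after every character of text."""
    return "".join(ch + modifier for ch in text)

print(decorate("ABC", UNDERLINE))                    # A̲B̲C̲
print(decorate("ABC", STRIKE))                       # A̶B̶C̶
print(len("ABC"), len(decorate("ABC", UNDERLINE)))   # 3 6 -> twice the code points
```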
Unicode aims at encoding meaning, not representation. We have all these seemingly cosmetic variants in Unicode because they convey a different meaning to someone. However, the more "meaningful" distinctions are made, the more I get the feeling that standardizing meaning without representation is impossible. Some examples:
Purely cosmetic representation that became standardized meaning
- Lowercase letters were invented as a script. If lowercase were invented today, we would simply call it a "font" and consider it purely cosmetic.
- Mathematicians used bold letters for non-scalar variables. As writing bold by hand was tedious, some teachers drew only the outline of those bold letters, resulting in double-struck "blackboard bold" fonts (ℂ). Nowadays, the styles N and ℕ imply very different things to mathematicians.
Standardized meaning that changed based on the representation
- You might be irritated at your grandma writing things like "Your grandpa died 😂" because she misinterpreted the emoji as being sad. But let's be honest: do you really know the standardized meaning of emojis, or do you simply use them like the people around you, turning them into a mix of inside jokes and a full-blown cant? 🍆 and 🍑 might be popular emojis, but not because they are standardized as eggplant and peach. And if you are using 😤 to express anger, are you really better than your grandma?
An obscure mix of both
- The variant pi ϖ probably started out as a cursive/curly lowercase pi π, but might have been misread as an overlined lowercase omega ω and is therefore known as pomega and drawn more like an omega than a pi.
In an alternate universe ...
I wondered if history could have taken another turn, where people looked at these problems and thought: You know what? We cannot tell cosmetics and meaning apart. So let's try to create an encoding for the plainest of plain texts, where you cannot even distinguish between uppercase and lowercase. Then add another encoding or markup language on top that offers tons of modifiers or keywords to express whatever cosmetics you like.
In such a world, "plain text" could mean something like "a sequence of regular keystrokes", where computer keyboards send standardized and internationally unique scan codes.
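To sketch what such a layered design could look like (purely hypothetical, not any historical encoding; the CAPS_NEXT code point is invented for illustration), the plain-text layer below stores only caseless letters, and the markup layer inserts an explicit "capitalize the next letter" modifier instead of duplicating the alphabet:

```python
# Hypothetical sketch: plain text is caseless; a modifier marks capitals.
CAPS_NEXT = "\x01"   # made-up modifier code point, not part of any real standard

def encode_caseless(text: str) -> str:
    """Store text as caseless letters plus explicit case modifiers."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append(CAPS_NEXT + ch.lower())
        else:
            out.append(ch)
    return "".join(out)

def render(encoded: str) -> str:
    """Re-apply the case modifiers for display."""
    out, caps = [], False
    for ch in encoded:
        if ch == CAPS_NEXT:
            caps = True
        else:
            out.append(ch.upper() if caps else ch)
            caps = False
    return "".join(out)

print(repr(encode_caseless("Hello World")))        # '\x01hello \x01world'
print(render(encode_caseless("Hello World")))      # 'Hello World'
```

The modifier costs one extra code point per capital letter, but the base alphabet then needs only 26 letter codes plus one modifier instead of 52 separate codes, which is exactly the trade-off hinted at under "Why Would Someone Do This?".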