12

I'm trying to implement a cross-platform (desktop browsers, iOS, & Android) typography system that allows users to input any Unicode string.

What are some strings I should use to stress-test my system and ensure the most nines of users will have a good experience? Is there a standard or de-facto standard list that I can also use?

Ky -
  • 30,724
  • 51
  • 192
  • 308
  • If this is off-topic here, please direct me somewhere I can find my answer. – Ky - Dec 30 '15 at 22:45
  • Doesn't seem off-topic, just a bit too vague for it to be likely you'll get much useful feedback. – pvg Dec 30 '15 at 22:48
  • @pvg any idea how I could make it more specific? – Ky - Dec 30 '15 at 23:16
  • Well, you say input, then talk about rendering. You mention several different platforms, all of which have their own font rendering and input systems, some with limited end-user control. So it's not really obvious what you're doing, what you're trying to achieve, what specific problems you are encountering or hoping to avoid, etc. – pvg Dec 30 '15 at 23:24
  • I'm creating a view which displays text (that can be supplied by a user) in fancy typographical styles (italic, colored, rotated, centered, etc.). What I want to achieve is ensuring any text the user supplies will render as intended. What I want to avoid is text that is unreadable, or otherwise does not convey the user's intended meaning, solely because of the chosen arrangement of characters. – Ky - Dec 30 '15 at 23:31
  • There isn't any way to ensure that, in a cross-platform way. Even the samples you have already fail on Chrome OS X, let alone IE or Chrome for Windows and those are just a couple that I tried, although, again, the specifics are unclear. A view in what? A web browser? An app? Etc. – pvg Dec 31 '15 at 00:25
  • @pvg A custom view in an application. Any rendering problems, I can fix manually. This is why I want to know the toughest problems in Unicode, so I can test for them and fix them. – Ky - Jan 04 '16 at 14:34
  • +1 from me for the samples you already have. Fascinating to see how well modern browsers and VS Code handle this stuff. – HeyHeyJC Jul 25 '18 at 21:10
  • @HeyHeyJC thanks! I've separated them out into their own answer, since it seems maybe I already have a good enough list to be a good answer – Ky - Jul 26 '18 at 13:15

3 Answers

15

Here are some strings that I use in tests like that:

  • Vertically-stacked characters: Z̤͔ͧ̑̓ä͖̭̈̇lͮ̒ͫǧ̗͚̚o̙̔ͮ̇͐̇
  • Right-to-left words: اختبار النص
  • Mixed-direction words: من left اليمين to الى right اليسار
  • Mixed-direction characters: a‭b‮c‭d‮e‭f‮g
  • Very long characters: ﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽
  • Emoji with skintone variations:
  • Emoji with gender variations: ‍♀️‍♂️
  • Emoji created by combining codepoints: ‍❤️‍‍‍‍‍️‍⚧️
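
Emoji sequences tend to get mangled by copy/paste, so it can help to build them from explicit code points in test code. Here's a minimal sketch in Python; the specific emoji (waving hand, shrug, family, flag) and the names `EMOJI_TESTS` and `describe` are my own picks for illustration, not a standard list:

```python
import unicodedata

ZWJ = "\u200D"   # ZERO WIDTH JOINER, glues emoji into one visible glyph
VS16 = "\uFE0F"  # VARIATION SELECTOR-16, requests emoji presentation

# Illustrative samples only; swap in whichever sequences matter to you.
EMOJI_TESTS = {
    # waving hand + medium skin tone modifier
    "skin_tone": "\U0001F44B\U0001F3FD",
    # person shrugging + ZWJ + female sign + VS16
    "gender": "\U0001F937" + ZWJ + "\u2640" + VS16,
    # man + ZWJ + woman + ZWJ + girl
    "family": ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467"]),
    # two regional indicator symbols (J, P) that render as one flag
    "flag": "\U0001F1EF\U0001F1F5",
}

def describe(text):
    """List the Unicode name of every code point (hex fallback if unnamed)."""
    return [unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in text]

for label, sample in EMOJI_TESTS.items():
    # len() counts code points, not what the user perceives as one emoji.
    print(label, sample, len(sample), describe(sample))
```

Each of these should render as a single glyph on a platform that supports it, even though the string holds several code points. That mismatch is exactly what's worth testing in cursor movement, truncation, and rotation.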
Ky -
1

Some others:

  • Reversible (mirrored) characters in right-to-left scripts. For example, parentheses get reversed for display in Hebrew. The Unicode spec has a whole list of these characters (those with the Bidi_Mirrored property).
  • Scripts with letter shaping: Arabic, Devanagari (used for Hindi), etc.
  • These sound super fascinating! Do you have any samples? – Ky - Sep 09 '20 at 16:26
  • Microsoft font development resources seem to have some good examples of script "shaping". They show examples where multiple Unicode characters get assembled into the proper shape for the script. Sorta like turning "ff" into the 'ﬀ' ligature character, but much much more complicated. Indic: https://learn.microsoft.com/en-us/typography/script-development/devanagari Arabic: https://learn.microsoft.com/en-us/typography/script-development/arabic – Rich Taylor Sep 11 '20 at 19:19
  • Reversible characters are ones that can be tricky when the context of rendering them changes between left-to-right and right-to-left script. For example, in a left-to-right script (ex. English), an opening bracket is rendered '['. But in a right-to-left script the opening bracket is rendered ']'. Within a single text line with a mixture of L2R and R2L text you have to keep track of current direction in order to draw the correct glyphs amongst the characters which can be rendered blindly (i.e. without consideration for current direction). – Rich Taylor Sep 11 '20 at 19:22
  • Here's an issue with reversible characters in LibreOffice - including some test text strings: https://ask.libreoffice.org/en/question/18912/bidirectional-text-and-closing-bracket-bug/ – Rich Taylor Sep 11 '20 at 19:39
  • Those are very insightful indeed! I tried to edit this answer to include some, but it really didn't want me doing that - if you ever find a way to make that happen, StackOverflow prefers that, so content isn't lost if links rot away – Ky - Sep 14 '20 at 21:48
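
To find out which characters in a test string are "reversible", you can query the Unicode character database. A rough sketch using Python's standard `unicodedata` module follows; `bidi_report` and the sample string are made up for illustration, and this only inspects per-character properties, it does not run the full bidirectional algorithm:

```python
import unicodedata

def bidi_report(text):
    """Return (char, bidi class, is_mirrored) for each code point."""
    return [
        (ch, unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch)))
        for ch in text
    ]

# Latin letter, parentheses around a Hebrew letter,
# brackets around an Arabic letter (sample chosen for this sketch).
sample = "a(\u05D0)[\u0627]"

for ch, bidi_class, mirrored in bidi_report(sample):
    print(f"U+{ord(ch):04X} class={bidi_class:<3} mirrored={mirrored}")
```

Characters reported as mirrored are the ones whose glyph should flip when they end up in a right-to-left run, so strings that wrap RTL text in brackets or parentheses make good test cases.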
0

There are a lot of good examples in the Big List of Naughty Strings:

https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/blns.txt

I cannot include the whole file, but here are a few lines:

#   Unicode Subscript/Superscript/Accents
#
#   Strings which contain unicode subscripts/superscripts; can cause rendering issues


ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็
#   Two-Byte Characters
#
#   Strings which contain two-byte characters: can cause rendering issues or character-length issues

田中さんにあげて下さい
#   Strings which contain two-byte letters: can cause issues with naïve UTF-16 capitalizers which think that 16 bits == 1 character

  /        
#   Special Unicode Characters Union
#
#   A super string recommended by VMware Inc. Globalization Team: can effectively cause rendering issues or character-length issues to validate product globalization readiness.

表ポあA鷗ŒéB逍Üߪąñ丂㐀
#   Ogham Text
#
#   The only unicode alphabet to use a space which isn't empty but should still act like a space.

᚛ᚄᚓᚐᚋᚒᚄ ᚑᚄᚂᚑᚏᚅ᚜
᚛                 ᚜

#   iOS Vulnerabilities
#
#   Strings which crashed iMessage in various versions of iOS

Powerلُلُصّبُلُلصّبُررً ॣ ॣh ॣ ॣ冗
0️
జ్ఞ‌ా
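
The "character-length issues" mentioned in those comments come from the same string having different lengths depending on what you count. A small Python sketch of that, assuming you want to compare code points, UTF-16 code units, and UTF-8 bytes (the `lengths` helper is hypothetical, not part of the list):

```python
def lengths(text):
    """Count the same string three ways."""
    return {
        "code_points": len(text),
        # UTF-16 code units: 2 bytes each, no byte-order mark with "-le"
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }

# DESERET CAPITAL LETTER LONG I: outside the Basic Multilingual Plane,
# so it needs a surrogate pair in UTF-16.
deseret_capital = "\U00010400"

print(lengths(deseret_capital))  # 1 code point, 2 UTF-16 units, 4 UTF-8 bytes
print(lengths("田中"))            # 2 code points, 2 UTF-16 units, 6 UTF-8 bytes

# Case mapping exists for these letters too, which is what trips up
# capitalizers that assume 16 bits == 1 character.
print(deseret_capital.lower() == "\U00010428")  # True
```

If your layout code measures or slices text by UTF-16 units (common on JavaScript, Java, and Objective-C/Swift bridging paths), strings like these are worth including so a slice never lands in the middle of a surrogate pair.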
Ky -