I want/need a test case for testing/breaking conversions between UTF-32 and UTF-16.

For UTF-8 and UTF-16, I generally use the 'Chinese Bone' test (code point 0x9AA8): 0xE9 0xAA 0xA8 (UTF-8) and 0x9AA8 (UTF-16).
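
As a sanity check on those byte values, here is a minimal Python 3 sketch that prints the big-endian encoded forms of the character. The helper name `encoded_forms` is made up for illustration, and it relies on Python's built-in codecs rather than on any library under test.

```python
def encoded_forms(ch):
    """Return the big-endian encoded forms of one character as hex strings."""
    return {
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16-be": ch.encode("utf-16-be").hex(),
        "utf-32-be": ch.encode("utf-32-be").hex(),
    }

bone = "\u9AA8"  # the 'Chinese Bone' character
print(encoded_forms(bone))
# {'utf-8': 'e9aaa8', 'utf-16-be': '9aa8', 'utf-32-be': '00009aa8'}
```

Because this character sits below 0x10000, it needs only one UTF-16 unit, so it never exercises surrogate handling.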

Does anyone have a negative test case that should break a poorly written implementation for UTF-16 and UTF-32? Ideally, the test will require use of at least two UTF-32 values.

Jeff

jww
  • What do you mean by a "negative test case"? –  Mar 24 '13 at 07:00
  • Something that is meant to test for failure, not success. – jww Mar 24 '13 at 07:29
  • By UCS32 and UCS16, I assume you mean UCS4 and UCS2 (UCS32 and UCS16 don't exist). Neither UCS4 nor UCS2 can fail in any way related to surrogates, because neither one makes use of surrogates. Surrogates are used exclusively by UTF16. Also, can you explain further how exactly the 'Chinese Bone' can break a poorly written implementation? I would have thought the conversion of this character between UTF16 and UTF8 was straightforward... Finally, the `U+` notation is used for Unicode code points as abstract integers (not code units in UTF8 or UTF16 or anything else). You should not use it for UTF16. – Celada Mar 24 '13 at 12:13
  • Regardless of the question, I would sincerely recommend against using either UTF16 or UCS-4 unless you are doing some edge-case optimization. See http://utf8everywhere.org – Pavel Radzivilovsky Mar 24 '13 at 14:36
  • Thanks Pavel. I know Java and Microsoft use UTF-16, so I'm interested in trying to test implementations for those platforms. – jww Mar 25 '13 at 02:23
  • @Celada - "Also, can you explain further how exactly the 'Chinese Bone' can break a poorly written implementation?" That is how it is supposed to work in real life. It's not how it always works in real life. Hence the reason I want the test cases. – jww Mar 25 '13 at 02:27
  • @R. Martino - "Failure of what?" Failure of the implementation. Surely you don't accept an author's word that "everything is OK" without doing a minimal amount of testing? – jww Mar 25 '13 at 02:30

1 Answer

Not sure what you mean, here are some:

UTF-16

  • Lead surrogate followed by a regular unit or by another lead surrogate: \xD8\x00\x00\x00 or \xD8\x00\xDB\xFF
  • Trail surrogate with no lead surrogate before it: \x00\x61\xDC\x00
  • Trail surrogate in lead position: \xDF\xFF\xDB\xFF
  • Lead surrogate as the last unit: \xD8\x01<EOF>
  • Lead surrogate followed by only half of a trail surrogate at the end of input. This bug exists in Python 2.7.3: '\xD8\x00\xDC'.decode('utf-16be')
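
The UTF-16 cases above can be collected into a small negative-test harness. This is only a sketch in Python 3: the names `ILL_FORMED_UTF16BE`, `rejects`, `strict_utf16be` and `accepted_cases` are hypothetical, and Python's built-in `utf-16-be` codec stands in for whatever implementation you actually want to test. All inputs are treated as big-endian byte strings.

```python
# Ill-formed UTF-16BE byte sequences taken from the list above.
ILL_FORMED_UTF16BE = {
    "lead + regular unit": b"\xD8\x00\x00\x00",
    "lead + lead": b"\xD8\x00\xDB\xFF",
    "trail without lead": b"\x00\x61\xDC\x00",
    "trail in lead position": b"\xDF\xFF\xDB\xFF",
    "lead as last unit": b"\xD8\x01",
    "lead + half a trail": b"\xD8\x00\xDC",
}

def rejects(decode, data):
    """Return True if decode(data) raises a UnicodeError."""
    try:
        decode(data)
    except UnicodeError:
        return True
    return False

def strict_utf16be(data):
    # Stand-in for the implementation under test (an assumption of this sketch).
    return data.decode("utf-16-be", errors="strict")

def accepted_cases(decode):
    """Names of the ill-formed inputs that the decoder wrongly accepts."""
    return [name for name, data in ILL_FORMED_UTF16BE.items()
            if not rejects(decode, data)]

print(accepted_cases(strict_utf16be))
```

To test another library, pass its decode function to `accepted_cases`; any name in the returned list is an ill-formed input that the decoder let through. A well-formed surrogate pair such as `b"\xD8\x3D\xDE\x00"` makes a useful positive control alongside these.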

UTF-32

  • A unit is invalid if value < 0, value > 0x10FFFF, or 0xD800 <= value && value <= 0xDFFF (the surrogate range)
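
The UTF-32 check above might be sketched as follows in Python 3. The function names are hypothetical, and the input is assumed to be big-endian with each unit read as a signed 32-bit integer, so that the `value < 0` branch can actually trigger.

```python
import struct

def is_valid_utf32_unit(value):
    """True if value is in range 0..0x10FFFF and is not a surrogate."""
    return 0 <= value <= 0x10FFFF and not (0xD800 <= value <= 0xDFFF)

def invalid_utf32be_units(data):
    """Return the invalid units found in big-endian UTF-32 data."""
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data length must be a multiple of 4")
    count = len(data) // 4
    # ">%di" unpacks `count` signed 32-bit big-endian integers.
    units = struct.unpack(">%di" % count, data)
    return [u for u in units if not is_valid_utf32_unit(u)]

sample = (b"\x00\x00\x9A\xA8"    # valid
          b"\x00\x00\xD8\x00"    # surrogate
          b"\x00\x11\x00\x00"    # above 0x10FFFF
          b"\xFF\xFF\xFF\xFF")   # negative when read as signed
print([hex(u) for u in invalid_utf32be_units(sample)])
# ['0xd800', '0x110000', '-0x1']
```

A UTF-32 to UTF-16 converter that accepts any of the three bad units in `sample` is a candidate for the kind of failure the question is after.
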
Esailija
  • Thanks Esailija. "Not sure what you mean" - most folks get the encoding of a simple 'a' correct. I'm trying to develop test cases to break libraries that can only get the simple 'a' correct. – jww Mar 25 '13 at 02:18