5

I am writing an ASN.1 parser in C (using the Ericsson ASN1 specification document). I want to decode the UTF-8 string type, but I can't find information about it online, and the document I'm using does not describe UTF-8 string in detail. Can anybody provide me with some code, or explain how to decode it?

I am new to ASN.1.

Norman Gray
user3148326
  • 2
    http://en.wikipedia.org/wiki/UTF-8 describes how UTF-8 encodes characters and even has sample code in C. – user3386109 Mar 08 '15 at 18:26
  • I presume that this question is about decoding the ASN.1 `UTF8String` sequence into the array of UTF-8 bytes, as opposed to going from those bytes to the Unicode string (that is, the `utf8-decode` tag isn't quite appropriate). Can you confirm this? (and if so, perhaps clarify in the question) – Norman Gray Mar 08 '15 at 23:58
  • Why did I get a -2 rating? What is wrong with the question? – user3148326 Mar 09 '15 at 17:25
  • 1
    @user3148326 I think people are mistaking your request for info on the ASN.1 UTF8String type (very uncommon) for a request for information about UTF-8 string in general (very common and easily googlable.) – Reid Rankin Sep 25 '16 at 02:50

2 Answers

9

If you're trying to parse ASN.1, then an excellent introductory resource is Kaliski's ‘Layman’s Guide’ (available at various places on the web, in HTML and PDF). However, that document doesn't mention the UTF8String type.

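The extra information you need is that UTF8String has universal tag 12 (decimal, or 0c in hex), and that its content is simply the bytes representing the string in the UTF-8 encoding, preceded as usual by the tag and length octets.

Thus the string ‘Helló’, which is six bytes in UTF-8 because ‘ó’ takes two (c3 b3), would be encoded as tag 0c, length 06, then the content bytes: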

0c 06 48 65 6c 6c c3 b3

(I'm presuming, by the way, that ‘Ericsson ASN1 specification document’ discusses the standard ASN.1, and not some variant.)
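
The layout described above (tag, length, content) can be sketched in C as follows. This is only a minimal sketch, not a complete BER decoder: the function name `decode_utf8string` and its signature are made up for illustration, and it assumes a primitive, definite-length encoding (as DER produces). It accepts both the short length form and the long form used when the content exceeds 127 bytes, and it does not check that the content is valid UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

#define ASN1_TAG_UTF8STRING 0x0c  /* universal tag 12, primitive */

/* Hypothetical helper: parse one UTF8String TLV from buf.
 * On success returns 0 and sets *str / *str_len to the content octets,
 * which point into buf and are NOT NUL-terminated.
 * Returns -1 on a wrong tag, an unsupported length, or truncated input. */
static int decode_utf8string(const uint8_t *buf, size_t buf_len,
                             const uint8_t **str, size_t *str_len)
{
    size_t pos = 0;
    size_t len;
    uint8_t first;

    /* Need at least a tag octet and one length octet. */
    if (buf_len < 2 || buf[pos++] != ASN1_TAG_UTF8STRING)
        return -1;

    first = buf[pos++];
    if (first < 0x80) {
        /* Short form: the octet itself is the length (0..127). */
        len = first;
    } else {
        /* Long form: low 7 bits give the number of length octets. */
        size_t n = first & 0x7f;

        /* n == 0 is the indefinite form, which this sketch rejects. */
        if (n == 0 || n > sizeof(size_t) || n > buf_len - pos)
            return -1;

        len = 0;
        while (n--)
            len = (len << 8) | buf[pos++];   /* big-endian length */
    }

    /* The content must fit in what remains of the buffer. */
    if (len > buf_len - pos)
        return -1;

    *str = buf + pos;
    *str_len = len;
    return 0;
}
```

Called on the eight bytes shown above, this returns 0 with `*str` pointing at the `48` byte and `*str_len` equal to 6; those six bytes are already UTF-8 and can be copied out or printed with `printf("%.*s", (int)len, (const char *)str)`.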

Norman Gray
  • 2
    One of the disappointing things about one of the commercial ASN.1 toolsets that I've used is that it doesn't check that the UTF8 string being encoded / decoded is actually valid UTF8. UTF8String is simply treated as yet another OCTET STRING, which can represent any old string of bytes. UTF8 has rules as to what bytes follow what, thus some byte combinations are not valid. It would be good if the ASN.1 compilers added checks for this in the same way that they check any other value or size constraint stated in an ASN.1 schema. This would add another layer of built-in content inspection. – bazza Aug 30 '15 at 06:42
  • (ran out of comment space). The same goes for the other string types, such as IA5String, etc. – bazza Aug 30 '15 at 06:43
  • Note that when the length of the encoded string > 127, the following rules are used to construct the bytes representing that length: https://msdn.microsoft.com/en-us/library/windows/desktop/bb648641(v=vs.85).aspx – Tails Mar 14 '17 at 17:23
  • 0c (meaning UTF8String), then the length (one octet when the length is at most 127), then the actual UTF-8 bytes – k3a Mar 22 '17 at 23:17
-3

A full UTF-8 description, which allows you to write an encoder and a decoder, is summarized in the table on the Wikipedia page:

http://en.wikipedia.org/wiki/UTF-8#Description

hdante
  • 6
    -1 This has almost nothing to do with the question. Anyone can google for the UTF-8 spec. The question is how these strings are handled/encoded in [ASN.1](https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One) – Tersosauros Jun 20 '16 at 13:17