How do I get a UCS code of a 1-byte letter of UTF-8 in C++?

Question

I need to check whether a letter (English or Russian) is alphabetic. The file is assumed to be encoded in UTF-8 by default. I found out that the best approach is to work with UCS codes. The UCS code of a letter encoded in 2 bytes can be calculated like this:

#include <stdio.h>
#include <stdlib.h>

char utf8len[256] = { 
  // len = utf8len[c] & 0x7  cont = utf8len[c] & 0x8 
  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, // 0  - 15
  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, // 16 - 31
  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, // 32 - 47
  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, // 48 - 63
  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, // 64 - 79
  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, // 80 - 95
  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, // 96 - 111
  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, // 112 - 127

  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8, // 80 - 8f
  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8, // 90 - 9f
  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8, // a0 - af
  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8, // b0 - bf

  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2, // c0 - cf
  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2, // d0 - df

  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3, // e0 - ef

  4,  4,  4,  4,  4,  4,  4,  4,  // f0 - f7

  5,  5,  5,  5,  // f8, f9, fa, fb

  6,  6,  // fc, fd

  0,  0   // fe, ff 
};

#define UTF8LEN(c) (utf8len[(unsigned char)(c)] & 0x7)
#define UTF8CONT(c) (utf8len[(unsigned char)(c)] & 0x8)

int main (void)
{
  const char *s = "Б№1АГД"; // string literal which contains Cyrillic symbols

  while (*s) {
    unsigned int ucode;

    printf ("[%s] %d\n", s, UTF8LEN(*s));
    // a 2-byte lead byte followed by a continuation byte
    if ((UTF8LEN(*s) == 2) && UTF8CONT(s[1])) {
      ucode = ((*s & 0x1f) << 6) | (s[1] & 0x3f); //! HERE I GET UCS CODE 
      printf ("ucode = 0x%x\n", ucode);
      s++; // skip the continuation byte
    }
    s++;
  }

  return 0;
}

This is only half of the solution I'm looking for. The code lets me work with Cyrillic symbols only (since they are encoded with 2 bytes in UTF-8). The problem is that I need to work with the Latin alphabet as well. So what should I do to get the UCS code of a 1-byte symbol (in my case, one with UTF8LEN(c) = 1)?

Upd: Probably, the solution is:

ucode = *s

Will this work?
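For a 1-byte sequence (a byte below 0x80) the byte value is itself the code point, so the assignment above is enough in that case. Below is a minimal sketch of how the 1-byte and 2-byte cases could be combined with an alphabetic check. The function names `decode_1_or_2`, `is_alpha_en_ru` and `count_letters` are hypothetical, and the definition of "alphabetic" (A-Z, a-z, U+0410 to U+044F, plus Ё and ё) is an assumption, not something the assignment states.

```c
/* Decode one code point from s, handling only 1- and 2-byte UTF-8
   sequences (enough for ASCII and the basic Cyrillic letters).
   *len receives the number of bytes consumed.
   Returns -1 for any other lead byte or a bad continuation byte. */
static long decode_1_or_2(const char *s, int *len)
{
    unsigned char c0 = (unsigned char)s[0];

    if (c0 < 0x80) {                    /* 0xxxxxxx: code point == byte value */
        *len = 1;
        return c0;
    }
    if (c0 >= 0xC2 && c0 <= 0xDF) {     /* 110xxxxx 10xxxxxx */
        unsigned char c1 = (unsigned char)s[1];
        if ((c1 & 0xC0) == 0x80) {
            *len = 2;
            return ((long)(c0 & 0x1F) << 6) | (c1 & 0x3F);
        }
    }
    *len = 1;                           /* not handled here: skip one byte */
    return -1;
}

/* Assumed definition of "alphabetic": A-Z, a-z, U+0410..U+044F, Ё, ё. */
static int is_alpha_en_ru(long cp)
{
    return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z') ||
           (cp >= 0x0410 && cp <= 0x044F) ||
           cp == 0x0401 || cp == 0x0451;
}

/* Count the alphabetic characters in a NUL-terminated UTF-8 string. */
static int count_letters(const char *s)
{
    int n = 0;

    while (*s) {
        int len;
        long cp = decode_1_or_2(s, &len);

        if (cp >= 0 && is_alpha_en_ru(cp))
            n++;
        s += len;
    }
    return n;
}
```

Reading each byte through `unsigned char` avoids surprises on platforms where plain `char` is signed. Bytes that belong to 3- or 4-byte sequences are simply skipped one at a time in this sketch.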

This sounds like a situation that could end up being broken in quite subtle ways; unicode is not really a format where you can just peek at any given byte and say "yep, that's alphabetic". Have you considered using a unicode library like ICU? — Rook, May 19 '14 at 18:32
@Rook I have to work with UTF-8 without using any special libraries for that. The point is that I'm a student, and this is part of my assignment... So I need to work with UCS codes. I don't know another way to find out what symbol I'm dealing with (whether it is alphabetic, a digit or a special sign...). I updated my question, adding a possible solution. What do you think, is it correct? — lidia, May 19 '14 at 18:38
This is a homework assignment, on something as complex as UTF8, and they said "no libraries"? Wow. Were you given a definition of what "alphabetic" should be in this exercise? — Rook, May 19 '14 at 18:43
I don't know why the 80-bf range is `8`, but f8-ff are 4-6 and 0. None of those are valid lead bytes. Also `UTF8CONT` is just wrong. — Mooing Duck, May 19 '14 at 18:43
You don't need a table to convert UTF-8 to UCS-2 (UCS-4). [Here](http://www.lemoda.net/c/utf8-to-ucs2/index.html) is a C routine to convert UTF-8 to UCS-2. Then you create a table describing the ranges of alphabetical characters, and look up that table to get the character type. Your solution is very confusing. — alexander, May 19 '14 at 18:55
@alexander, thanks for the code you provided. It's really helpful. But could you explain what I should pass as the second parameter (** end_ptr)? I read the file with a plain char *, so I don't know how to use this function properly. The idea behind it is clear to me. As for the second part of your comment: I've read that a lookup table is faster than the approach shown in your link. The point is, I'm working with only two languages, English and Russian, so I don't need symbols with codes greater than 256. — lidia, May 19 '14 at 19:05
@MooingDuck I don't know about this table either. I found this solution on the web, and it works properly... — lidia, May 19 '14 at 19:07
@lidia: Russian doesn't have any codepoints less than 256, does it? Cyrillic uses Unicode codepoints U+0400 to U+052F. If you think it's less than 256, you might be dealing with code page 866 instead of Unicode. — Mooing Duck, May 19 '14 at 19:10
@MooingDuck, you're right, what I wrote was wrong. I'm working with Unicode codepoints U+0410 to U+044F. I suppose the answer about the table lies in how the length is calculated in `#define UTF8LEN(c) (utf8len[(unsigned char)(c)] & 0x7)`, but I'm not sure. To be honest, I agree with everything you said. — lidia, May 19 '14 at 19:16
@lidia: With Alexander's code, you give it a pointer to the beginning, and a pointer to another pointer. The function will return the Unicode code point value for the first character, be it 1 byte or four, in any language, and sets the second pointer to point at where it finished. That makes it easy and fast to iterate over the string. — Mooing Duck, May 19 '14 at 19:16
Oh, for valid UTF8, your code for `UTF8LEN` works fine. I only thought it was odd that it results in strange values when given _invalid_ UTF8. — Mooing Duck, May 19 '14 at 19:21
@MooingDuck And what if I change this function by deleting the second argument? I don't understand where I should get the second pointer from. I'm reading a text file, and I have a pointer to the beginning. I don't need the second one. Will it work without it? Sorry if these are stupid questions; I'm a beginner. — lidia, May 19 '14 at 19:24
@lidia, you read the file into a char[] buffer. Then you iterate over the buffer with utf8_to_ucs2 (like this: `utf8_to_ucs2(curr, &next)`). A UTF-8 character takes 1 to 4 bytes. utf8_to_ucs2 advances 'input' by the length of the UTF-8 character and returns a pointer to the next character in the buffer. You should pass that pointer to the next call to utf8_to_ucs2. — alexander, May 19 '14 at 19:34
@alexander Does your function work with 1-, 2- and 3-byte encoded chars only? — lidia, May 19 '14 at 19:36
Look at [that](http://en.wikipedia.org/wiki/UTF-8#Description) table. Only UTF-8 sequences of up to 3 bytes can be mapped to UCS-2. The table also gives a good picture of the background of utf8_to_ucs2. — alexander, May 19 '14 at 19:43
@lidia: 4-byte chars are new, and almost never used, and cannot be stored in a UCS2 character. 5 and 6 byte characters were later deemed to be invalid. — Mooing Duck, May 19 '14 at 19:44
4-byte UTF-8 sequences cannot be stored in UCS-2, but they can be stored in UTF-16. Any 1-byte to 4-byte UTF-8 sequence is decodable to both UTF-16 and UTF-32, and any UTF-16 sequence is decodable to UTF-32. The UTF-32 values are ultimately what you need since those are the same values found in the Unicode codepoint charts. — Remy Lebeau, May 19 '14 at 23:22
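
The "pointer to the beginning plus pointer to a pointer" interface discussed in the comments can be sketched roughly as follows. This is not the linked lemoda.net routine; `utf8_next` and `count_code_points` are hypothetical names, and the sketch decodes 1- to 4-byte sequences to a UTF-32 value. As a simplifying assumption it does not reject overlong 3- and 4-byte forms, surrogates, or values above U+10FFFF.

```c
/* Decode the code point that starts at s and store in *end a pointer to
   the byte where the next character begins.
   Returns the code point, or -1 if the bytes at s are not a well-formed
   lead byte followed by the expected continuation bytes. */
static long utf8_next(const char *s, const char **end)
{
    const unsigned char *p = (const unsigned char *)s;
    long cp;
    int extra, i;

    if (p[0] < 0x80) {                          /* 1 byte  */
        cp = p[0];
        extra = 0;
    } else if (p[0] >= 0xC2 && p[0] <= 0xDF) {  /* 2 bytes */
        cp = p[0] & 0x1F;
        extra = 1;
    } else if ((p[0] & 0xF0) == 0xE0) {         /* 3 bytes */
        cp = p[0] & 0x0F;
        extra = 2;
    } else if (p[0] >= 0xF0 && p[0] <= 0xF4) {  /* 4 bytes */
        cp = p[0] & 0x07;
        extra = 3;
    } else {                                    /* stray or invalid byte */
        *end = s + 1;
        return -1;
    }

    for (i = 1; i <= extra; i++) {
        /* The terminating NUL is not a continuation byte, so a truncated
           sequence stops here without reading past the end of the string. */
        if ((p[i] & 0xC0) != 0x80) {
            *end = s + i;
            return -1;
        }
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    *end = s + extra + 1;
    return cp;
}

/* Iterate over a NUL-terminated buffer, counting decodable code points. */
static int count_code_points(const char *s)
{
    int n = 0;

    while (*s) {
        const char *next;

        if (utf8_next(s, &next) >= 0)
            n++;
        s = next;   /* the "second pointer" becomes the next start */
    }
    return n;
}
```

The caller does not need to obtain the second pointer from anywhere: it declares a local `const char *next`, passes its address, and uses the value written there as the starting point of the next call.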
