0

I am having trouble calculating the length of strings that contain Thai characters. In the image below from Notepad++, the document length is shown as 170; we need a way in C#/.NET to get that value for the given string:

[Image: Thai String Issue]

`String.Length` works well for English text, but for this example ("บ.อินเตอร์เทค+สเปคเชียวตี้+กลาส+จำกัด+28%2f10+หมู่+1+ต.+คลองอุดมชลจร") it returns 69 instead of 170. Is there a way in C#/.NET to get the actual length of strings in non-English languages?

I tried using encodings as well, but had no luck. Any pointers or help would be appreciated.

Thanks in advance!

Toto
  • 1
    Sounds like you're looking for [`Encoding.GetByteCount()`](https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding.getbytecount) – Mathias R. Jessen Apr 04 '22 at 14:56
  • 2
    See if these links help: https://stackoverflow.com/a/40307732/421195, https://stackoverflow.com/a/50952586/421195. Q: Can you show us the C# code that reads the string? Q: Do you know the encoding type when you read the string? UTF8? UTF16? Other? – paulsm4 Apr 04 '22 at 15:04
  • [See here](https://sharplab.io/#v2:C4LgTgrgdgNAJiA1AHwAICYCMBYAUKgBgAJVMA6AcQBsB7AIwEMqBLALweGZqgG49CS5ACoBTAB7A+ufpmIBnIgF4SqAESAsODKBaOEAscIEw4QAJwgVDgtgYjhAMnAHA6HCAQOESAqOAOBsOBsHAUHCBWOEBEcIHI4Ix8BJOERAQDhAUjhAJjh7REAIOEBmOGDARjhAFDhEdAAOAFJ0ADNZREBqOEBCOEBOOEAJOERMRCMyRBtQrUBwOC1ADjgkwrdQ6NM8VSkZAE4ACgASVQBVIQAxAFo0ojoAT2ARORAiAG8AUSgAYxo4ZigAczIJybTKEWAAISWRAGEaaGBBuQBKAF9VN6lSIdGADJXZZgVYbKAiADuRAAysAwIcjgBJKDZGivN5kIHHYAACxRogkmyoIgAtiIoMA5F8fnggA===) – canton7 Apr 04 '22 at 15:10
  • `Encoding.UTF8.GetByteCount()` solved the problem for me. Thank you, everyone, for the quick responses and your time. I appreciate all your help on this. – Pratik Prakash Apr 04 '22 at 15:26
  • 1
    Code point vs. character vs. byte: learn what each one means, and then it will become obvious – Charlieface Apr 04 '22 at 15:29
  • 1
    You might want to throw "Extended Grapheme Cluster" and "glyph" into the list of stuff to learn – canton7 Apr 04 '22 at 15:30
  • Thank you for the suggestions, Charlieface and canton7. I will look into these. – Pratik Prakash Apr 04 '22 at 15:37

1 Answer

3

69 is correct, though.

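บ.อินเตอร์เทค+สเปคเชียวตี้+กลาส+จำกัด+28%2f10+หมู่+1+ต.+คลองอุดมชลจร contains 69 characters (UTF-16 code units, which is what `String.Length` counts); its UTF-8 encoding is 170 bytes long once the trailing newline is included, or 169 bytes without it.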

Notepad++ is showing you the length of the encoded content.

If you do need the encoded length, use `Encoding.UTF8.GetByteCount()`.
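
As a minimal sketch of the difference, the snippet below compares `String.Length` with `Encoding.UTF8.GetByteCount()` on one short syllable from the question's string (`มู่`) rather than the full string. The variable names are illustrative only, and the extra `"\n"` is an assumption that mimics a file saved with a single trailing line feed.

```csharp
using System;
using System.Text;

// "มู่" written with escapes so the source file's own encoding does not matter:
// U+0E21 (MO MA), U+0E39 (SARA UU), U+0E48 (MAI EK).
string sample = "\u0E21\u0E39\u0E48";

// String.Length counts UTF-16 code units (char values), not bytes.
int charCount = sample.Length;

// GetByteCount reports how many bytes the string occupies once encoded as UTF-8.
// Each Thai code point in this range takes three bytes.
int utf8Bytes = Encoding.UTF8.GetByteCount(sample);

// A trailing line feed in a saved file adds one more byte.
int utf8BytesWithNewline = Encoding.UTF8.GetByteCount(sample + "\n");

Console.WriteLine($"Length: {charCount}");                    // Length: 3
Console.WriteLine($"UTF-8 bytes: {utf8Bytes}");               // UTF-8 bytes: 9
Console.WriteLine($"With newline: {utf8BytesWithNewline}");   // With newline: 10
```

The same two calls applied to the full string are what produce the 69 versus 169/170 figures discussed in this thread.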

AKX
  • 2
    As I select that string character-by-character, I count only 58 characters, not 69. It's likely those diacritics are being represented as combining marks – canton7 Apr 04 '22 at 15:02
  • Also note that the 170 figure includes the trailing newline. The string in your answer is only 169 bytes in UTF-8. – canton7 Apr 04 '22 at 15:06
  • @canton7 Yes, in Notepad (or Notepad++) the column indicator shows the character count, including combining marks. When you move past a character that combines several Unicode diacritics, the column counter advances by more than one. This `มู่` character, for example, counts as three (0x0E21, 0x0E39, 0x0E48). – J... Apr 04 '22 at 15:07
  • `อิ` consists of two chars, for example; it's not just `อ`. – Fruchtzwerg Apr 04 '22 at 15:09
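
The distinction raised in these comments can be illustrated with a short sketch. It assumes .NET Core 3.0 or later (for `EnumerateRunes`) and counts the same syllable, `มู่`, three ways: as UTF-16 code units, as Unicode code points, and as text elements (what a user perceives as one character). The variable names are hypothetical, and this is only one way to do the counting.

```csharp
using System;
using System.Globalization;
using System.Text;

// "มู่": base consonant U+0E21 followed by two combining marks, U+0E39 and U+0E48.
string syllable = "\u0E21\u0E39\u0E48";

// UTF-16 code units: what String.Length reports.
int codeUnits = syllable.Length;

// Unicode code points: iterate the string as Rune values.
int codePoints = 0;
foreach (Rune rune in syllable.EnumerateRunes())
{
    codePoints++;
}

// Text elements: a base character plus its combining marks counts once.
int textElements = new StringInfo(syllable).LengthInTextElements;

Console.WriteLine($"Code units: {codeUnits}");        // Code units: 3
Console.WriteLine($"Code points: {codePoints}");      // Code points: 3
Console.WriteLine($"Text elements: {textElements}");  // Text elements: 1
```

Selecting text character by character in an editor or browser roughly follows the text-element count, which is why a manual count of the full string comes out lower than `String.Length`.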