How to get the length of a Thai string in C#?

Question

I am having trouble calculating the length of strings that contain Thai characters. The image below is from Notepad++; I need a way in C#/.NET to get the same Document Length value (170) for this string:

[Image: Thai String Issue]

`String.Length` works for English text, but for this example ("บ.อินเตอร์เทค+สเปคเชียวตี้+กลาส+จำกัด+28%2f10+หมู่+1+ต.+คลองอุดมชลจร") it returns 69 instead of 170. Is there a way in C#/.NET to get the actual length of strings in non-English languages?

I also tried using encodings, with no luck. Any pointers would be appreciated.

Thanks in advance!

Sounds like you're looking for [`Encoding.GetByteCount()`](https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding.getbytecount) — Mathias R. Jessen, Apr 04 '22 at 14:56
See if these links help: https://stackoverflow.com/a/40307732/421195, https://stackoverflow.com/a/50952586/421195. Q: Can you show us the C# code that reads the string? Q: Do you know the encoding type when you read the string? UTF8? UTF16? Other? — paulsm4, Apr 04 '22 at 15:04
[See here](https://sharplab.io/#v2:C4LgTgrgdgNAJiA1AHwAICYCMBYAUKgBgAJVMA6AcQBsB7AIwEMqBLALweGZqgG49CS5ACoBTAB7A+ufpmIBnIgF4SqAESAsODKBaOEAscIEw4QAJwgVDgtgYjhAMnAHA6HCAQOESAqOAOBsOBsHAUHCBWOEBEcIHI4Ix8BJOERAQDhAUjhAJjh7REAIOEBmOGDARjhAFDhEdAAOAFJ0ADNZREBqOEBCOEBOOEAJOERMRCMyRBtQrUBwOC1ADjgkwrdQ6NM8VSkZAE4ACgASVQBVIQAxAFo0ojoAT2ARORAiAG8AUSgAYxo4ZigAczIJybTKEWAAISWRAGEaaGBBuQBKAF9VN6lSIdGADJXZZgVYbKAiADuRAAysAwIcjgBJKDZGivN5kIHHYAACxRogkmyoIgAtiIoMA5F8fnggA===) — canton7, Apr 04 '22 at 15:10
Encoding.UTF8.GetByteCount() solves the purpose for me. Thank you everyone for the quick responses & time. Appreciate all your help on this. — Pratik Prakash, Apr 04 '22 at 15:26
Code-point vs character vs byte: learn up what each means, then it will become obvious — Charlieface, Apr 04 '22 at 15:29
You might want to throw "Extended Grapheme Cluster" and "glyph" into the list of stuff to learn — canton7, Apr 04 '22 at 15:30
thank you for suggestions Charlieface and canton7, I will check on these. — Pratik Prakash, Apr 04 '22 at 15:37

score 3 · Accepted Answer · answered Apr 04 '22 at 14:56

69 is correct, though.

บ.อินเตอร์เทค+สเปคเชียวตี้+กลาส+จำกัด+28%2f10+หมู่+1+ต.+คลองอุดมชลจร contains 69 characters (UTF-16 code units, which is what `String.Length` counts); the UTF-8 encoding of it is 170 bytes long.

Notepad++ is showing you the length of the encoded content.

If you do need the encoded length, use `Encoding.UTF8.GetByteCount()`.
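
As a minimal sketch of the difference between the two numbers, the snippet below compares `String.Length` with the encoded byte count on a short Thai string. The class name `ThaiLengthDemo` and the sample string are illustrative choices, not taken from the question's code.

```csharp
using System;
using System.Text;

public static class ThaiLengthDemo
{
    public static void Main()
    {
        // "มู่" written with escapes: U+0E21, U+0E39, U+0E48.
        string s = "\u0E21\u0E39\u0E48";

        // String.Length counts UTF-16 code units (char values).
        Console.WriteLine(s.Length);                          // 3

        // Each of these Thai code points takes 3 bytes in UTF-8.
        Console.WriteLine(Encoding.UTF8.GetByteCount(s));     // 9

        // In UTF-16 (Encoding.Unicode) each one takes 2 bytes.
        Console.WriteLine(Encoding.Unicode.GetByteCount(s));  // 6
    }
}
```

ASCII characters such as `+` or `%` take one byte each in UTF-8, so a mixed string like the one in the question ends up with a byte count somewhere between one and three times its `Length`.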

AKX

As I select that string character-by-character, I count only 58 characters, not 69. It's likely those diacritics are being represented as combining marks – canton7 Apr 04 '22 at 15:02
Also note that the 170 figure includes the trailing newline. The string in your answer is only 169 bytes in UTF-8. – canton7 Apr 04 '22 at 15:06
@canton7 Yes, in notepad (or notepad++) the column indicator shows the character count, including combining marks. When you move past a character that combines several unicode diacritics the column counter advances by more than one character. This `มู่` character, for example, counts as three (0x0E21, 0x0E39, 0x0E48). – J... Apr 04 '22 at 15:07
`อิ` consists of two chars, for example; it's not just `อ`. – Fruchtzwerg Apr 04 '22 at 15:09
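
The distinction raised in these comments can be sketched with `System.Globalization.StringInfo`, which counts text elements (roughly, what a user selects as one character) rather than `char` values. The class `TextElementDemo` and the helper `CountTextElements` are hypothetical names used only for this illustration; text-element segmentation can differ between .NET versions for some sequences, so treat the counts as applying to these two simple cases.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    // Number of text elements (base character plus its combining marks) in s.
    public static int CountTextElements(string s) =>
        new StringInfo(s).LengthInTextElements;

    public static void Main()
    {
        string mu = "\u0E21\u0E39\u0E48"; // มู่
        string o = "\u0E2D\u0E34";        // อิ

        // List the individual code units that String.Length counts.
        foreach (char c in mu)
            Console.Write($"U+{(int)c:X4} ");      // U+0E21 U+0E39 U+0E48
        Console.WriteLine();

        Console.WriteLine(mu.Length);              // 3
        Console.WriteLine(CountTextElements(mu));  // 1
        Console.WriteLine(o.Length);               // 2
        Console.WriteLine(CountTextElements(o));   // 1
    }
}
```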