This is from the Twitter docs: https://developer.twitter.com/en/docs/basics/counting-characters.html

"Twitter counts the length of a Tweet using the Normalization Form C (NFC) version of the text ... Twitter also counts the number of codepoints in the text rather than UTF-8 bytes."

This works for Western languages. But when I apply FormC normalization to the following text, for example:

(I tried to post an example in Korean, but Stack Overflow treats it as spam and won't let me post it)

I get a value of 160. In Twitter's web client, this is the longest message allowed: adding even one more character goes over the limit.

Applying FormD to the same text gives a value over 300.
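As a rough illustration of why the FormD number is so much larger: under FormD each precomposed Hangul syllable is decomposed into its two or three conjoining jamo, so the count roughly doubles or triples. The sketch below is only a minimal demonstration; `NormalizationDemo` and `LengthAfter` are hypothetical names, and it relies on Hangul living in the Basic Multilingual Plane, so `string.Length` equals the code point count here.

```csharp
using System;
using System.Text;

static class NormalizationDemo
{
    // Returns the number of UTF-16 code units after normalizing to the given form.
    // Assumption: the input contains only BMP characters (true for Hangul),
    // so this is also the number of code points.
    public static int LengthAfter(string text, NormalizationForm form)
    {
        return text.Normalize(form).Length;
    }
}
```

For example, `NormalizationDemo.LengthAfter("한", NormalizationForm.FormC)` gives 1 and the same call with `NormalizationForm.FormD` gives 3, while "이" (no final consonant) goes from 1 to 2.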

Since Twitter's limit is either 140 or 280, I really don't understand how Twitter determines that message's character count.

So how in the world can I figure out the actual length of a tweet written in a non-Western language?

The code to normalize, in C#:

    private static int GetCodepointLength(string inp)
    {
        var info = new StringInfo(inp.Normalize(NormalizationForm.FormC));
        return info.LengthInTextElements;
    }
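One caveat about the snippet above: `StringInfo.LengthInTextElements` counts text elements (a base character plus any combining marks), not code points, so it can come out lower than the code point count the quoted doc describes. For precomposed Hangul in FormC the two agree, so this probably does not explain the 160. A minimal sketch of counting code points directly, for comparison; `CodePointCounter` and its method names are hypothetical:

```csharp
using System;
using System.Globalization;
using System.Text;

static class CodePointCounter
{
    // Counts Unicode code points in the NFC form of the input.
    // A surrogate pair occupies two UTF-16 chars but is one code point.
    public static int GetCodePointLength(string input)
    {
        string nfc = input.Normalize(NormalizationForm.FormC);
        int count = 0;
        for (int i = 0; i < nfc.Length; i += char.IsSurrogatePair(nfc, i) ? 2 : 1)
        {
            count++;
        }
        return count;
    }

    // Same measurement as the original snippet: text elements, not code points.
    public static int GetTextElementLength(string input)
    {
        var info = new StringInfo(input.Normalize(NormalizationForm.FormC));
        return info.LengthInTextElements;
    }
}
```

For "q\u0301" (a letter plus a combining accent that has no precomposed form), `GetCodePointLength` returns 2 while `GetTextElementLength` returns 1; for "한" both return 1.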
MikeMedved
  • The example string was: 이것은 단지 시험 일 뿐이라는 테스트입니다.이것이 진짜 메시지 였으면 여기에 물질의 어떤 것이 보일 것입니다이것은 단지 시험 일 뿐이라는 테스트입니다.이것이 진짜 메시지 였으면 여기에 물질의 어떤 것이 보일 것입니다이것은 단지 시험 일 뿐이라는 테스트입니다.이것이 진짜 메시지 였으면 여 – MikeMedved Oct 14 '18 at 20:22
