
I noticed that when you type emojis into a phone message, some of them count as 1 character and some count as 2. For example, "♊" takes 1 character but "😁" takes 2. In Python, I'm trying to get the length of emojis and I'm getting:

len("♊") # 3
len("😁") # 4
len(unicode("♊", "utf-8")) # 1 OH IT WORKS!
len(unicode("😁", "utf-8")) # 1 Oh wait, no it doesn't.

Any ideas?

This site lists each emoji's length in the Character.charCount() row: http://www.fileformat.info/info/unicode/char/1F601/index.htm

Ted Klein Bergman
  • Related: [How to work with surrogate pairs in Python?](http://stackoverflow.com/a/38147966/3439404). Try something like `import unicodedata; unistr = u'♊😁'; print unistr, repr( unistr), len(unistr); for char in unistr:print len(char), char, repr(char), unicodedata.category(char), unicodedata.name(char,'private use');` – JosefZ Mar 14 '17 at 12:27
  • Thanks for the reply. This is the result of your suggestion: `\u264a\U0001f601 u'\u264a\U0001f601' 2 1 \u264a u'\u264a' So GEMINI 1 \U0001f601 u'\U0001f601' Cn private use` As you can see, it still reads each emoji as 1 character. I did find that Stack Overflow question, but I'm still trying to make surrogates work. – user7707957 Mar 14 '17 at 12:37
  • On my terminal, `\U0001f601` is transformed to a surrogate pair in the `for …` loop as `♊😁 u'\u264a\U0001f601' 3`… `1 ♊ u'\u264a' So GEMINI`… `1 � u'\ud83d' Cs private use`… `1 � u'\ude01' Cs private use` (used **…** instead of a newline) – JosefZ Mar 14 '17 at 12:46
  • I checked your code in Python 2.7 and Python 3.5 and got the same result in both: 2 characters. Interesting that we get different terminal results. – user7707957 Mar 14 '17 at 13:02
  • It's because `import sys;print hex(sys.maxunicode)` returns `'0xffff'` in my `py -2` and `'0x10ffff'` in my `py -3`. Python 3 returns 1 for `len('😁')` (character itself) but Python 2 returns 2 (surrogate pair). – JosefZ Mar 14 '17 at 13:10
  • Please avoid putting answers into questions. Open [help], read [answering](http://stackoverflow.com/help/answering), especially [Can I answer my own question?](http://stackoverflow.com/help/self-answer). Please take the 2-minute [tour] to jog your memory about how StackExchange sites work. – JosefZ Mar 14 '17 at 14:18

1 Answer


See the documentation for sys.maxunicode:

An integer giving the value of the largest Unicode code point, i.e. 1114111 (0x10FFFF in hexadecimal).

Changed in version 3.3: Before PEP 393, sys.maxunicode used to be either 0xFFFF or 0x10FFFF, depending on the configuration option that specified whether Unicode characters were stored as UCS-2 or UCS-4.
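
To see which of these cases applies to a given interpreter, it may help to inspect `sys.maxunicode` directly. This is only a minimal sketch; the helper name `is_narrow_build` is hypothetical and not part of any library.

```python
from __future__ import print_function
import sys


def is_narrow_build():
    """Return True if this interpreter stores text as UTF-16 code units.

    Hypothetical helper for illustration. A narrow build is only possible
    before Python 3.3; later versions always report 0x10FFFF.
    """
    return sys.maxunicode == 0xFFFF


# On a narrow build, len() counts a character above U+FFFF as 2.
print(hex(sys.maxunicode), 'narrow' if is_narrow_build() else 'wide')
```

On a narrow build, `len()` counts UTF-16 code units, so a character outside the Basic Multilingual Plane has length 2; on a wide build (or any Python 3.3+) it has length 1.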

The following script should work in both Python 2 and Python 3:

# coding=utf-8

from __future__ import print_function
import platform
import sys
import unicodedata

# Show which interpreter is running and its largest supported code point.
print(platform.python_version(), 'maxunicode', hex(sys.maxunicode))
tab = '\t'
unistr = u'\u264a \U0001f601'  # same as u'♊ 😁'
print(len(unistr), tab, unistr, tab, repr(unistr))
# Iterating yields code points on wide builds and UTF-16 code units
# (including lone surrogates) on narrow builds.
for char in unistr:
    print(len(char), tab, char, tab, repr(char), tab,
          unicodedata.category(char), tab, unicodedata.name(char, 'private use'))

The output shows the consequence of the different sys.maxunicode values. For instance, the 😁 character (Unicode code point 0x1f601, above the Basic Multilingual Plane) is converted to the corresponding surrogate pair (code units u'\ud83d' and u'\ude01') if sys.maxunicode is 0xFFFF:

PS D:\PShell> [System.Console]::OutputEncoding = [System.Text.Encoding]::UTF8

PS D:\PShell> . py -3 D:\test\Python\Py\42783173.py
3.5.1 maxunicode 0x10ffff
3      ♊ 😁    '♊ 😁'
1      ♊      '♊'      So      GEMINI
1             ' '      Zs      SPACE
1      😁     '😁'      So      GRINNING FACE WITH SMILING EYES

PS D:\PShell> . py -2 D:\test\Python\Py\42783173.py
2.7.12 maxunicode 0xffff
4      ♊ 😁    u'\u264a \U0001f601'
1      ♊      u'\u264a'    So      GEMINI
1             u' '         Zs      SPACE
1      ��     u'\ud83d'    Cs      private use
1      ��     u'\ude01'    Cs      private use

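Note: the output examples above were taken from the Unicode-aware PowerShell ISE console pane. The 'private use' label is only the fallback value passed to unicodedata.name() for code points without a name; the category Cs identifies them as surrogates.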
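
The "1 or 2 characters" count from the question corresponds to the number of UTF-16 code units. One way to get that count regardless of how the interpreter was built is to encode to UTF-16 and halve the byte length. The sketch below assumes the input is a unicode string; the helper names `utf16_units` and `surrogate_pair` are hypothetical, chosen for illustration.

```python
from __future__ import print_function


def utf16_units(text):
    """Count the UTF-16 code units needed for a unicode string."""
    # 'utf-16-le' writes no byte-order mark; each code unit is 2 bytes.
    return len(text.encode('utf-16-le')) // 2


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not above the BMP')
    offset = code_point - 0x10000
    # Top 10 bits go to the high surrogate, bottom 10 bits to the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(utf16_units(u'\u264a'))        # 1
print(utf16_units(u'\U0001f601'))    # 2
high, low = surrogate_pair(0x1F601)
print(hex(high), hex(low))           # 0xd83d 0xde01
```

The surrogate values agree with the u'\ud83d' and u'\ude01' shown in the Python 2 output above.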

JosefZ