
I am trying to search for emoticons in python strings. So I have, for example,

em_test = ['\U0001f680']
print(em_test)
['🚀']
test = 'This is a test string 🚀'
if any(x in test for x in em_test):
    print ("yes, the emoticon is there")
else: 
    print ("no, the emoticon is not there")

yes, the emoticon is there

and if I search for em_test in

'This is a test string 🚀'

I can actually find it.

So I have made a csv file with all the emoticons I want defined by their unicode. The CSV looks like this:

\U0001F600

\U0001F601

\U0001F602

\U0001F923

and when I import it and print it I actually do not get the emoticons but rather just the text representation:

['\\U0001F600',
 '\\U0001F601',
 '\\U0001F602',
 '\\U0001F923',
...
]

and hence I cannot use this to search for these emoticons in another string... I know that the double backslash \\ is only the representation of a single backslash, but somehow the decoding step never happens... I do not know what I'm missing.

Any suggestions?

Bullzeye

2 Answers


You can decode those Unicode escape sequences with .decode('unicode-escape'). However, .decode is a bytes method, so if those sequences are text rather than bytes you first need to encode them into bytes. Alternatively, you can (probably) open your CSV file in binary mode in order to read those sequences as bytes rather than as text strings.

Just for fun, I'll also use unicodedata to get the names of those emojis.

import unicodedata as ud

emojis = [
    '\\U0001F600',
    '\\U0001F601',
    '\\U0001F602',
    '\\U0001F923',
]

for u in emojis:
    s = u.encode('ASCII').decode('unicode-escape')
    print(u, ud.name(s), s)

output

\U0001F600 GRINNING FACE 😀
\U0001F601 GRINNING FACE WITH SMILING EYES 😁
\U0001F602 FACE WITH TEARS OF JOY 😂
\U0001F923 ROLLING ON THE FLOOR LAUGHING 🤣

This should be much faster than using ast.literal_eval. And if you read the data in binary mode it will be even faster since it avoids the initial decoding step while reading the file, as well as allowing you to eliminate the .encode('ASCII') call.

You can make the decoding a little more robust by using

u.encode('Latin1').decode('unicode-escape')

but that shouldn't be necessary for your emoji data. And as I said earlier, it would be even better if you open the file in binary mode to avoid the need to encode it.
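The binary-mode suggestion can be sketched like this; the filename 'emoticons.csv' is hypothetical, and the sketch writes a small sample file first so it is self-contained:

```python
# Write a small sample file so the sketch is self-contained.
# In the bytes literal, '\\U' is a literal backslash followed by 'U',
# matching the escape sequences stored in the OP's CSV.
with open('emoticons.csv', 'wb') as f:
    f.write(b'\\U0001F600\n\\U0001F601\n\\U0001F602\n\\U0001F923\n')

# Opening in binary mode means each line is already bytes, so
# .decode('unicode-escape') applies directly -- no .encode('ASCII') step.
with open('emoticons.csv', 'rb') as f:
    emojis = [line.strip().decode('unicode-escape') for line in f]

print(emojis)  # ['😀', '😁', '😂', '🤣']
```

The resulting list contains the actual emoji characters, so the OP's original `any(x in test for x in em_test)` check works on it unchanged.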

PM 2Ring
  • This is great and works. I only have an issue with some specific emoticons: '\U00023EB' -> get an error of SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-8: truncated \UXXXXXXXX escape – Bullzeye Nov 13 '17 at 12:56
  • @Bullzeye Python considers `'\U00023EB'` to be invalid: "big U" Unicode escapes **must** contain 8 hex digits. We could handle that in my code, but it's probably better to fix it in the code that builds the CSV. – PM 2Ring Nov 13 '17 at 12:58
  • Yes, just wondering where then the following code U+23EB taken from the site came from https://unicode.org/emoji/charts/full-emoji-list.html – Bullzeye Nov 13 '17 at 12:59
  • My bad, just seems to need one more 0 instead of the + sign so that it adds up to 8. Sorry – Bullzeye Nov 13 '17 at 13:04
  • @Bullzeye Alternatively, for smaller code points that fit in 4 hex digits you can use the "small u" escape code: `'\u23eb'`. 'unicode-escape' can handle that. It also handles stuff like `'\\x41'` And of course it also handles plain ASCII text. – PM 2Ring Nov 13 '17 at 13:06
  • yes ast is bloated so what ? :) very good answer. I like the decode/encode trick. – Jean-François Fabre Nov 13 '17 at 14:50
  • as long as you don't have to decode Portugese that will do fine :) – Jean-François Fabre Nov 13 '17 at 14:52
  • @Jean-FrançoisFabre Fair point. I'll add some more info to my answer. – PM 2Ring Nov 13 '17 at 14:52
  • @PM 2Ring - your answer totally suffices. Thank you! – Bullzeye Nov 13 '17 at 16:02
  • @PM2Ring that was a joke / reference to the post in portugese/espanol from the other day. But if you improve your answer with that then ok! – Jean-François Fabre Nov 13 '17 at 16:07
  • @Jean-FrançoisFabre Ah, right! I'd forgotten about that little incident. I'm usually pretty good at telling Romance languages apart, but it was late, and I was in a hurry. – PM 2Ring Nov 13 '17 at 16:18

1. keeping your csv as-is:

it's a bloated solution, but using ast.literal_eval works:

import ast

s = '\\U0001F600'

x = ast.literal_eval('"{}"'.format(s))
print(hex(ord(x)))
print(x)

I get 0x1f600 (which is the correct char code) and the emoticon character (😀). (I had to copy/paste the char from my console into this answer's text field, but that's a console issue on my end; otherwise it works.)

just surround with quotes to allow ast to take the input as string.

2. using character codes directly

maybe you'd be better off by storing the character codes themselves instead of the \U format:

print(chr(0x1F600))

does exactly the same (so ast is slightly overkill)

your csv could contain:

0x1F600
0x1F601
0x1F602
0x1F923

then chr(int(row[0], 16)) would do the trick when reading it. For example, with one code per row in the CSV:

import csv

with open("codes.csv") as f:
    cr = csv.reader(f)
    codes = [int(row[0], 16) for row in cr]
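To round this off, here is a self-contained sketch of how those codes could then be turned into characters and used for the OP's search; it writes a sample 'codes.csv' first (matching the filename used above), and the test string is an assumption:

```python
import csv

# Write a sample codes file so the sketch runs on its own.
with open('codes.csv', 'w', newline='') as f:
    f.write('0x1F600\n0x1F601\n0x1F602\n0x1F923\n')

# int(..., 16) accepts the '0x' prefix, and chr() turns the
# code point into the actual emoji character.
with open('codes.csv') as f:
    cr = csv.reader(f)
    emojis = [chr(int(row[0], 16)) for row in cr]

test = 'This is a test string \U0001F923'
found = [e for e in emojis if e in test]
print(found)  # ['🤣']
```

Since `emojis` is now a list of real characters, the membership test from the question works directly on it.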
Jean-François Fabre
  • ok - the print works great! Can I kindly ask you to elaborate on how to read the csv - do not get the chr(int(row[0],16)) part - how is this integrated for example in pos_emo_twitter = pandas.read_csv('list pos emoticons.csv') – Bullzeye Nov 13 '17 at 12:57