I want to use imaplib to search for emails whose subjects contain Chinese characters. I got an error like this:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

So I used .encode to encode the search term as 'UTF-8', but then the search found nothing. The printed output is:

0
[]

The right answer should be 71, which is what I get when I search my inbox in my mail client. This is my code:

import imaplib,email
host = 'imap.263.net'
user = '***@***'
psw = '*****'
count = 0
con = imaplib.IMAP4(host,143)
con.login(user,psw)
con.select('INBOX',readonly =True)
eva = '日报'
# eva = eva.encode('utf-8') 
resp,liujf = con.search('UTF-8','SUBJECT','%s'%eva, 'Since','01-Feb-2018')
items = liujf[0].split()
print(len(items))
print(items)

I guess it is a Unicode problem. How can I fix it?

2 Answers

3

You are passing in a raw Unicode string where you should be passing in the string as a sequence of UTF-8 bytes. You've even labelled it as UTF-8! This suggests you might want to read up on the difference.

Change

'%s'%eva

to

eva.encode('utf-8')
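Put together, the corrected call could be wrapped up roughly as below. This is only a sketch: the helper name search_subject and its parameters are made up for illustration, and it assumes the server accepts a UTF-8 search term sent this way, which not every IMAP server does.

```python
import imaplib


def search_subject(con: imaplib.IMAP4, subject: str, since: str):
    """Return (status, list of message ids) for subjects containing `subject`.

    Hypothetical helper: `con` must already be logged in, with a mailbox selected.
    """
    # Encode the non-ASCII term ourselves and pass the bytes straight through,
    # with no '%s' interpolation in between.
    resp, data = con.search('UTF-8', 'SUBJECT', subject.encode('utf-8'),
                            'SINCE', since)
    return resp, data[0].split()


# Usage against a real server (not run here; host and credentials are placeholders):
#   con = imaplib.IMAP4('imap.263.net', 143)
#   con.login(user, psw)
#   con.select('INBOX', readonly=True)
#   resp, items = search_subject(con, '日报', '01-Feb-2018')
#   print(len(items))
```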

For more background, maybe read https://www.unicode.org/faq/utf_bom.html#UTF8 and/or https://nedbatchelder.com/text/unipain.html

The construct '%s'%string is just an ugly and unidiomatic way to say string, but here it is actually an error: '%s'%string.encode('utf-8') produces a byte string and then interpolates it into a Unicode string, which gives completely the wrong result. Observe:

>>> eva = '日报'
>>> eva.encode('utf-8')              # correct
b'\xe6\x97\xa5\xe6\x8a\xa5'
>>> '%s'%eva.encode('utf-8')         # incorrect
"b'\\xe6\\x97\\xa5\\xe6\\x8a\\xa5'"
>>> b'%s'%eva.encode('utf-8')        # correct but terribly fugly
b'\xe6\x97\xa5\xe6\x8a\xa5'

Notice how '%s'%eva.encode('utf-8') takes the encoded byte string and converts it back into a Unicode string holding its printed representation (the b'...' text with literal backslashes), not the original characters. The commented-out line shows that you tried eva = eva.encode('utf-8') but then apparently ended up with the wrong result because of the unnecessary % interpolation into a Unicode string.

tripleee
  • I changed '%s'%eva to eva.encode('utf-8'). The code is now: resp,liujf = con.search('utf-8','SUBJECT',eva.encode('utf-8'), 'Since','01-Feb-2018'). The result is right. Thanks!!! – Carol Chan Feb 26 '18 at 06:24
  • Still `'%s'%something` is a wasteful and inelegant way to write `something`. – tripleee Feb 26 '18 at 06:28
  • What is the difference between: eva = eva.encode('utf-8') resp,liujf = con.search('utf-8','SUBJECT','%s'%eva, and eva.encode('utf-8')? – Carol Chan Feb 26 '18 at 06:32
  • Oh yeah, then `'%s'%eva` is actually wrong because you are converting it back to Unicode (if I understand your question correctly). `b'%s'%eva` would do the right thing and merely be horrendously ugly. See updated answer. – tripleee Feb 26 '18 at 06:34
-2

I think you should first decode and then encode the Chinese literals. If we interpret the string as Latin-1 encoded, then you decode it first and then encode it, e.g. eva.decode('latin-1').encode('utf-8')

Rock
  • It shows an error: eva = eva.decode('latin-1').encode('utf-8') AttributeError: 'str' object has no attribute 'decode' – Carol Chan Feb 26 '18 at 06:14
  • There is no way to encode Unicode into Latin-1 if the Unicode string contains characters which cannot be represented in Latin-1. If you have a byte string then "decoding" it as Latin-1 will convert it into a Unicode string but then you have a bug somewhere else -- Python 3 quite on purpose forces you to know the encoding of your data, or else keep it as bytes. – tripleee Feb 26 '18 at 06:45
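
The two comments above can be checked with a short Python 3 session. This is only an illustrative sketch; the variable names latin1_error and utf8_bytes are made up for the example.

```python
eva = '日报'

# A Python 3 str is already Unicode text, so it has no .decode() method;
# that is where the AttributeError comes from.
print(hasattr(eva, 'decode'))

# Latin-1 has no code points for these characters, so encoding to it fails.
latin1_error = None
try:
    eva.encode('latin-1')
except UnicodeEncodeError as exc:
    latin1_error = exc
    print('latin-1 failed:', exc)

# UTF-8 can represent every Unicode character, so the round trip is lossless.
utf8_bytes = eva.encode('utf-8')
print(utf8_bytes.decode('utf-8') == eva)
```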