#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import chardet
s = '123'.encode('utf-8')
print(s)
print(chardet.detect(s))

ss ='编程'.encode('utf-8')
print(chardet.detect(ss))

and the results are:

b'123'
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
{'encoding': 'utf-8', 'confidence': 0.7525, 'language': ''}

Why can't it detect s as UTF-8?

And why does it report ASCII?

Is the line # -*- coding: utf-8 -*- useless? I'm a Python newcomer, thanks!

alwayslz
  • I also read that Python uses Unicode by default. How can I verify that, and what does it affect? – alwayslz Sep 09 '17 at 14:37
  • Chardet uses heuristics. **It'll always be a guess**. – Martijn Pieters Sep 09 '17 at 14:46
  • And ASCII is entirely correct. ASCII is a subset of UTF-8. The first 128 characters of the Unicode standard are the same as the ASCII standard. Encoding only characters from the ASCII range to UTF-8 results in the **exact same bytes** as encoding those same characters to ASCII. – Martijn Pieters Sep 09 '17 at 14:47
  • So, '123'.encode('utf-8') means: convert the string "123" to a sequence of bytes using the UTF-8 encoding. – RedEyed Sep 09 '17 at 14:47
  • In other words, any valid ASCII document is also a valid UTF-8 document. – Martijn Pieters Sep 09 '17 at 14:49
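
The point made in the comments above can be checked directly. The following is a minimal sketch, assuming Python 3; the variable names are only for illustration. It shows that the ASCII-range string from the question encodes to the same bytes under both codecs, so the data itself gives a detector nothing to tell UTF-8 apart from ASCII, while the Chinese string produces bytes outside the ASCII range.

```python
# Characters from the ASCII range encode to identical bytes in ASCII and UTF-8.
text = '123'
ascii_bytes = text.encode('ascii')
utf8_bytes = text.encode('utf-8')
same_bytes = ascii_bytes == utf8_bytes
print(same_bytes)  # True

# Every byte is below 128, so the data is valid ASCII and valid UTF-8 at once.
all_below_128 = all(b < 128 for b in utf8_bytes)
print(all_below_128)  # True

# Non-ASCII characters become multi-byte sequences made of bytes >= 128.
chinese_bytes = '编程'.encode('utf-8')
has_high_bytes = any(b >= 128 for b in chinese_bytes)
print(len(chinese_bytes), has_high_bytes)  # 6 True

# The same text cannot be represented in ASCII at all.
try:
    '编程'.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)  # True
```

Reporting 'ascii' for b'123' is therefore the narrowest correct answer, not a failed detection.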

1 Answer


Let's just talk about these lines--all the meat is there:

s = '123'.encode('utf-8')
print(s)

You are correct that Python 3 uses Unicode by default: a str is a sequence of Unicode characters. When you call '123'.encode('utf-8') you convert that Unicode string to a sequence of bytes, which prints with the b prefix to remind you that it is a bytes object and not the default str type.
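
As a rough illustration of the distinction between the two types, here is a small sketch, assuming Python 3 (the names text, data and so on are made up for the example). It shows the type change caused by encode(), the round trip back through decode(), and that encode() with no argument uses UTF-8.

```python
text = '123'                 # str: a sequence of Unicode characters
data = text.encode('utf-8')  # bytes: the encoded form of that text

text_type = type(text).__name__
data_type = type(data).__name__
print(text_type, data_type)  # str bytes
print(data)                  # b'123'

# decode() reverses encode() when the same codec is used.
round_trip = data.decode('utf-8')
print(round_trip == text)    # True

# In Python 3, str.encode() defaults to UTF-8 when no codec is given.
default_is_utf8 = '编程'.encode() == '编程'.encode('utf-8')
print(default_is_utf8)       # True
```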

John Zwinck
  • The coding part is definitely used by Python *too*: https://www.python.org/dev/peps/pep-0263/ In Python 3, UTF-8 just happens to be the default. – Martijn Pieters Sep 09 '17 at 14:47
  • @MartijnPieters: Oh wow thanks for pointing that out. That feature is so totally out of line with the Python philosophy it's hard to imagine how it got in. – John Zwinck Sep 09 '17 at 23:50
  • Why is that out of line? Python 3 identifiers and string literals support Unicode characters, you *have* to have a method of setting an encoding for a source file. – Martijn Pieters Sep 10 '17 at 10:33
  • @MartijnPieters: Oh, I should have explained. It violates "explicit is better than implicit" by using another program's comment convention (Emacs); it is also probably the first and only feature of Python which changes program semantics based on a comment. – John Zwinck Sep 10 '17 at 11:27
  • No, read the PEP. Python *tolerates* additional information in the comment so you can re-use it for Emacs or VIM or any other editor that supports an encoding-in-a-comment setting. – Martijn Pieters Sep 10 '17 at 14:35
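
One way to see the behaviour discussed in these comments is the standard library's tokenize.detect_encoding, which reports the encoding Python would use to read a source file. This is only a sketch: the helper declared_encoding and the sample source lines are hypothetical, written for this example.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding that tokenize detects for these bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# An Emacs-style declaration, as in the question but naming latin-1.
emacs_style = declared_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n")

# A Vim-style comment is accepted too; the extra text around it is tolerated.
vim_style = declared_encoding(b"# vim: set fileencoding=latin-1 :\nx = 1\n")

# With no declaration at all, Python 3 falls back to UTF-8.
no_declaration = declared_encoding(b"x = 1\n")

print(emacs_style)     # iso-8859-1
print(vim_style)       # iso-8859-1
print(no_declaration)  # utf-8
```

So the line in the question is not ignored, but because it names UTF-8, which is already the Python 3 default, it changes nothing for that script.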