How to read UTF file char by char in Python

Question

I have a UTF-8 file and I want to replace some characters that are 2 bytes long with HTML tags.

I wanted to write a Python script for that: just read the file character by character and apply a few if statements.

The problem is that when I read character by character, I am actually reading one byte at a time, but some characters are 1 byte long and some are 2 bytes long.

How can I solve this?

Basically, I need a way to read character by character that knows whether each character is 1 or 2 bytes long.

It would be helpful to post the code you have written so far, and to mention which Python version you're using. – Paulo Bu, May 17 '14 at 19:22
By "char" you mean a code point? They go up to 6 bytes in UTF-8 – Kos, May 17 '14 at 19:22
Please at least post an example of your file's content and how you want to read it. – shshank, May 17 '14 at 19:23

Tim Pietzcker · Accepted Answer · 2014-05-17T19:32:19.097

You need to open the file while specifying the correct encoding. In Python 3, that's done using

with open("myfile.txt", "r", encoding="utf-8-sig") as myfile:
    contents = myfile.read()
    for char in contents:
        # do something with the character
        pass
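
The "utf-8-sig" codec differs from plain "utf-8" only in how it treats a leading byte order mark (BOM). Here is a minimal Python 3 sketch using a throwaway temporary file; the file contents are made up for illustration:

```python
import os
import tempfile

# Hypothetical sample: a UTF-8 BOM followed by U+00E9 (2 bytes in UTF-8) and "a".
data = b"\xef\xbb\xbf" + "\u00e9a".encode("utf-8")

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(data)
    path = tmp.name

try:
    # Plain "utf-8" keeps the BOM as the character U+FEFF.
    with open(path, "r", encoding="utf-8") as f:
        plain = f.read()
    # "utf-8-sig" drops a leading BOM if one is present.
    with open(path, "r", encoding="utf-8-sig") as f:
        stripped = f.read()
finally:
    os.remove(path)

print(ascii(plain))
print(ascii(stripped))
```

If the file has no BOM, both codecs should give the same result, so "utf-8-sig" is a safe choice for reading either way.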

In Python 2, you can use the codecs module:

import codecs
with codecs.open("myfile.txt", "r", encoding="utf-8-sig") as myfile:
    contents = myfile.read()
    for char in contents:
        # do something with the character
        pass

Note that in this case, Python 2 will not do automatic newline conversion, so you need to handle \r\n line endings explicitly.
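
One way to handle the line endings explicitly is to normalize them yourself after decoding, assuming the file is small enough to read into memory. The sketch below should run on both Python 2.6+ and Python 3. The helper name read_text_normalized is hypothetical:

```python
import io
import os
import tempfile


def read_text_normalized(path, encoding="utf-8-sig"):
    """Hypothetical helper: decode a file and convert \\r\\n and \\r to \\n."""
    # Binary mode: no newline translation happens, as with codecs.open().
    with io.open(path, "rb") as f:
        text = f.read().decode(encoding)
    # Replace \r\n first so that it does not turn into two newlines.
    return text.replace(u"\r\n", u"\n").replace(u"\r", u"\n")


# Demonstration with a throwaway file that uses Windows line endings.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"one\r\ntwo\r\n")
    demo_path = tmp.name

try:
    result = read_text_normalized(demo_path)
finally:
    os.remove(demo_path)

print(repr(result))
```

Another option in Python 2.6+ is io.open("myfile.txt", "r", encoding="utf-8-sig"), which behaves like Python 3's open(), including newline translation.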

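As an alternative (Python 2), you can open the file normally and decode it afterwards. In text mode, Windows converts \r\n line endings to \n; to get the same normalization on every platform, open the file with mode "rU" (universal newlines):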

with open("myfile.txt", "r") as myfile:
    contents = myfile.read().decode("utf-8-sig")
    for char in contents:
        # do something with the character
        pass

Note that in both cases, you will end up with unicode objects in Python 2, not byte strings (str). In Python 3, all strings are Unicode.
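
Applied to the original question, a rough Python 3 sketch might look like this. The mapping from characters to tags is invented for illustration, since the question does not say which characters or tags are involved:

```python
# Hypothetical mapping; the question does not list the real characters or tags.
REPLACEMENTS = {
    "\u00e9": "<b>e</b>",  # U+00E9, 2 bytes in UTF-8
    "\u00fc": "<i>u</i>",  # U+00FC, 2 bytes in UTF-8
}


def utf8_length(char):
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(char.encode("utf-8"))


def replace_two_byte_chars(text, replacements=REPLACEMENTS):
    """Replace mapped 2-byte characters; leave everything else untouched."""
    pieces = []
    for char in text:
        if utf8_length(char) == 2 and char in replacements:
            pieces.append(replacements[char])
        else:
            pieces.append(char)
    return "".join(pieces)


# U+20AC (euro sign) takes 3 bytes in UTF-8 and is not in the mapping.
sample = "caf\u00e9 \u00fcber \u20ac"
converted = replace_two_byte_chars(sample)
print(ascii(converted))
```

Once the file has been decoded, each element of the string is a whole character regardless of how many bytes it took on disk, so the byte-length check is only needed if the 2-byte property itself matters.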

edited May 17 '14 at 19:32

answered May 17 '14 at 19:25


In Python 3, I think UTF-8 is the default file reading mode, making it not necessary to specify (but not a bad idea, either, if that's the explicit intention). Update: this is wrong, per comment below. – Ivan X May 17 '14 at 19:32
for printing each char: print char, ord( char ). – WebOrCode May 17 '14 at 19:32
Can you explain what you mean by "so you need to handle \r\n line endings explicitly"? Does this mean that all newlines are lost? If so, how can I preserve them? – WebOrCode May 17 '14 at 19:36
@IvanX: UTF-8 is the default *source code* encoding. The default encoding used by `open()` is OS-dependent. On Windows, it's cp1252, for example. – Tim Pietzcker May 17 '14 at 19:36
@TimPietzcker Ah, that makes sense. Thanks for clarifying. I was doing quick testing on Linux only which was how I arrived at that assumption. – Ivan X May 17 '14 at 19:37
@WebOrCode: Normally, Python converts all line endings to single `\n`s when opening a text file (and converts them back to the system standard for newlines when writing a text file). `codecs.open()` opens the file in binary mode, so line endings are left exactly as they are in the file. – Tim Pietzcker May 17 '14 at 19:37
@TimPietzcker The default encoding is as per `locale.getpreferredencoding`, and I'd expect it to vary according to the user settings on Windows. – Kos May 18 '14 at 10:42
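
To see which encoding Python 3's open() falls back to on a given machine when no encoding= argument is passed, something like the following can be run. The result depends on the platform and locale settings, so no particular output is assumed here:

```python
import locale

# Locale-dependent default that open() uses in text mode in Python 3
# when no encoding= argument is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

Passing encoding= explicitly, as in the answer, avoids depending on this value.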
