
I have a UTF-8 file, and I want to replace some characters that are 2 bytes long with HTML tags.

I wanted to write a Python script for that: just read the file character by character and handle each one with some if statements.

The problem is that when I read character by character, I am actually reading one byte at a time, but some characters are 1 byte long and some are 2 bytes long.

How can I solve this?

I basically need a way to read character by character that knows whether each character is 1 or 2 bytes long.

Tim Pietzcker
WebOrCode

1 Answer


You need to open the file while specifying the correct encoding. In Python 3, that's done using

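# "utf-8-sig" decodes UTF-8 and drops a leading byte order mark (BOM) if present
with open("myfile.txt", "r", encoding="utf-8-sig") as myfile:
    contents = myfile.read()
    # Each item is one whole character, however many bytes it took in the file
    for char in contents:
        pass  # do something with character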
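
As a rough sketch of the replacement step the question asks about: once the text is decoded, each item in the loop is a whole character, regardless of how many bytes it occupied in the file. The mapping `REPLACEMENTS` and the helper names `replace_chars` and `utf8_size` are made up for illustration; the real character-to-tag table depends on your data.

```python
# Hypothetical mapping from single characters to HTML snippets (an assumption,
# not something given in the question); adjust it to your own data.
REPLACEMENTS = {
    u"\u00e9": u"<b>e</b>",   # e with acute accent, 2 bytes in UTF-8
    u"\u00df": u"<i>ss</i>",  # sharp s, 2 bytes in UTF-8
}


def replace_chars(text, replacements=REPLACEMENTS):
    """Return text with every mapped character swapped for its HTML snippet."""
    out = []
    for char in text:
        # char is one decoded character; unmapped characters pass through as-is
        out.append(replacements.get(char, char))
    return u"".join(out)


def utf8_size(char):
    """Return how many bytes a single character occupies when encoded as UTF-8."""
    return len(char.encode("utf-8"))
```

If the byte length matters for deciding what to replace, `utf8_size(char)` recovers it from the decoded character, so there is no need to read the file byte by byte.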

In Python 2, you can use the codecs module:

import codecs
# codecs.open() decodes while reading, so contents is a unicode object
with codecs.open("myfile.txt", "r", encoding="utf-8-sig") as myfile:
    contents = myfile.read()
    for char in contents:
        pass  # do something with character

Note that codecs.open() always opens the underlying file in binary mode, so in this case Python 2 will not do automatic newline conversion, and you need to handle \r\n line endings explicitly.

As an alternative (Python 2), you can open the file normally in text mode and decode it afterwards; on Windows, text mode converts \r\n line endings to \n (mode "rU" enables universal newline handling on every platform):

with open("myfile.txt", "r") as myfile:
    # read() returns bytes (str) in Python 2; decode them to unicode
    contents = myfile.read().decode("utf-8-sig")
    for char in contents:
        pass  # do something with character

Note that in both cases, you will end up with unicode objects in Python 2, not byte strings (str). In Python 3, all strings are Unicode.
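
If the same script has to run on both versions, one option is `io.open()`, which is available from Python 2.6 onwards and is the same function as the built-in `open()` in Python 3, so it both decodes the text and translates line endings. The sketch below is only an illustration: `read_text`, `demo_path` and `demo_text` are made-up names, and the temporary file stands in for your real input.

```python
import io
import os
import tempfile


def read_text(path):
    """Read a UTF-8 file (with or without a BOM) and return the decoded text."""
    # In text mode with the default newline setting, io.open() translates
    # \r\n and \r to \n while reading.
    with io.open(path, "r", encoding="utf-8-sig") as f:
        return f.read()


# Demonstration file: a BOM, a 2-byte character and a Windows line ending.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with io.open(demo_path, "wb") as f:
    f.write(b"\xef\xbb\xbfcaf\xc3\xa9\r\nok")

demo_text = read_text(demo_path)
os.remove(demo_path)

for char in demo_text:
    pass  # each char is one whole character, not one byte
```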

Tim Pietzcker
  • In Python 3, I think UTF-8 is the default file reading mode, making it not necessary to specify (but not a bad idea, either, if that's the explicit intention). Update: this is wrong, per comment below. – Ivan X May 17 '14 at 19:32
  • for printing each char: print char, ord( char ). – WebOrCode May 17 '14 at 19:32
  • Can you explain what you mean by "so you need to handle \r\n line endings explicitly"? Does this mean that all newlines are lost? If so, how can I preserve them? – WebOrCode May 17 '14 at 19:36
  • @IvanX: UTF-8 is the default *source code* encoding. The default encoding used by `open()` is OS-dependent. On Windows, it's cp1252, for example. – Tim Pietzcker May 17 '14 at 19:36
  • @TimPietzcker Ah, that makes sense. Thanks for clarifying. I was doing quick testing on Linux only which was how I arrived at that assumption. – Ivan X May 17 '14 at 19:37
  • @WebOrCode: Normally, Python converts all line endings to single `\n`s when opening a text file (and converts them back to the system standard for newlines when writing a text file). `codecs.open()` doesn't change the newline format used by the OS. – Tim Pietzcker May 17 '14 at 19:37
  • @TimPietzcker The default encoding is as per `locale.getpreferredencoding`, and I'd expect it to vary according to the user settings on Windows. – Kos May 18 '14 at 10:42