
I am trying to open and modify several text files. My files are encoded in Latin-1, but when I use `f.read()`, all the accented letters are converted into "ã". My code is:

import os
import re

for dname, dirs, files in os.walk("mydirection"):
    for fname in files:
        fpath = os.path.join(dname, fname)
        with open(fpath, encoding='latin-1') as f:
            text = f.read()
            text = text.replace(r'- ', '')
            # remove punctuation
            text = re.sub(r'[^\w\s]', ' ', text)
        with open(fpath, 'w', encoding='latin-1') as file:
            file.write(text)

Is it possible to write and change the text files and keep them in 'Latin-1'?

Example of text file: "élève,"

What I want: "élève" (or if it is not possible "eleve")

What I am obtaining: "ã l ã ve"

mkrieger1
MG Fern
  • You're using `encoding='latin-1'` when you read the files. Why aren't you also doing that when you write them? – John Gordon Apr 21 '23 at 15:25
  • Have you tried specifying the encoding on write? – asimoneau Apr 21 '23 at 15:25
  • If I specify the encoding when I write, it also does not work – MG Fern Apr 21 '23 at 15:45
  • Have you restored the original version of the files? Currently your code overwrites the existing files, so the next time you run your code it will read the file as you modified it, not as it was before you started coding. – slothrop Apr 21 '23 at 15:50
  • Yes, I restored the original files. Every time that I try, I delete the wrong final output, and copy the original files to the folder again. – MG Fern Apr 21 '23 at 15:55
  • The problem is in `f.read()`. That is the step that converts the "é" into "ã ". – MG Fern Apr 21 '23 at 15:58
  • The most plausible thing I can come up with is that the input file is in fact encoded as UTF-8 not Latin-1. In UTF-8, most accented characters (including `é`) are encoded as `0xc3` followed by another byte. If those bytes were decoded as Latin-1, then `0xc3` would decode to `Ã`. But that doesn't explain why you'd see lower-case `ã`. – slothrop Apr 21 '23 at 16:18
  • Thank you, slothrop for your comment. Is it not possible to solve in that case? – MG Fern Apr 21 '23 at 16:50
  • @MGFern if that *is* the case, then just specify `encoding='utf-8'` when reading (or omit it, since it's the default), and you can still specify `encoding='latin-1'` when writing if you want your output in that format. However, I'm not sure this *is* the problem because it doesn't explain getting lower-case `ã` rather than upper-case `Ã`. – slothrop Apr 21 '23 at 16:56
  • If I write `encoding='utf-8'` I get the error `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 74: invalid continuation byte` – MG Fern Apr 21 '23 at 17:03
  • Ah right, `0xe9` is `é` in Latin-1, so looks like I was on the wrong track there. Sorry, I'm stumped :( – slothrop Apr 21 '23 at 17:06
  • What happens if you don't do the `text.replace` and `re.sub` steps? – mkrieger1 Apr 24 '23 at 20:10
  • The documents do not change. I do not have the 'ã ' problem – MG Fern Apr 25 '23 at 09:41
  • I can't reproduce your problem. What version of Python are you using? – Booboo Apr 25 '23 at 17:19
  • 3.9. I find the same error with both notebook and pycharm – MG Fern Apr 25 '23 at 20:35
  • Please update your question with the file content read in binary mode. That is, please do: `print(open(fpath, 'rb').read())` and copy and paste the output to the question, so that we can see exactly what input data you're dealing with. – blhsing Apr 28 '23 at 07:27
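
A minimal sketch of the check requested in the last comment, using only the standard library: look at the raw bytes, then see whether they decode as UTF-8. The helper name `guess_encoding` is hypothetical, and the test is only a heuristic. Every byte sequence decodes as Latin-1, so a successful Latin-1 decode proves nothing, while a failed UTF-8 decode does rule UTF-8 out.

```python
def guess_encoding(data: bytes) -> str:
    """Return 'utf-8' if the bytes are valid UTF-8, else fall back to 'latin-1'.

    Hypothetical helper for illustration only: it cannot tell Latin-1
    apart from other single-byte encodings such as cp1252.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "latin-1"
    return "utf-8"


# Sample bytes standing in for open(fpath, 'rb').read()
latin1_bytes = "élève,".encode("latin-1")
utf8_bytes = "élève,".encode("utf-8")

print(latin1_bytes, guess_encoding(latin1_bytes))
print(utf8_bytes, guess_encoding(utf8_bytes))
```

A `0xe9` byte followed by a plain ASCII letter, as in the error message quoted in the comments, is not valid UTF-8, which is why the first sample falls back to Latin-1.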

2 Answers


Your file is probably encoded as UTF-8 (the default in most text editors). For example, the word élève is encoded as:

  • b'\xc3\xa9l\xc3\xa8ve' using UTF-8
  • b'\xe9l\xe8ve' using latin-1
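
The two byte sequences above can be reproduced directly in Python. This is a small check rather than part of the original answer, and the variable names are illustrative:

```python
word = "élève"

utf8_bytes = word.encode("utf-8")      # two bytes per accented letter
latin1_bytes = word.encode("latin-1")  # one byte per accented letter

print(utf8_bytes)
print(latin1_bytes)
```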

When you read a UTF-8 encoded file using the latin-1 encoding, each byte is decoded separately through the latin-1 byte map. The \xc3 byte becomes Ã, \xa9 becomes © and \xa8 becomes ¨, so the UTF-8 bytes of élève decoded as latin-1 print as Ã©lÃ¨ve. As stated in the re documentation, \w matches any Unicode word character (alphanumerics and the underscore). Here, © and ¨ are not word characters, so the pattern replaces them with spaces, leaving the unwanted Ã lÃ ve.
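
A minimal sketch of that mismatch, assuming the file really does hold UTF-8 bytes (the variable names are illustrative):

```python
import re

utf8_bytes = "élève".encode("utf-8")

# Decoding UTF-8 bytes with the wrong codec yields mojibake.
misread = utf8_bytes.decode("latin-1")
print(misread)

# '©' and '¨' are not word characters, so the pattern replaces them with spaces.
cleaned = re.sub(r"[^\w\s]", " ", misread)
print(cleaned)
```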

You must use the correct encoding. When manipulating data (reading and writing in code), you must know how it was encoded and how to decode it. You should never leave the encoding unspecified. Here you may have to use UTF-8:

import re

with open(fpath, encoding="utf-8") as f:
    text = f.read()
    print(text)
    text = text.replace(r"- ", "")
    # remove punctuation
    text = re.sub(r"[^\w\s]", " ", text)


with open(f"{fpath}.out", "w", encoding="utf-8") as file:
    file.write(text)
ftorre
  • I was thinking on these lines in the original comment (https://stackoverflow.com/questions/76074563/trying-to-rewrite-text-files-in-latin-1-results-in-wrong-characters-for-letter#comment134166726_76074563). But I can't explain why OP sees lower-case `ã` rather than `Ã`. Also, trying to open the file as UTF-8 throws `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9` according to OP - so it does seem to have `\xe9` (expected if it's Latin-1 encoded) rather than `\xa9` (expected in UTF-8). – slothrop Apr 29 '23 at 14:29
  • I was trying to figure out a way to work only with bytes but it seems impossible. I was trying to indicate that OP has to find the right encoding. – ftorre Apr 29 '23 at 14:41

The issue might be caused by the regular expression used to remove punctuation.

Instead of using r'[^\w\s]' on its own, try passing the re.UNICODE flag so that \w handles non-ASCII characters correctly. Can you try this?

text = re.sub(r'[^\w\s]', ' ', text, flags=re.UNICODE)
Grimlock
  • Thank you, Grimlock for your suggestion, but the problem persists. I also tried without the `replace` part – MG Fern Apr 25 '23 at 09:40
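
For context, a quick check (not from the answer above) of why the flag may make no difference: in Python 3, `str` patterns already use Unicode matching, so `\w` matches accented letters with or without `re.UNICODE`. The sample string is an assumption based on the example in the question.

```python
import re

sample = "élève,"

without_flag = re.sub(r"[^\w\s]", " ", sample)
with_flag = re.sub(r"[^\w\s]", " ", sample, flags=re.UNICODE)

# Both calls keep the accented letters and replace only the comma.
print(repr(without_flag), repr(with_flag))

# re.ASCII is the flag that changes behavior: accented letters stop matching \w.
ascii_only = re.sub(r"[^\w\s]", " ", sample, flags=re.ASCII)
print(repr(ascii_only))
```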