2

There are some other questions here regarding this problem, but none of them has fixed my problem so far.

I have a large (40 MB) CSV file. Most of the file is encoded in ISO-8859-1 (Latin-1), but some entries (just individual entries!) are in UTF-8.

If I try to open the file with UTF-8 encoding, Python immediately throws an encoding error. If I open the file as ISO-8859-1, it can be read, but some entries remain mojibake.

I tried the following code to fix the issues line by line, but obviously I'm missing something, because the UTF-8 entries remain mojibake.


import os
import ftfy

# There are issues with file encoding with this file. Try to convert it to proper utf-8
sourceEncoding = "iso-8859-1"
targetEncoding = "utf-8"
sourceFileName = os.path.join(folder, file + config.CSV_EXTENSION)
targetFileName = os.path.join(folder, file + "_temp" + config.CSV_EXTENSION)

# input file
csvfile = open(sourceFileName, "r", encoding = sourceEncoding)
Lines = csvfile.readlines()
for line in Lines:
    line = ftfy.fix_encoding(line) 

# output stream
outfile = open(targetFileName, "w", encoding = targetEncoding) # Windows doesn't like utf8
outfile.writelines(Lines)

# close files
csvfile.close()
outfile.close()

os.remove(sourceFileName)
os.rename(targetFileName, sourceFileName)

One specific string i have looks like this:

Ãberarbeitung A6 Heft

I want it to look like this:

Überarbeitung A6 Heft

Edit:

Some clarifications.

I assume there are some encoding issues in the file, because I know there are two different sources for the entries in the CSV. Most of them were typed into a GUI. Some values come from a self-written script with god knows what encoding.

If I open the CSV in VSCode, it assumes ISO-8859-1. But then some entries look like the one I mentioned above:

Ãberarbeitung A6 Heft

If I change the encoding to UTF-8, this entry becomes correct:

Überarbeitung A6 Heft

But then other entries change for the worse:

Testdurchf�hrung

The error message when trying to open the file with utf-8 encoding is:

Exception has occurred: UnicodeDecodeError
'utf-8' codec can't decode byte 0xe4 in position 56: invalid continuation byte

I will try to read the CSV in binary mode and decode it line by line. Maybe this will do the trick.

dibade89
  • There's no such thing as "mixed encodings". That file isn't UTF8 either. In UTF8 any character outside the 7-bit US-ASCII range uses 2 bytes. The US-ASCII characters are identical to Latin1. Both `Ã` and `Ü` would appear as two characters if you use Latin1 to read a UTF8 file – Panagiotis Kanavos Mar 15 '23 at 16:14
  • Use `chardet` to try and detect what encoding is used in that file. The very fact you get a *single* character for `Ü` means it's not any Unicode encoding. It's certainly not Latin1 or US-ASCII – Panagiotis Kanavos Mar 15 '23 at 16:16
  • As for `# Windows doesn't like utf8`, well actually, Windows is a native Unicode OS, unlike Linux. Since the very first Windows NT version back in 1994, Windows was built to be exclusively Unicode. The ASCII APIs were meant for compatibility reasons with legacy applications until they too were recompiled to Unicode. The "system locale" isn't the system locale. As the settings page explains, it's the default encoding used by non-Unicode applications – Panagiotis Kanavos Mar 15 '23 at 16:18
  • Turkish maybe? ISO 8859-9? `chardet.detect(codecs.encode("Ãberarbeitung A6 Heft",encoding="latin1"))` returns `'encoding': 'ISO-8859-9', 'confidence': 0.8516926783630553, 'language': 'Turkish'` – Panagiotis Kanavos Mar 15 '23 at 16:21
  • Your file _is_ `utf-8` supposedly. `'Ü'.encode('utf-8').decode('latin1')` returns `Ã\x9c`, where the last character is the _unprintable_ `\x9c` (U+009C, *String Terminator*)… – JosefZ Mar 15 '23 at 16:57
  • 1
    Hi. Please edit your question and include the actual error you get when trying to open the file as UTF-8. – Zach Young Mar 15 '23 at 18:16
  • 1
    @PanagiotisKanavos, 'In UTF8 any character outside the 7-bit US-ASCII range uses 2 bytes.' I believe that should be, '...uses **2-4** bytes.' – Zach Young Mar 15 '23 at 18:26
  • Note that `latin1` will decode *anything*, so it should be used as a last resort. UTF-8 has a very specific decoding and should be attempted first. For files coming from Windows, `Windows-1252` is an encoding similar to `latin-1` but with a few more specific characters, such as the Euro sign and smart quotes, that `latin-1` will decode incorrectly. – Mark Tolonen Mar 15 '23 at 22:42
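
To make Mark Tolonen's point concrete, here is a minimal sketch (the byte value is a made-up illustration, not taken from the file): the same byte decodes to an invisible control character under Latin-1 but to the Euro sign under Windows-1252.

raw = b"\x80"                   # hypothetical byte, not from the actual CSV
print(raw.decode("latin-1"))    # '\x80' (U+0080), an invisible control character
print(raw.decode("cp1252"))     # '€', the Euro sign in Windows-1252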

3 Answers

3

You can read the file in binary mode and decode each line manually:

def try_decode(b, encodings):
    for enc in encodings:
        try:
            return b.decode(enc)
        except UnicodeDecodeError:
            pass
    raise ValueError('no matching encoding!')

with open(YOUR_FILE, 'rb') as fp:
    for b in fp:
        line = try_decode(b, ['utf8', 'latin1'])
gog
1

It's possible for a file to be written with "mixed encodings", and, based on your comment on Panagiotis Kanavos' answer, it even seems likely:

# Process 1
with open("input.csv", "w", encoding="utf-8") as f_in:
    f_in.write("Überarbeitung 1\n")

# Process 2 (not aware of Process 1's intent)
with open("input.csv", "a", encoding="iso-8859-1") as f_in:
    f_in.write("Testdurchführung 2\n")

If your file is a mix of ISO-8859-1 (latin1) and UTF-8, then @gog's answer gives you a template for fixing.

Be sure to specify UTF-8 before ISO-8859-1, otherwise a valid UTF-8 multibyte sequence will be decoded as a sequence of individual single-byte latin1 characters:

# Adapt @gog's solution
def try_decode(b: bytes) -> str | None:
    for enc in ["utf-8", "iso-8859-1"]:
        try:
            s = b.decode(enc)
            print(f"{enc:<10}: {b}")  # log the encoding
            return s
        except UnicodeDecodeError:
            pass

    return None


with (
    open("input.csv", "rb") as f_in,
    open("output.csv", "w", encoding="utf-8") as f_out,
):
    for i, line in enumerate(f_in, start=1):
        s = try_decode(line)
        if s:
            f_out.write(s)
        else:
            print(f"could not decode line {i}: {line}")

That prints:

utf-8     : b'\xc3\x9cberarbeitung 1\n'
iso-8859-1: b'Testdurchf\xfchrung 2\n'

and output.csv is uniform UTF-8.
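
As a quick sanity check (a sketch, using the same file name as above), reading the result back with an explicit UTF-8 encoding should now succeed without errors:

with open("output.csv", "r", encoding="utf-8") as f:
    for _ in f:
        pass  # any remaining non-UTF-8 byte sequence would raise UnicodeDecodeError here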

Zach Young
-1

Try ISO-8859-9, i.e. Turkish, instead of Latin-1. Or use chardet on the actual file to detect possible encodings.
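
A rough sketch of running chardet over the file itself (the file name and sample size are placeholders): feed it a chunk of the raw bytes and look at the guess and its confidence.

import chardet

with open("input.csv", "rb") as f:   # placeholder file name
    sample = f.read(100_000)         # a reasonably large sample of raw bytes

print(chardet.detect(sample))        # e.g. {'encoding': ..., 'confidence': ..., 'language': ...}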


There's no such thing as "mixed encodings". That file isn't UTF8 either. In UTF8 any character outside the 7-bit US-ASCII range uses 2 bytes. The US-ASCII characters are identical to Latin1. Both Ã and Ü would appear as two characters if you use Latin1 to read a UTF8 file.

This isn't a Windows issue either - Windows is a native Unicode OS. Since the very first Windows NT version back in 1994, Windows was built to be exclusively Unicode. The ASCII APIs were meant for compatibility with legacy applications until they, too, were recompiled to Unicode. The "system locale" isn't really the system locale: as the settings page explains, it's the default encoding used by non-Unicode applications.

You can use the chardet package to try to detect the encoding used in a file. It does this by checking for invalid byte sequences and by statistical analysis, finding the codepage that best matches the byte frequencies in the file.

With just a single phrase, detection is rather chancy, but this

chardet.detect_all(codecs.encode("Ãberarbeitung A6 Heft",encoding="latin1"))

produces:

[
    {'encoding': 'ISO-8859-9', 'confidence': 0.8516926783630553, 'language': 'Turkish'}, 
    {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}, 
    {'encoding': 'MacRoman', 'confidence': 0.494516129032258, 'language': ''}
]

We already know it's not Latin1, and I highly doubt it's MacRoman.

UTF8 would encode either à or Ü as two bytes. All the characters in the US-ASCII range would use the same byte values as US-ASCII.

>>> codecs.encode("Ãberarbeitung A6 Heft")
b'\xc3\x83berarbeitung A6 Heft'
>>> codecs.encode("Überarbeitung A6 Heft")
b'\xc3\x9cberarbeitung A6 Heft'
Panagiotis Kanavos
  • 2
    I'm quite sure it's somehow mixed, because there are two different sources for how the data came into the CSV. The better ones (which I assumed to be Latin-1) were typed directly into a GUI, while the worse ones came from a self-written script with god knows what encoding. – dibade89 Mar 15 '23 at 18:35