Replacement character (black diamond question mark) after every character in text

Question

I wrote a simple script in Colab to pull text files from my Drive, read each one into a string, run it through a function, and print the result. Files that were saved as ANSI come out fine. Files that were saved as "Unicode" come out with a black diamond question mark after every single character. How can I get rid of these? I have tried errors = 'ignore' as well, and a few other things, but I think I'm missing something fundamental about character encoding.

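import glob
import os

os.chdir('/content/drive/My Drive')
for path in glob.glob('*.txt'):
  # Every file is decoded as UTF-8; bytes that are not valid UTF-8 become U+FFFD
  with open(path, 'r', encoding='utf-8', errors='replace') as f:
    text = f.read()
  print(my_function(text))  # my_function is my own processing function (not shown)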

If the files are not using utf-8 encoding, you should not use `encoding = 'utf-8'`. — mkrieger1, Aug 17 '20 at 20:55
If you have a replacement character after every other character, try decoding with UTF-16. On Windows, this codec is often called "Unicode". — lenz, Aug 17 '20 at 20:59
Alright, going off what both of you said, I changed the code to read `with open(file, 'r', encoding = 'utf-16-le', errors='replace') as file:`, which outputs the Unicode files correctly and the ANSI files incorrectly. `with open(file, 'r', encoding = 'ascii', errors = 'replace') as file:` outputs the ASCII text correctly but not the Unicode text. — John G., Aug 17 '20 at 21:37
@lenz How can I get python to know the format before opening the file so that I can open each format correctly? — John G., Aug 17 '20 at 21:39
If the files are typical of Windows and "Unicode" (really, UTF-16) files start with a BOM, then you could try to read the file with `utf16` and if it fails with `UnicodeError`, switch to `ansi`. You could also try the [chardet](https://pypi.org/project/chardet/) module to guess the encoding. — Mark Tolonen, Aug 17 '20 at 21:57
@MarkTolonen I rewrote the code to try with utf-16 first and if it receives an error I use ascii. Works just fine. Thanks — John G., Aug 17 '20 at 23:32
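
For anyone landing here later, the approach suggested in the comments can be sketched roughly as follows. This is only a sketch under a few assumptions: the "Unicode" files start with a byte order mark (BOM), as Mark Tolonen's comment supposes, and "ANSI" means Windows-1252 (`cp1252`), which may not match the code page your files were actually saved with. The helper names `detect_encoding` and `read_text` are made up for illustration. It checks the BOM explicitly instead of relying on a `UnicodeError`, because a BOM-less file with an even number of bytes can decode as UTF-16 without raising and just produce garbage.

```python
import codecs


def detect_encoding(raw: bytes, fallback: str = "cp1252") -> str:
    """Pick a codec from the byte order mark, else return the fallback.

    Assumption: files without a BOM are "ANSI", taken here to be cp1252.
    """
    if raw.startswith(codecs.BOM_UTF8):
        # utf-8-sig strips the UTF-8 BOM while decoding
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The utf-16 codec reads the BOM to choose the byte order and strips it
        return "utf-16"
    return fallback


def read_text(path: str, fallback: str = "cp1252") -> str:
    """Read a file as bytes, then decode it with the detected codec."""
    with open(path, "rb") as f:
        raw = f.read()
    return raw.decode(detect_encoding(raw, fallback), errors="replace")
```

In the loop from the question, `read_text(path)` would take the place of the `open(...)`/`read()` pair. If the files might also be BOM-less UTF-8 or UTF-16, a guesser such as the chardet module mentioned above is probably the safer route.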
