I wrote a simple script in Colab to pull text files from my Drive, read each one into a string, run it through a function, and print the result. Files saved as ANSI come out fine. Files saved as Unicode come out with a black diamond question mark after every single character. How can I get rid of these? I have tried `errors='ignore'` and a few other things, but I think I'm missing something fundamental about character encoding.

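import glob
import os

os.chdir('/content/drive/My Drive')
for filename in glob.glob("*.txt"):
  # Use a separate name for the file handle so it does not shadow the filename.
  with open(filename, 'r', encoding='utf-8', errors='replace') as f:
    Text = f.read()
  print(my_function(Text))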
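
For reference, here is a minimal sketch of the symptom, assuming the "Unicode" files are UTF-16-LE with a byte order mark (what Windows Notepad writes under that name). The sample string and variable names are made up for illustration. Decoding those bytes as UTF-8 turns the two BOM bytes into replacement characters and leaves a NUL character after every ASCII letter, and some consoles render those NULs as diamonds too.

```python
import codecs

# "Hi" as UTF-16-LE with a BOM (assumed to match the "Unicode" files).
raw = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")

# Wrong codec: the BOM bytes become U+FFFD and each letter is followed by NUL.
as_utf8 = raw.decode("utf-8", errors="replace")
print(repr(as_utf8))

# Right codec: the BOM is consumed and the text comes back intact.
as_utf16 = raw.decode("utf-16")
print(repr(as_utf16))
```

Passing `errors='ignore'` only drops the undecodable BOM bytes; the NUL characters are valid UTF-8 and stay in the string.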
Thomas Dickey
John G.
  • If the files are not using utf-8 encoding, you should not use `encoding = 'utf-8'`. – mkrieger1 Aug 17 '20 at 20:55
  • If you have a replacement character after every other character, try decoding with UTF-16. On Windows, this codec is often called "Unicode". – lenz Aug 17 '20 at 20:59
  • Alright, going off what both of you said, I changed the code to read `with open(file, 'r', encoding = 'utf-16-le', errors='replace') as file:`, which outputs the Unicode files correctly and the ANSI files incorrectly. `with open(file, 'r', encoding = 'ascii', errors = 'replace') as file:` outputs the ASCII text correctly but not the Unicode text. – John G. Aug 17 '20 at 21:37
  • @lenz How can I get python to know the format before opening the file so that I can open each format correctly? – John G. Aug 17 '20 at 21:39
  • If the files are typical of Windows and "Unicode" (really, UTF-16) files start with a BOM, then you could try to read the file with `utf16` and if it fails with `UnicodeError`, switch to `ansi`. You could also try the [chardet](https://pypi.org/project/chardet/) module to guess the encoding. – Mark Tolonen Aug 17 '20 at 21:57
  • @MarkTolonen I rewrote the code to try with utf-16 first and if it receives an error I use ascii. Works just fine. Thanks – John G. Aug 17 '20 at 23:32
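
The approach from the comments could be sketched roughly as below. This is not the asker's actual code. It swaps the try/except for an explicit BOM check, because many byte sequences decode as UTF-16 without raising, so the exception may never fire on an ANSI file. The function name `read_text_guessing` and the `cp1252` fallback are assumptions (cp1252 is the usual Western European "ANSI" code page, so adjust it if the files come from another locale), and UTF-32 is not handled.

```python
import codecs
from pathlib import Path


def read_text_guessing(path, fallback="cp1252"):
    """Decode a text file as UTF-16 or UTF-8 if it has a BOM, else use fallback."""
    raw = Path(path).read_bytes()
    # A UTF-16 BOM (either byte order) marks the Windows "Unicode" files.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")  # the codec reads and strips the BOM
    # UTF-8 with a BOM, as some Windows editors write it.
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")
    # No BOM: assume an "ANSI" code page (hypothetical default, see above).
    return raw.decode(fallback, errors="replace")
```

In the loop from the question, this would replace the `open(...)` block with `Text = read_text_guessing(filename)`. For files without a BOM in an unknown encoding, the chardet module mentioned above is the other option.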
