I'm reading a bunch of .txt.gz files, but they have different encodings (at least UTF-8 and cp1252; they are old, messy files). I try to detect the encoding of fIn
before reading it in text mode, but I get the error: TypeError: 'GzipFile' object is not callable
The corresponding code:
# detect encoding
with gzip.open(fIn, 'rb') as file:
    fInEncoding = tokenize.detect_encoding(file)  # this doesn't work
print(fInEncoding)

for line in gzip.open(fIn, 'rt', encoding=fInEncoding[0], errors="surrogateescape"):
    if line.find("From ") == 0:
        if lineNum != 0:
            out.write("\n")
        lineNum += 1
        line = line.replace(" at ", "@")
    out.write(line)
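For reference, a minimal sketch of the call signature tokenize.detect_encoding actually expects: a zero-argument callable that returns one line of bytes per call, i.e. the bound readline method, not the file object itself and not the bytes one readline() call returns (which explains both TypeErrors above). The sample file created here stands in for one of the .txt.gz archives; fIn is a hypothetical path.

```python
import gzip
import tokenize

# Hypothetical stand-in for one of the old mailing-list archives.
fIn = "sample.txt.gz"
with gzip.open(fIn, "wb") as f:
    f.write("From someone at example.com\n".encode("utf-8"))

# Pass the bound readline method itself (no parentheses): passing the
# GzipFile object raises "'GzipFile' object is not callable", and passing
# file.readline() raises "'bytes' object is not callable".
with gzip.open(fIn, "rb") as f:
    encoding, first_lines = tokenize.detect_encoding(f.readline)

print(encoding)  # utf-8 unless a BOM or PEP 263 coding cookie says otherwise
```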
Traceback:
$ ./mailmanToMBox.py list-cryptography.metzdowd.com
('Converting ', '2015-May.txt.gz', ' to mbox format')
Traceback (most recent call last):
File "./mailmanToMBox.py", line 65, in <module>
main()
File "./mailmanToMBox.py", line 27, in main
if not makeMBox(inFile,outFile):
File "./mailmanToMBox.py", line 48, in makeMBox
fInEncoding = tokenize.detect_encoding(file.readline()) #this doesn't works
File "/Users/simon/anaconda3/lib/python3.6/tokenize.py", line 423, in detect_encoding
first = read_or_stop()
File "/Users/simon/anaconda3/lib/python3.6/tokenize.py", line 381, in read_or_stop
return readline()
TypeError: 'bytes' object is not callable
EDIT: I tried the following code:
# detect encoding
readsource = gzip.open(fIn, 'rb').__next__
fInEncoding = tokenize.detect_encoding(readsource)
print(fInEncoding)
I get no error, but it always returns utf-8, even when that isn't the file's actual encoding. My text editor (Sublime Text) correctly detects the cp1252 encoding.
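That behaviour is expected: tokenize.detect_encoding is built for Python source files, so it only honours a UTF-8 BOM or a PEP 263 coding cookie (e.g. "# -*- coding: cp1252 -*-") in the first two lines, and otherwise defaults to utf-8. Mail archives have neither, so it will always report utf-8. One simple alternative is to try candidate encodings in order and keep the first that decodes cleanly (chardet or charset-normalizer are third-party options for real statistical detection). A sketch, with a hypothetical helper name:

```python
import gzip

def read_lines_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Decode a gzipped text file with the first encoding that works.

    UTF-8 is strict, so trying it first and falling back to cp1252 on a
    UnicodeDecodeError is a common heuristic for mixed legacy archives.
    """
    with gzip.open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc).splitlines(keepends=True), enc
        except UnicodeDecodeError:
            continue
    # Last resort: keep undecodable bytes instead of crashing.
    text = raw.decode(encodings[-1], errors="surrogateescape")
    return text.splitlines(keepends=True), encodings[-1]
```

The returned encoding name could then replace fInEncoding[0] in the text-mode gzip.open call, or the returned lines could be iterated directly.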