I'm reading a bunch of txt.gz files, but they have different encodings (at least UTF-8 and cp1252; they are old, dirty files). I try to detect the encoding of fIn before reading it in text mode, but I get the error: TypeError: 'GzipFile' object is not callable

The corresponding code:

    # detect encoding
    with gzip.open(fIn, 'rb') as file:
        fInEncoding = tokenize.detect_encoding(file)  # this doesn't work: raises the TypeError
        print(fInEncoding)

    for line in gzip.open(fIn, 'rt', encoding=fInEncoding[0], errors="surrogateescape"):
        if line.find("From ") == 0:
            if lineNum != 0:
                out.write("\n")
            lineNum += 1
            line = line.replace(" at ", "@")
        out.write(line)

Traceback

$ ./mailmanToMBox.py list-cryptography.metzdowd.com
 ('Converting ', '2015-May.txt.gz', ' to mbox format')
 Traceback (most recent call last):
  File "./mailmanToMBox.py", line 65, in <module>
    main()
  File "./mailmanToMBox.py", line 27, in main
    if not makeMBox(inFile,outFile):
  File "./mailmanToMBox.py", line 48, in makeMBox
    fInEncoding = tokenize.detect_encoding(file.readline()) #this doesn't works
  File "/Users/simon/anaconda3/lib/python3.6/tokenize.py", line 423, in detect_encoding
    first = read_or_stop()
  File "/Users/simon/anaconda3/lib/python3.6/tokenize.py", line 381, in read_or_stop
    return readline()
 TypeError: 'bytes' object is not callable

EDIT: I tried the following code:

# detect encoding
readsource =  gzip.open(fIn,'rb').__next__
fInEncoding = tokenize.detect_encoding(readsource)
print(fInEncoding)

I get no error, but it always returns utf-8, even when the file isn't UTF-8. My text editor (Sublime Text) correctly detects the cp1252 encoding.

gagarine

1 Answer

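As the documentation of detect_encoding() says, its input parameter has to be a callable that provides lines of input. That's why you get TypeError: 'GzipFile' object is not callable. Pass the file's readline method itself, rather than the file object or the result of calling readline():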

import gzip
import tokenize

# Open the .gz file in binary mode and pass f.readline itself (no parentheses)
with gzip.open(fIn, 'rb') as f:
    codec = tokenize.detect_encoding(f.readline)[0]

codec will then be "utf-8" or a similar codec name.
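To make the difference between passing a callable and passing something else concrete, here is a minimal sketch. It builds a small gzip stream in memory, so the sample text and the `raw` variable are made up for illustration and are not taken from the question's files. It first passes the file object, which fails, and then the bound `readline` method, which works.

```python
import gzip
import io
import tokenize

# Hypothetical sample data: two plain ASCII lines, gzip-compressed in memory.
raw = gzip.compress(b"From alice at example.com\nHello\n")

# Passing the file object itself fails: detect_encoding() tries to call it.
with gzip.open(io.BytesIO(raw), "rb") as f:
    try:
        tokenize.detect_encoding(f)
        failed = False
    except TypeError:
        failed = True

# Passing the bound method works: detect_encoding() calls it to fetch lines.
with gzip.open(io.BytesIO(raw), "rb") as f:
    codec, lines_read = tokenize.detect_encoding(f.readline)

print(failed, codec)
```

Note that `detect_encoding()` also returns the lines it consumed (`lines_read` here), so the file is no longer positioned at the start afterwards.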

user2722968
  • But why is file.readlines() not callable? I understood that I have to pass a callable object, but I didn't understand how. (I'm new to Python.) – gagarine May 09 '18 at 07:02
  • The `detect_encoding()` function will call the function you pass in to get lines of text from the input. It may call it multiple times if it needs more lines to detect the encoding. If you write `detect_encoding(f.readlines())`, the result of `f.readlines()` gets passed in (the whole file, as a list of lines), which is not what `detect_encoding` needs. Its argument means "give me something I can call without any further arguments that gets me more text if I need any". Updated the answer. – user2722968 May 09 '18 at 07:55
  • Your solution works, but I have the same problem as with the solution I added when I edited my question a little earlier, using `readsource = gzip.open(fIn,'rb').__next__` `fInCodec = tokenize.detect_encoding(readsource)[0]`. It always returns UTF-8, even when there are characters that are not UTF-8. But I guess that's another problem. – gagarine May 09 '18 at 08:11
  • This is a limitation of `tokenize.detect_encoding()`, which is designed to detect the encoding of Python source files. Specifically, it only looks at the first two lines, and only for a BOM (which is not always present) and the Python-specific encoding cookie (which is never present in general text files). For non-Python files, a library like [chardet](https://pypi.org/project/chardet/) is probably a better fit. Also see [here](https://stackoverflow.com/questions/436220/determine-the-encoding-of-text-in-python) – user2722968 May 09 '18 at 11:44
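For the specific case in the question, where the files are either UTF-8 or cp1252, a stdlib-only alternative to a detection library is to try a strict UTF-8 decode and fall back to cp1252. The sketch below is an illustration under that assumption, not a general detector and not what chardet does: `guess_encoding` and its `candidates` parameter are hypothetical names, the sample data is made up, and the whole decompressed file is read into memory.

```python
import gzip
import io


def guess_encoding(data, candidates=("utf-8", "cp1252")):
    """Return the first candidate codec that decodes `data` without error."""
    for codec in candidates:
        try:
            data.decode(codec)  # strict decoding raises on invalid bytes
            return codec
        except UnicodeDecodeError:
            continue
    return None  # none of the candidates fit


# Hypothetical sample files, built in memory instead of read from disk.
utf8_gz = gzip.compress("café\n".encode("utf-8"))
cp1252_gz = gzip.compress("café\n".encode("cp1252"))

for blob in (utf8_gz, cp1252_gz):
    with gzip.open(io.BytesIO(blob), "rb") as f:
        print(guess_encoding(f.read()))
```

The order of the candidates matters: UTF-8 has to be tried first, because cp1252 accepts almost any byte sequence and would otherwise match UTF-8 files as well. The returned name could then be passed as the `encoding` argument of `gzip.open(fIn, 'rt', ...)`.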