12

I have a huge text file that I want to open.
I'm reading the file in chunks to avoid the memory issues that come with reading too much of it at once.

code snippet:

import re

def open_delimited(fileName, args):

    with open(fileName, args, encoding="UTF16") as infile:
        chunksize = 10000
        remainder = ''
        for chunk in iter(lambda: infile.read(chunksize), ''):
            pieces = re.findall(r"(\d+)\s+(\d+_\d+)", remainder + chunk)
            for piece in pieces[:-1]:
                yield piece
            remainder = '{} {} '.format(*pieces[-1]) 
        if remainder:
            yield remainder

The code throws the error UnicodeDecodeError: 'utf16' codec can't decode bytes in position 8190-8191: unexpected end of data.

I tried UTF8 and got the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte.

latin-1 and iso-8859-1 raised the error IndexError: list index out of range.

A sample of the input file:

b'\xff\xfe1\x000\x000\x005\x009\x00\t\x001\x000\x000\x005\x009\x00_\x009\x007\x004\x007\x001\x007\x005\x003\x001\x000\x009\x001\x00\t\x00\t\x00P\x00o\x00s\x00t\x00\t\x001\x00\t\x00H\x00a\x00p\x00p\x00y\x00 \x00B\x00i\x00r\x00t\x00h\x00d\x00a\x00y\x00\t\x002\x000\x001\x001\x00-\x000\x008\x00-\x002\x004\x00 \x00'

I will also mention that I have several of these huge text files.
UTF16 works fine for many of them, but fails on this specific file.

Is there any way to resolve this issue?

Presen
  • If your inputfile *is* UTF-16 (albeit truncated), then Latin1 or UTF-8 will certainly not work. – Martijn Pieters Aug 21 '13 at 12:41
  • Can we see a sample of your inputfile? Then at least we can take a stab at guessing the encoding used. Read the file as binary, and print that. `print(open(fileName, 'rb').read(120))` should give us enough to work with. – Martijn Pieters Aug 21 '13 at 12:43
  • @MartijnPieters I added a sample of the input file. – Presen Aug 21 '13 at 12:58
  • 2
    That is most definitely UTF16. If that data is corrupted somewhere, there is little we can do to fix that. You *could* try a different chunk size, perhaps there is a bug in `TextIOWrapper.read()` where it ends up with a partial read of a surrogate pair. I recommend a power of 2. `16384` is 2**14, for example. – Martijn Pieters Aug 21 '13 at 13:02
  • In any case, trying to use any other codec is not going to work. – Martijn Pieters Aug 21 '13 at 13:03
  • @MartijnPieters I tried `16384`. It didn't work. I can accept a solution where the corrupted parts of the data are ignored. What would be a good way to do so? – Presen Aug 21 '13 at 14:04
  • That is going to be *hard*. You'd have to detect exactly what offset in the file that would be, bypass the buffer, seek, clear the buffer, then read again. The offset should be calculable from how much you've read so far plus the offset named in the exception. – Martijn Pieters Aug 21 '13 at 14:19
  • Even then, you'd still have to deal with the fact your data file is corrupted in at least one location. How many more corrupted bytes will be present? Is the data recoverable *at all*? – Martijn Pieters Aug 21 '13 at 14:30
  • The alternative is to ignore errors altogether by setting `errors='ignore'` on the `open()` call. – Martijn Pieters Aug 21 '13 at 14:34
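
For reference, the sample bytes posted in the question do decode cleanly as UTF-16 (the leading \xff\xfe is a little-endian byte-order mark), which supports the diagnosis in the comments above; a quick check:

    sample = b'\xff\xfe1\x000\x000\x005\x009\x00\t\x001\x000\x000\x005\x009\x00_\x009\x007\x004\x007\x001\x007\x005\x003\x001\x000\x009\x001\x00\t\x00\t\x00P\x00o\x00s\x00t\x00\t\x001\x00\t\x00H\x00a\x00p\x00p\x00y\x00 \x00B\x00i\x00r\x00t\x00h\x00d\x00a\x00y\x00\t\x002\x000\x001\x001\x00-\x000\x008\x00-\x002\x004\x00 \x00'

    # 'utf-16' consumes the BOM and picks the right byte order automatically.
    print(sample.decode('utf-16'))
    # -> 10059\t10059_974717531091\t\tPost\t1\tHappy Birthday\t2011-08-24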

3 Answers

10

To ignore corrupted data (which can lead to data loss), set errors='ignore' on the open() call:

with open(fileName, args, encoding="UTF16", errors='ignore') as infile:

The open() function documentation states:

  • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.

This does not mean you can recover from the apparent data corruption you are experiencing.

To illustrate, imagine a byte was dropped or added somewhere in your file. UTF-16 is a codec that uses 2 bytes per character. If there is one byte missing or surplus then all byte-pairs following the missing or extra byte are going to be out of alignment.

That can lead to problems decoding further down the line, not necessarily immediately. There are some codepoints in UTF-16 that are illegal, but usually because they are used in combination with another byte-pair; your exception was thrown for such an invalid codepoint. But there may have been hundreds or thousands of byte-pairs preceding that point that were valid UTF-16, if not legible text.
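
To make the misalignment concrete, a small sketch (not from the original answer): drop a single byte from a UTF-16-LE string and every byte-pair after it decodes to the wrong character.

# Two bytes per character in UTF-16-LE.
good = "HELLO WORLD".encode("utf-16-le")

# Simulate corruption: drop one byte in the middle of the data.
corrupt = good[:6] + good[7:]

print(good.decode("utf-16-le"))                      # HELLO WORLD
print(corrupt.decode("utf-16-le", errors="ignore"))  # HEL followed by misaligned gibberish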

Martijn Pieters
  • I tried the `errors='ignore'` and get `remainder = '{} {} '.format(*pieces[-1]) IndexError: list index out of range` – Presen Aug 21 '13 at 14:45
  • Right, because now you are apparently ending up with a chunk where `re.findall()` returns *no matches at all*. That is the risk of ignoring invalid characters; if *one* byte is missing in your file, then the UTF-16 decoding may be unreadable now; it is effectively not detectable *what* byte is missing in that case and the exception you saw could be well past the file corruption. – Martijn Pieters Aug 21 '13 at 14:54
4

I was doing the same thing (reading many large text files in chunks) and ran into the same error with one of the files:

Traceback (most recent call last):
  File "wordcount.py", line 128, in <module>
    decodedtext = rawtext.decode('utf8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 9999999: unexpected end of data

Here's what I found: the problem was a particular multi-byte UTF-8 sequence (\xc2\xa0\xc2\xa0) spanning two chunks. Because the sequence was split across the chunk boundary, it could not be decoded. Here's how I solved it:

# read text
rawtext = file.read(chunksize)

# fix split end: keep reading one byte at a time until the chunk ends on a space
if chunknumber < totalchunks:
    while rawtext[-1] != ' ':
        rawtext = rawtext + file.read(1)

# decode text
decodedtext = rawtext.decode('utf8')

This also solves the more general problem of words being cut in half when they span two chunks.
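
A more general alternative is to let an incremental decoder carry any partial multi-byte sequence across chunk boundaries, which works even when a chunk does not end on a space. A minimal sketch (hypothetical helper name, not part of the original answer):

import codecs

def read_decoded_chunks(path, encoding='utf-8', chunksize=16384):
    # The incremental decoder buffers an incomplete multi-byte sequence at the
    # end of one chunk and completes it with the bytes from the next read.
    decoder = codecs.getincrementaldecoder(encoding)()
    with open(path, 'rb') as infile:
        for raw in iter(lambda: infile.read(chunksize), b''):
            text = decoder.decode(raw)
            if text:
                yield text
        # Flush anything still buffered at end of file.
        tail = decoder.decode(b'', final=True)
        if tail:
            yield tail

Note that this only handles sequences split by chunking; it cannot repair bytes that are genuinely corrupted in the file.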

Parzival
0

This can also happen in Python 3 when you read/write an io.StringIO object instead of an io.BytesIO.

Davi Lima