I would like to know how to remove all bytes which can't be decoded. Is there a solution?
This is simple:
with open('filename', 'r', encoding='utf8', errors='ignore') as f:
    ...
The errors='ignore' tells Python to drop unrecognized characters. It can also be passed to bytes.decode() and most other places which take an encoding argument.
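For example, the same flag works on bytes you already hold in memory (a toy illustration, not taken from the question):

raw = b'caf\xc3\xa9 \xff broken'  # one invalid byte (0xff) in otherwise valid UTF-8
print(raw.decode('utf8', errors='ignore'))  # the 0xff is silently dropped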
Since this decodes the bytes into unicode, it may not be suitable for an XML parser that wants to consume bytes. In that case, you should write the data back to disk (e.g. using shutil.copyfileobj()) and then re-open the file in 'rb' mode.
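A minimal sketch of that round trip (the file names dirty.xml and clean.xml are placeholders):

import shutil

# Decode with errors='ignore' and re-encode to a clean copy on disk...
with open('dirty.xml', 'r', encoding='utf8', errors='ignore') as src, \
        open('clean.xml', 'w', encoding='utf8') as dst:
    shutil.copyfileobj(src, dst)

# ...then hand the clean file to the parser as bytes.
with open('clean.xml', 'rb') as f:
    data = f.read()  # or pass f to whatever bytes-consuming parser you're using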
In Python 2, these arguments to the built-in open() don't exist, but you can use io.open() instead. Alternatively, you can decode your 8-bit strings into unicode strings after reading them, but this is more error-prone in my opinion.
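For reference, here is the equivalent io.open() call, which works on both Python 2 and 3 ('filename' is again a placeholder):

import io

with io.open('filename', 'r', encoding='utf8', errors='ignore') as f:
    ...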
But it turns out OP doesn't have invalid UTF-8. OP has valid UTF-8 which happens to include control characters. Control characters are mildly annoying to filter out, since you have to run them through a function like this, meaning you can't just use copyfileobj():
import unicodedata

def strip_control_chars(data: str) -> str:
    return ''.join(c for c in data if unicodedata.category(c) != 'Cc')
Cc is the Unicode category for "Other, control", as described on the Unicode website. To include a slightly broader array of "bad characters," we could strip the entire "Other" category (which mostly contains useless stuff anyway):
def strip_control_chars(data: str) -> str:
    return ''.join(c for c in data if not unicodedata.category(c).startswith('C'))
This will filter out line breaks, so it's probably a good idea to process the file a line at a time and add the line breaks back in at the end.
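A rough sketch of that line-at-a-time approach, reusing strip_control_chars() from above (the file names are placeholders):

with open('infile.txt', 'r', encoding='utf8', errors='ignore') as src, \
        open('outfile.txt', 'w', encoding='utf8') as dst:
    for line in src:
        # strip the newline first so it isn't filtered away, then add it back
        dst.write(strip_control_chars(line.rstrip('\n')) + '\n')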
In principle, we could create a codec for doing this incrementally, and then we could use copyfileobj(), but that's like using a sledgehammer to swat a fly.