I would like to know how to remove all bytes which can't be decoded. Is there a solution?
This is simple:
with open('filename', 'r', encoding='utf8', errors='ignore') as f:
    ...
The errors='ignore' tells Python to drop unrecognized characters. It can also be passed to bytes.decode() and most other places which take an encoding argument.
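For example, the same flag works on bytes you already hold in memory (a toy illustration, not taken from the question):

raw = b'caf\xc3\xa9 \xff broken'  # one invalid byte (0xff) in otherwise valid UTF-8
print(raw.decode('utf8', errors='ignore'))  # the 0xff is silently dropped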
Since this decodes the bytes into unicode, it may not be suitable for an XML parser that wants to consume bytes. In that case, you should write the data back to disk (e.g. using shutil.copyfileobj()) and then re-open the file in 'rb' mode.
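A minimal sketch of that round trip (the file names dirty.xml and clean.xml are placeholders):

import shutil

# Decode with errors='ignore' and re-encode to a clean copy on disk...
with open('dirty.xml', 'r', encoding='utf8', errors='ignore') as src, \
        open('clean.xml', 'w', encoding='utf8') as dst:
    shutil.copyfileobj(src, dst)

# ...then hand the clean file to the parser as bytes.
with open('clean.xml', 'rb') as f:
    data = f.read()  # or pass f to whatever bytes-consuming parser you're using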
In Python 2, these arguments to the built-in open() don't exist, but you can use io.open() instead. Alternatively, you can decode your 8-bit strings into unicode strings after reading them, but this is more error-prone in my opinion.
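For reference, here is the equivalent io.open() call, which works on both Python 2 and 3 ('filename' is again a placeholder):

import io

with io.open('filename', 'r', encoding='utf8', errors='ignore') as f:
    ...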
But it turns out OP doesn't have invalid UTF-8. OP has valid UTF-8 which happens to include control characters. Control characters are mildly annoying to filter out, since you have to run them through a function like this, meaning you can't just use copyfileobj():
import unicodedata

def strip_control_chars(data: str) -> str:
    return ''.join(c for c in data if unicodedata.category(c) != 'Cc')
Cc is the Unicode category for "Other, control", as described on the Unicode website. To include a slightly broader array of "bad characters," we could strip the entire "Other" category (which mostly contains useless stuff anyway):
def strip_control_chars(data: str) -> str:
    return ''.join(c for c in data if not unicodedata.category(c).startswith('C'))
This will filter out line breaks, so it's probably a good idea to process the file a line at a time and add the line breaks back in at the end.
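A rough sketch of that line-at-a-time approach, reusing strip_control_chars() from above (the file names are placeholders):

with open('infile.txt', 'r', encoding='utf8', errors='ignore') as src, \
        open('outfile.txt', 'w', encoding='utf8') as dst:
    for line in src:
        # strip the newline first so it isn't filtered away, then add it back
        dst.write(strip_control_chars(line.rstrip('\n')) + '\n')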
In principle, we could create a codec for doing this incrementally, and then we could use copyfileobj(), but that's like using a sledgehammer to swat a fly.