Python convert binary file into string while ignoring non-ascii characters

Question

I have a binary file and I want to extract all ascii characters while ignoring non-ascii ones. Currently I have:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text))
   file.close

However, I'm encountering an error when writing to the file: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128). How would I get Python to ignore non-ASCII characters?

Are you sure that the file does not have unicode characters within? — dawg, May 08 '15 at 13:19
It looks like your input file is encoded as utf-16-le, so you should specify that encoding when you open the file. In Python 2 you need to use [codecs.open](https://docs.python.org/2/library/codecs.html#codecs.open), but in Python 3 you can use the normal built-in [open](https://docs.python.org/3/library/functions.html#open) — PM 2Ring, May 08 '15 at 13:35

score 5 · Accepted Answer · edited Dec 09 '18 at 23:15

Use the built-in ASCII codec and tell it to ignore any errors, like:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text.encode('ascii', 'ignore')))
   file.close()

You can test & play around with this in the Python interpreter:

>>> s = u'hello \u00a0 there'
>>> s
u'hello \xa0 there'

Just trying to convert to a string throws an exception.

>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...as does just trying to encode that unicode string to ASCII:

>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...but telling the codec to ignore the characters it can't handle works okay:

>>> s.encode('ascii', 'ignore')
'hello  there'
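
For readers on Python 3, the same idea might be sketched roughly as follows. This is not part of the original answer: the function name `extract_ascii` and its path arguments are hypothetical, and the sketch assumes the input really is UTF-16-LE, as in the question.

```python
def extract_ascii(src_path, dst_path):
    """Copy only the ASCII characters of a UTF-16-LE file to dst_path."""
    # Read the raw bytes and decode them to text (assumes UTF-16-LE input).
    with open(src_path, "rb") as src:
        text = src.read().decode("utf-16-le")

    # errors="ignore" drops every character the ASCII codec cannot encode.
    ascii_bytes = text.encode("ascii", errors="ignore")

    # Write the result as bytes, so no implicit re-encoding takes place.
    with open(dst_path, "wb") as dst:
        dst.write(ascii_bytes)
    return ascii_bytes
```

Writing in binary mode sidesteps the implicit ASCII encoding step that raised the error in the question. Control characters such as SOH and ACK are still ASCII, so they pass through this filter.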

edited Dec 09 '18 at 23:15

MagTun


answered May 08 '15 at 13:19

bgporter


Is there a predetermined range for what Python considers ascii? Output is still picking up characters such as SOH, ACK (not sure what these are, I'm just typing them as they appear in Sublime Text). – Helen Che May 08 '15 at 13:23
@VeraWang SOH and ACK are ASCII. The range is 0 to 127 and those are 1 and 6. – Stefan Pochmann May 08 '15 at 13:28
@VeraWang -- ASCII characters 0..31 are non-printable (including those two, see the charts on this wikipedia page about ASCII - http://en.wikipedia.org/wiki/ASCII#ASCII_printable_code_chart) Maybe more information on the actual problem you're trying to solve would be useful if this isn't giving you what you need... – bgporter May 08 '15 at 13:34

Spirine · Answer 2 · 2015-05-08T13:48:08.287

3

Basically, the ASCII table maps the values in the range [0, 2⁷) to characters (printable or not). So, to ignore non-ASCII characters, you just have to ignore characters whose code isn't in [0, 2⁷), i.e. whose code is greater than 127.

In Python, there is a built-in function called ord whose docstring reads:

Return the integer ordinal of a one-character string.

In other words, it gives you the code of a character. Now, you must ignore all characters that, passed to ord, return a value of 128 or greater. This can be done by:

with open(filename, 'rb') as fobj:
    text = fobj.read().decode('utf-16-le')
    out_file = open("text.txt", "w")

    # Check every single character of `text`
    for character in text:
        # If it's an ascii character
        if ord(character) < 128:
            out_file.write(character)

    out_file.close()

Now, if you want to keep only printable characters, note that all of them (in the ASCII table at least) lie between 32 (space) and 126 (tilde), so you simply change the test to:

if 32 <= ord(character) <= 126:
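
Wrapped up as a small helper, the printable-only filter might look like the sketch below. It uses Python 3 syntax, and the name `keep_printable` is made up for illustration. Note that it also drops newlines and tabs, since those are control characters below 32.

```python
def keep_printable(text):
    """Return text with every character outside ASCII 32..126 removed."""
    return "".join(ch for ch in text if 32 <= ord(ch) <= 126)


# SOH (1), the no-break space (160), DEL (127) and the newline are dropped.
sample = "a\x01b\xa0 ~\x7f\n"
print(repr(keep_printable(sample)))  # 'ab ~'
```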

edited May 08 '15 at 13:48

answered May 08 '15 at 13:25

Spirine


So if I only wanted ASCII *printable* characters [32, 127] it's a simple `ord(char) < 128 and ord(char) > 31`? – Helen Che May 08 '15 at 13:31
@VeraWang Almost (127 isn't printable), although `31 < ord(char) < 127` is simpler. – Stefan Pochmann May 08 '15 at 13:32
@VeraWang That's almost it! You've forgotten that 127 is the DELETE character, not printable, so the interval is now the closed [32, 126]: `ord(character) <= 126 and ord(character) >= 32` – Spirine May 08 '15 at 13:35
@StefanPochmann I just noticed it, sorry, but I can't make an edit of only four characters :( What can I do? – Spirine May 08 '15 at 13:37
Geez what a sucky rule. Maybe add some long random word and then remove it in another edit? – Stefan Pochmann May 08 '15 at 13:38
Or change to `32 <= ord(character) <= 126`, as that's apparently what she wants anyway. That should be enough change then. – Stefan Pochmann May 08 '15 at 13:39
You keep doing that as `if ord(character) >= 32 and ord(character) <= 126`... why? – Stefan Pochmann May 08 '15 at 13:46
