I have a binary file and I want to extract all ASCII characters while ignoring non-ASCII ones. Currently I have:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text))
   file.close

However, I'm encountering an error when writing to the file: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128). How can I get Python to ignore non-ASCII characters?

Helen Che
  • 1
    Are you sure that the file does not have unicode characters within? – dawg May 08 '15 at 13:19
  • It looks like your input file is encoded as utf-16-le, so you should specify that encoding when you open the file. In Python 2 you need to use [codecs.open](https://docs.python.org/2/library/codecs.html#codecs.open), but in Python 3 you can use the normal built-in [open](https://docs.python.org/3/library/functions.html#open) – PM 2Ring May 08 '15 at 13:35
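
A rough sketch of what this comment suggests, assuming the input really is UTF-16-LE. It uses io.open (available in Python 2.6+ and identical to the built-in open in Python 3) rather than codecs.open, and the helper name copy_ascii is hypothetical:

```python
import io


def copy_ascii(src_path, dst_path):
    """Copy a UTF-16-LE text file to dst_path, keeping only ASCII characters.

    Hypothetical helper for illustration; assumes the input is UTF-16-LE.
    """
    # Let the file object decode UTF-16-LE while reading
    with io.open(src_path, encoding='utf-16-le') as src:
        text = src.read()

    # errors='ignore' makes the ASCII encoder silently drop
    # any character it cannot represent
    with io.open(dst_path, 'w', encoding='ascii', errors='ignore') as dst:
        dst.write(text)
```

Passing the encoding to the file object moves the decode and encode steps out of your own code, so no explicit .decode() or .encode() call is needed.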

2 Answers

Use the built-in ASCII codec and tell it to ignore any errors, like:

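with open(filename, 'rb') as fobj:
    text = fobj.read().decode('utf-16-le')

# Encode to ASCII, dropping every character the codec can't represent,
# and write the resulting bytes out in binary mode
with open("text.txt", "wb") as out_file:
    out_file.write(text.encode('ascii', 'ignore'))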

You can test & play around with this in the Python interpreter:

>>> s = u'hello \u00a0 there'
>>> s
u'hello \xa0 there'

Just trying to convert to a string throws an exception.

>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...as does just trying to encode that unicode string to ASCII:

>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...but telling the codec to ignore the characters it can't handle works okay:

>>> s.encode('ascii', 'ignore')
'hello  there'
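
The same idea can be wrapped in a small helper that works purely on bytes. This is a minimal sketch: the function name utf16le_to_ascii is made up for illustration, and it assumes the input is UTF-16-LE as in the question. It should behave the same on Python 2 and Python 3:

```python
def utf16le_to_ascii(data):
    """Decode UTF-16-LE bytes and return only the ASCII part, as bytes.

    Hypothetical helper; assumes `data` is UTF-16-LE encoded.
    """
    text = data.decode('utf-16-le')
    # 'ignore' drops every character outside the ASCII range (0..127)
    return text.encode('ascii', 'ignore')


sample = u'hello \u00a0 there'.encode('utf-16-le')
print(utf16le_to_ascii(sample))
```

Note that control characters such as SOH or ACK are ASCII too, so they pass through this filter unchanged.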
MagTun
bgporter
  • Is there a predetermined range for what Python considers ASCII? The output is still picking up characters such as SOH and ACK (not sure what these are, I'm just typing them as they appear in Sublime Text). – Helen Che May 08 '15 at 13:23
  • 1
    @VeraWang SOH and ACK are ASCII. The range is 0 to 127 and those are 1 and 6. – Stefan Pochmann May 08 '15 at 13:28
  • 2
    @VeraWang -- ASCII characters 0..31 are non-printable (including those two, see the charts on this wikipedia page about ASCII - http://en.wikipedia.org/wiki/ASCII#ASCII_printable_code_chart) Maybe more information on the actual problem you're trying to solve would be useful if this isn't giving you what you need... – bgporter May 08 '15 at 13:34

Basically, the ASCII table maps values in the range [0, 2^7), i.e. 0 to 127, to characters (printable or not). So, to ignore non-ASCII characters, you just have to drop every character whose code falls outside that range, i.e. is greater than 127.

Python has a built-in function called ord which, according to its docstring, will:

Return the integer ordinal of a one-character string.

In other words, it gives you the code of a character. So you must ignore every character for which ord returns a value of 128 or greater. This can be done with:

with open(filename, 'rb') as fobj:
    text = fobj.read().decode('utf-16-le')

# The with block closes the output file automatically
with open("text.txt", "w") as out_file:
    # Check every single character of `text`
    for character in text:
        # Keep it only if it's an ASCII character
        if ord(character) < 128:
            out_file.write(character)

Now, if you only want to keep printable characters, note that all of them (in the ASCII table at least) lie between 32 (space) and 126 (tilde), so you simply check:

if 32 <= ord(character) <= 126:
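
Put together, the printable-only check might look like the sketch below. The function name printable_ascii is hypothetical, and it operates on text that has already been decoded:

```python
def printable_ascii(text):
    """Return only the printable ASCII characters (codes 32..126) of `text`.

    Hypothetical helper; `text` is an already-decoded (unicode) string.
    """
    return u''.join(ch for ch in text if 32 <= ord(ch) <= 126)


print(printable_ascii(u'a\x01b\u00a0c~ '))
```

Be aware that this range also drops newlines and tabs (codes 10 and 9). If you want to keep those, the condition would need to allow them explicitly.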
Spirine