How to efficiently remove non-ASCII characters and numbers, but keep accented ASCII characters

Question

I have several strings like this:

s = u'awëerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440bròn 1990 23x4 + &23 \'we\' we\'s mexicqué'
s
"awëerwq مرحباмир bròn 1990 23x4 + &23 'we' we's mexicqué"

I couldn't find a way to remove non-printable things like 'مرحباми' while keeping Latin characters like 'óé,...'. Numbers (like '1990') are also undesirable in my case. I have used the ASCII flag from re, but it removes 'óëé,...' and I don't know what's wrong with that. I have the same problem with string.printable. I also don't know why I get this:

ord('ë')
235

In the ASCII table, it is assigned 137. The result I would expect is something like this:

x = some_method(s)
"awëerwq bròn 23x4 we we s mexicqué"

I would also like the code not to depend on any particular encoding.

It looks like you want characters with `ord(c) < 256`, but excluding characters like `+` and `'` as well. You might be best off with a hardcoded string with all the characters you want to keep and then just doing `''.join(c for c in s if c in okay_chars)`. — TigerhawkT3, Nov 22 '15 at 01:35
Sorry, I misread the question. You want to preserve all characters used in code page 437, not ASCII, but selectively remove numbers. `ë` is 235 because that is its unicode value. 137 is its value in code page 437. — ayane_m, Nov 22 '15 at 01:36
Wait, if you don't want numbers, how did `23x4` make the cut? — TigerhawkT3, Nov 22 '15 at 01:57
What makes you think Arabic and Cyrillic are "non-printable"? There must be a better phrase for that. — Jongware, Nov 22 '15 at 02:02
After reading @Martin's post, I'm now aware of my terminological mistake. Thank you all. — Nacho, Nov 22 '15 at 06:07

score 2 · Accepted Answer · answered Nov 22 '15 at 01:54

Here's a way that might help (Python 3.4):

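import unicodedata

def remove_nonlatin(s):
    # Keep characters whose Unicode name starts with LATIN, DIGIT, or SPACE.
    # The '' default avoids a ValueError for unnamed characters such as '\n'.
    s = (ch for ch in s
         if unicodedata.name(ch, '').startswith(('LATIN', 'DIGIT', 'SPACE')))
    return ''.join(s)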

>>> s = 'awëerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440bròn 1990 23x4 + &23 \'we\' we\'s mexicqué'
>>> remove_nonlatin(s)
'awëerwqbròn 1990 23x4  23 we wes mexicqué'

This grabs the Unicode names of the characters in the string and keeps the characters whose names start with LATIN, DIGIT, or SPACE.

For example, this would match:

>>> unicodedata.name('S')
'LATIN CAPITAL LETTER S'

And this would not:

>>> unicodedata.name('م')
'ARABIC LETTER MEEM'

I'm reasonably sure that Latin characters all have Unicode names starting with 'LATIN', so this should filter out other writing scripts while keeping digits and spaces. There's no convenient one-liner for punctuation, so in this example exclamation points and the like are also filtered out.
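The expected output in the question turns removed characters into spaces ("we's" becomes "we s") and leaves out numbers. A minimal sketch of that variant follows, using the same name-based test. The function name `keep_latin` and the `keep_digits` parameter are made up for illustration, and replacing rejected characters with a space is an assumption about what the asker wants.

```python
import unicodedata

def keep_latin(s, keep_digits=False):
    """Replace every character that is not Latin (or a space) with a space."""
    prefixes = ('LATIN', 'SPACE', 'DIGIT') if keep_digits else ('LATIN', 'SPACE')
    # unicodedata.name(ch, '') returns '' instead of raising ValueError
    # for unnamed characters such as '\n'.
    spaced = ''.join(
        ch if unicodedata.name(ch, '').startswith(prefixes) else ' '
        for ch in s
    )
    # Collapse the runs of spaces left behind by the replaced characters.
    return ' '.join(spaced.split())

s = "awëerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440bròn 1990 23x4 + &23 'we' we's mexicqué"
print(keep_latin(s))
print(keep_latin(s, keep_digits=True))
```

With digits dropped, "23x4" is reduced to "x", which differs from the sample output in the question. Passing `keep_digits=True` keeps "23x4" but also keeps "1990".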

You could presumably filter by code point by using something like ord(c) < 0x250, though you may get some things that you aren't expecting. Or, you could try filtering by unicodedata.category. However, the 'letter' category includes letters from a lot of scripts, so you will still end up with some of these: 'م'.
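To illustrate both caveats, here is a small sketch that runs the two alternative filters over four sample characters. The sample list is made up for illustration: an accented Latin letter, the multiplication sign, an Arabic letter, and a digit.

```python
import unicodedata

# ò (Latin letter), × (U+00D7, a math symbol), م (Arabic letter), 7 (digit)
chars = ['\u00f2', '\u00d7', '\u0645', '7']

# Code-point filter: everything below U+0250 passes, including the
# multiplication sign and the digit, which are not letters.
by_code_point = [c for c in chars if ord(c) < 0x250]

# Category filter: the 'L*' categories cover letters of every script,
# so the Arabic letter passes as well.
by_category = [c for c in chars if unicodedata.category(c).startswith('L')]

print(by_code_point)
print(by_category)
```

The code-point filter lets through the symbol and the digit, and the category filter lets through the Arabic letter. Neither one isolates Latin letters on its own.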

Thank you, Seth... I only added a conditional yield to keep spaces. — Nacho, Nov 22 '15 at 03:58

Martin Konecny · Answer 2 · 2015-11-22T01:37:13.103

I have used ASCII flag from re but I don't know what's wrong with that because it removes 'óëé,...'.

I think you are asking your question wrong. ASCII does not have the characters óëé in it. Take a look here to see the set of all ASCII characters and see how basic it is:

https://en.wikipedia.org/wiki/ASCII#ASCII_printable_code_chart

It appears that the string you are using is in Unicode, since it can hold both "مرحباми" and "óëé" at the same time.
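A short Python 3 sketch makes the distinction concrete. It shows the Unicode code point of 'ë', its byte value in code page 437 (the 137 mentioned in the comments on the question), and what the re.ASCII flag does to \w. The variable names are only for illustration.

```python
import re

ch = 'ë'
code_point = ord(ch)               # Unicode code point
cp437_byte = ch.encode('cp437')[0] # byte value in code page 437
in_ascii = ord(ch) < 128           # ASCII only covers 0..127

word = 'bròn'
unicode_match = re.findall(r'\w+', word)          # \w is Unicode-aware by default
ascii_match = re.findall(r'\w+', word, re.ASCII)  # \w limited to [a-zA-Z0-9_]

print(code_point, cp437_byte, in_ascii)
print(unicode_match, ascii_match)
```

This prints `235 137 False` and then `['bròn'] ['br', 'n']`. With re.ASCII the accented letter no longer counts as a word character, which is why the flag appears to remove 'óëé'.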

In that case, you can find the character ranges you want using

http://jrgraphix.net/research/unicode_blocks.php

and include only the Latin ones (this will filter out Arabic characters for example).

Here's an example:

import re
s = u"مرحباми123"

# prints "123" by keeping all characters from the following ranges:
# 0020 — 007F   Basic Latin
# 00A0 — 00FF   Latin-1 Supplement
# 0100 — 017F   Latin Extended-A
# 0180 — 024F   Latin Extended-B
print(''.join(re.findall(r'[\u0020-\u007F\u00A0-\u00FF\u0100-\u017F\u0180-\u024F]+', s)))
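The ranges above also keep digits and punctuation, which the question wants removed. As a sketch under that assumption, the character class can be narrowed to letters only and every other run replaced with a single space. The names `NOT_LATIN_LETTER` and `latin_words` are hypothetical, and the code assumes Python 3, where the re module understands \uXXXX escapes in patterns.

```python
import re

# Letters only: A-Z, a-z, and U+00C0..U+024F without the two math signs
# U+00D7 (multiplication) and U+00F7 (division).
NOT_LATIN_LETTER = re.compile(
    r'[^A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+'
)

def latin_words(s):
    # Replace each run of disallowed characters with one space,
    # then trim spaces left at either end.
    return NOT_LATIN_LETTER.sub(' ', s).strip()

s = "awëerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440bròn 1990 23x4 + &23 'we' we's mexicqué"
print(latin_words(s))
```

Compiling the pattern once avoids re-parsing it when many strings are processed. The exact ranges to keep remain a judgment call and can be widened or narrowed using the block chart linked above.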

Thank you @Martin, your post is very insightful. I followed Seth's suggestion, keeping in mind the differences you specified. — Nacho, Nov 22 '15 at 06:05
