I have a file called messages.txt which consists of many sentences, one per line. I am attempting to exclude the lines that contain non-alpha characters (I only want those that contain characters from A-Z).

import re
import string

# Read the file and strip trailing whitespace from each line
lines = [line.rstrip() for line in open('messages.txt', encoding='utf-8')]

# Strip common punctuation before testing each line
cleaned_lines = [s.replace("!", "").replace(".", "").replace("?", "").replace(",", "") for s in lines]

output_lines = []

for line in cleaned_lines:
    # Keep only lines that are entirely alphabetic once spaces are ignored
    if line.replace(' ', '').isalpha():
        output_lines.append(re.sub(r'\W+', '', line.lower()))

# Collect the distinct characters that survive the filter
chars = sorted(set(''.join(output_lines)))
print(chars)

Output:

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'ª', 'â', 'ã', 'å', 'ð', 'ÿ', 'œ', 'š', 'ž', 'ƒ', 'ˆ']

As can be seen, the isalpha() method does not appear to be excluding the strange

'â', 'ã', 'å', 'ð', 'ÿ'

characters. I have a feeling that this may be due to the encoding the file is being read in; however, I would have assumed that the isalpha method, in conjunction with the \W+ regex pattern, would be able to filter out these characters.
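
For reference, a minimal check with a hypothetical accented string illustrates the behavior: isalpha() accepts any Unicode letter, and \W only matches non-word characters, so neither removes the accented letters.

import re
print("hållo".isalpha())             # True: 'å' counts as a Unicode letter
print(re.sub(r'\W+', '', "hållo"))   # hållo: 'å' is a word character, so \W+ leaves it in place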

Is this intentional? If so, what methods can be used to remove these strange characters?

Rishab Jain
  • It's always a good idea to look at the official documentation, [Python isalpha](https://docs.python.org/3/library/stdtypes.html#str.isalpha) clearly mentions it returns `True` for non-empty strings whose characters are defined as letters in Unicode – ThePyGuy Jun 15 '21 at 03:27
  • You could use `isascii()` to filter out strings with non-ascii characters? – Iain Shelvington Jun 15 '21 at 03:28
  • Thank you both for the comments. Based on the documentation, that looks correct! It seems that the correct solution in my case will be to use Iain Shelvington's suggestion and perhaps check for `isascii()` and `isalpha()` in conjunction, while still maintaining the utf-8 encoding. – Rishab Jain Jun 15 '21 at 03:33
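
A quick sketch of the check suggested in the comments (str.isascii() is available in Python 3.7+; the sample strings are hypothetical):

print("Hello".isascii())   # True: every character is in the ASCII range
print("Hållo".isascii())   # False: 'å' falls outside the ASCII range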

2 Answers


Based on my local testing with a UTF-8 encoded Python script, isalpha() was returning False for inputs containing accented characters:

# -*- coding: utf-8 -*-
inp1 = "Hello"
inp2 = "Hållo"
print(inp1.isalpha())  # True
print(inp2.isalpha())  # False

In any case, if you want to filter out any line containing a character other than ASCII letters and digits, just use re.search in your initial list comprehension:

lines = [line.rstrip() for line in open('messages.txt', encoding='utf-8') if not re.search(r'[^A-Za-z0-9]', line.rstrip())]
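
As a quick illustration with hypothetical sample strings, the condition keeps purely ASCII-alphanumeric lines and rejects the rest:

import re
for sample in ["HelloWorld", "Hållo", "abc123"]:
    print(sample, not re.search(r'[^A-Za-z0-9]', sample))
# HelloWorld True
# Hållo False
# abc123 True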
Tim Biegeleisen
  • Thank you. In my testing, I find a difference. `inp2` returns True. I believe this is because of the `utf-8` encoding in which I am retrieving the file. – Rishab Jain Jun 15 '21 at 03:34
  • Did you declare your Python script to be using UTF-8? – Tim Biegeleisen Jun 15 '21 at 03:35
  • Ah, yes, I believe I am. However, my text editor may not support this syntax. Thus, I am receiving a different result. Upon testing it in a different text editor, I get the same result as you. Thanks. – Rishab Jain Jun 15 '21 at 03:36

When you read a file encoded as UTF-8 with:

lines = [line.rstrip() for line in open('messages.txt', encoding='utf-8')]

The data in lines consists of Unicode strings. Depending on the OS/editor used, the accented characters can be "composed" (a single code point for an accented letter) or "decomposed" (two code points, a base letter plus a combining accent).

You can force the form that works for you:

import unicodedata as ud
# Compare a plain ASCII string with the NFC (composed) and NFD (decomposed) forms
inp = "Hello", ud.normalize('NFC', "Hållo"), ud.normalize('NFD', "Hållo")
for i in inp:
    print(i, ascii(i), i.isalpha(), i.isascii())

Output. Notice the ascii() function shows the accented å either as the single code point \xe5 or as the pair a\u030a:

Hello 'Hello' True True
Hållo 'H\xe5llo' True False
Hållo 'Ha\u030allo' False False

To find only ASCII letters, test with both isalpha() and isascii().
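
Applied to the question's loop, a minimal sketch (assuming the cleaned_lines list from the question, and Python 3.7+ for str.isascii()) might look like:

output_lines = []
for line in cleaned_lines:
    stripped = line.replace(' ', '')
    # Require that every character is a letter AND every character is ASCII
    if stripped.isalpha() and stripped.isascii():
        output_lines.append(line.lower())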

Mark Tolonen