I have a file called messages.txt
which consists of many sentences, one per line. I am attempting to exclude the lines that contain non-alphabetic characters (I only want the lines made up of characters A-Z).
import re

lines = [line.rstrip() for line in open('messages.txt', encoding='utf-8')]

# Strip common punctuation before checking for alphabetic content
cleaned_lines = [s.replace("!", "").replace(".", "").replace("?", "").replace(",", "") for s in lines]

output_lines = []
for line in cleaned_lines:
    # Ignore spaces; keep only lines whose remaining characters are all letters
    if line.replace(' ', '').isalpha():
        output_lines.append(re.sub(r'\W+', '', line.lower()))

chars = sorted(set(''.join(output_lines)))
print(chars)
Output:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'ª', 'â', 'ã', 'å', 'ð', 'ÿ', 'œ', 'š', 'ž', 'ƒ', 'ˆ']
As can be seen, the isalpha() method is not excluding the strange characters 'â', 'ã', 'å', 'ð', 'ÿ', and so on. I suspect this may be due to the encoding the file is read with; however, I would have assumed that isalpha() in conjunction with the \W+ regex would filter these characters out.
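To illustrate, a quick isolated check shows that both isalpha() and the \w character class are Unicode-aware in Python 3, so accented letters pass both tests:

import re

# Both checks are Unicode-aware, so accented letters pass them
print('â'.isalpha())            # True: 'â' counts as a letter
print(re.sub(r'\W+', '', 'â'))  # 'â' survives: \W does not match Unicode word characters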
Is this intentional? If so, what methods can be used to remove these strange characters?
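For reference, something along these lines is the kind of filter I am after (a rough sketch; the helper name is_plain_ascii and the sample data are purely illustrative):

import string

ASCII_LETTERS_AND_SPACE = set(string.ascii_letters + ' ')

def is_plain_ascii(line):
    # True only if every character in the line is an ASCII letter or a space
    return all(ch in ASCII_LETTERS_AND_SPACE for ch in line)

sample = ['Hello world', 'Café open', 'naïve question']
print([s for s in sample if is_plain_ascii(s)])  # ['Hello world']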