Here is a solution that handles every case I could think to throw at it.
The code loops over each character in a word and checks whether its Unicode code point is greater than 127 (i.e. the character is non-ASCII). If it is, the code notes whether the character sits at the start, end, or middle of the word and pads it with whitespace accordingly, so that split() can later separate multiple special characters throughout the word.
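To illustrate the padding step on a single word (this snippet is only a sketch of the core idea, not the full solution):

word = "test™String"
# Pad each non-ASCII character with spaces so split() can tokenise it
padded = "".join(c if ord(c) <= 127 else " " + c + " " for c in word)
print(padded.split())  # ['test', '™', 'String']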
In the first part of your question you want the output in list form, with each item as a token. Later you request the output in string form, with the special characters (superscripts) as separate tokens within the string. The solution below supports both: comment/uncomment the lines that use temp_text_lst or temp_text_str, depending on whether you want a tokenised list or a plain string.
Solution
texts = [
    "Sentence testString™", "Sentence ™testString", "Sentence test™String",
    "Sentence testString™™", "Sentence ™™testString", "Sentence test™™String",
    "Sentence testString™™™", "Sentence ™™™testString", "Sentence test™™™String",
    "Sentence ™test™String", "Sentence test™String™", "Sentence ™test™String™",
    "Sentence ᵟteᶿstStr™™ingᵝ", "™ testString"
]

processed_texts = []
for text in texts:
    # Comment/uncomment the occurrences of 'temp_text_lst' / 'temp_text_str'
    # throughout this loop to switch the output format
    temp_text_lst = []
    # temp_text_str = ""
    for word in text.split():
        temp_word = ""
        found_non_alpha_num = 0
        for i, char in enumerate(word):
            # Condition to detect & separate special chars attached to the word
            if ord(char) > 127:
                if i == 0:
                    temp_word = char + " " + temp_word
                elif i == len(word) - 1:
                    temp_word = temp_word + " " + char
                else:
                    temp_word = temp_word + " " + char + " "
                found_non_alpha_num = 1
            else:
                temp_word += char
        if found_non_alpha_num:
            for token in temp_word.split():
                temp_text_lst.append(token)
                # temp_text_str = temp_text_str + token + " "
        else:
            temp_text_lst.append(temp_word)
            # temp_text_str = temp_text_str + temp_word + " "
    # Append the processed text
    processed_texts.append(temp_text_lst)
    # Append the processed text without the redundant whitespace after the last token
    # processed_texts.append(temp_text_str[:-1])

for text in processed_texts:
    print(text)
Output (list format)
['Sentence', 'testString', '™']
['Sentence', '™', 'testString']
['Sentence', 'test', '™', 'String']
['Sentence', 'testString', '™', '™']
['Sentence', '™', '™', 'testString']
['Sentence', 'test', '™', '™', 'String']
['Sentence', 'testString', '™', '™', '™']
['Sentence', '™', '™', '™', 'testString']
['Sentence', 'test', '™', '™', '™', 'String']
['Sentence', '™', 'test', '™', 'String']
['Sentence', 'test', '™', 'String', '™']
['Sentence', '™', 'test', '™', 'String', '™']
['Sentence', 'ᵟ', 'te', 'ᶿ', 'stStr', '™', '™', 'ing', 'ᵝ']
['™', 'testString']
Output (string format)
Sentence testString ™
Sentence ™ testString
Sentence test ™ String
Sentence testString ™ ™
Sentence ™ ™ testString
Sentence test ™ ™ String
Sentence testString ™ ™ ™
Sentence ™ ™ ™ testString
Sentence test ™ ™ ™ String
Sentence ™ test ™ String
Sentence test ™ String ™
Sentence ™ test ™ String ™
Sentence ᵟ te ᶿ stStr ™ ™ ing ᵝ
™ testString
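For what it's worth, the same padding idea can be written more compactly with the standard-library re module. This is only a sketch (the helper name separate_special is mine), assuming "special character" means any non-ASCII code point:

import re

def separate_special(text):
    # Wrap every non-ASCII character in spaces, then let split()
    # collapse the extra whitespace; the same trick as the loop above
    return re.sub(r"([^\x00-\x7f])", r" \1 ", text).split()

print(separate_special("Sentence ᵟteᶿstStr™™ingᵝ"))
# ['Sentence', 'ᵟ', 'te', 'ᶿ', 'stStr', '™', '™', 'ing', 'ᵝ']

If you want the string form instead, join the result with " ".join(...).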